Refactoring my personal Git repository

People usually talk about refactoring when they’re talking about code. Refactoring generally involves reorganizing and restructuring code so that it maintains the same external functionality, but is better in some non-functional way. In most cases refactoring results in code that is better structured, easier to read and understand, and on the whole easier to work with. Now that my exams are over, I decided to undertake a refactoring of my own. This time I didn’t refactor code, but rather my entire personal source code repository.

About a year ago I started keeping all my files under version control. I had two Subversion repositories, one for my code and another for non-code files (mostly school papers). A few months ago I moved from Subversion to Git, but my workflow and repository setup were essentially the same. When I moved to Git, I had a chance to change my repo structure, but I decided to keep it. The single repo would serve as a single storage unit for all my code. Development for each project would take place on separate branches which would be routinely merged back into the main branch. The files in the repo were divided into directories based on the programming language they were written in. Probably not the most scientific classification scheme, but it worked well enough.

Fast forward a few months and things aren’t quite as rosy. It turns out that having everything in one repo isn’t really a good idea after all. The single most significant reason is that the history is a complete mess. Looking back at the log I have changes to my main Python project mixed in with minor Emacs configuration changes, as well as any random little experiment that I did and happened to commit. Not very tidy. Secondly, using a separate branch for each project didn’t quite work. I’d often forget which branch I had checked out and start hacking on some random thing. If I was lucky I could switch branches before committing and put the changes where they belonged. If I was unlucky, I was faced with the prospect of moving changes between branches and cleaning up the history. Not something I enjoyed doing. Finally, organization by language wasn’t a good scheme, especially since I took a course in programming languages and wanted to save the textbook exercises for each language. The result is that now I have a number of folders with just 2-3 files in them and I won’t be using those languages for a while. More importantly, getting to my important project folders meant digging 3-4 levels down from my home directory.

I decided last week that things had to change. I needed a new organization system that satisfies the following requirements:

  1. Histories of my main projects are untangled.
  2. My main projects stand out clearly from smaller projects and random experiments.
  3. If I start a small project and it gets bigger it should be easy to give it main project status.
  4. An archive for older projects that I won’t be touching again (or at least not for the foreseeable future).
  5. Some way to keep my schoolwork separate from the other stuff.
  6. Everything is version controlled and I should be able to keep old history.

I’ve used a combination of Git’s repository splitting functionality and good old directory organization to make things cleaner. Everything is still tucked into a top-level src directory, but that’s where the similarities with my old system end. Each major project is moved to its own repo. Since I already had each major project in its own subdirectory, I could use Git’s filter-branch command to cleanly pull them out while retaining history. Every active project gets its own subdirectory under ~/src which has a working copy of the repo. There is a separate archive subdirectory which contains the most recent copy of the projects that I’ve decided to file away. I technically don’t need this since the repositories are stored on my personal server, but I like having all my code handy.

I pooled together all my experimentation into a single repo called scratch. This also gets its own subdirectory under src. It currently holds a few simple classes I wrote while trying out Scala, some assembly code and a few Prolog files. My schoolwork also gets a separate repo and subdirectory. This contains code for labs in each class as well as textbook exercises (with subdirectories for each class and book). Large projects get their own repo and aren’t part of this schoolwork repo. Since I’m on break they’re all stashed under archive.

The process of actually splitting the repo into the new structure was smooth for the most part. I followed the steps outlined by this Stack Overflow answer to extract each of my main projects into its own repo. I cloned my local repo to create the individual repos, but I still had to set up remotes for each of them on my server. I followed a really good guide to set up the remotes, but first I had to remove the existing remotes (which pointed to the local repo I had cloned from). A simple git remote rm origin took care of that.
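For anyone who wants to try the same thing, here is a rough sketch of that extraction on a throwaway repo. It is not the exact sequence from the Stack Overflow answer; the directory names (everything, python/mainproject), the placeholder commit identity, and the remote URL are all made up for illustration, and nothing is actually contacted over the network.

```shell
# Placeholder identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
# Newer Git versions print a warning and pause before filter-branch runs;
# this variable skips that. Older versions simply ignore it.
export FILTER_BRANCH_SQUELCH_WARNING=1

work=$(mktemp -d)

# Stand-in for the old all-in-one repo, with interleaved history.
git init -q "$work/everything"
cd "$work/everything"
mkdir -p python/mainproject emacs
echo "print('v1')" > python/mainproject/app.py
git add . && git commit -q -m "mainproject: first version"
echo "(setq x 1)" > emacs/init.el
git add . && git commit -q -m "emacs: tweak config"
echo "print('v2')" > python/mainproject/app.py
git add . && git commit -q -m "mainproject: second version"

# 1. Clone the big repo; the clone becomes the project's own repo.
git clone -q "$work/everything" "$work/mainproject"
cd "$work/mainproject"

# 2. Rewrite history so python/mainproject becomes the repo root.
#    Commits that never touched that directory are dropped.
git filter-branch --subdirectory-filter python/mainproject -- --all

# 3. Remove the remote that still points at the local repo we cloned from.
git remote rm origin

# 4. Add the new home on the server (hypothetical URL).
git remote add origin ssh://domain.com/path/to/mainproject.git

# Only the mainproject commits should be listed now.
git log --oneline
```

The original everything repo is left untouched by all of this, so it is easy to throw the clone away and try again if the result doesn't look right.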

Things started to get a little more complicated when it came to extracting things that were spread out (and going into scratch). I wasn’t sure if filter-branch could let me do the kind of fine-tuned extraction and pooling I wanted to do. So I decided instead to create a scratch directory in my existing repo and then make that into a separate repo. I used the same process for extracting code that would go into my schoolwork repo.
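Roughly, the pooling approach looks like the sketch below, again on a throwaway repo. The file names and the placeholder commit identity are invented for the example; only the idea (move things into scratch with an ordinary commit, then extract that directory) comes from what I described above.

```shell
# Placeholder identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the pause in newer Git versions

work=$(mktemp -d)

# Stand-in for the old repo with experiments scattered by language.
git init -q "$work/everything"
cd "$work/everything"
mkdir -p scala prolog
echo 'object Hello' > scala/Hello.scala
echo 'likes(me, git).' > prolog/facts.pl
git add . && git commit -q -m "Random experiments"

# Pool the scattered directories into scratch/ as an ordinary commit.
mkdir scratch
git mv scala prolog scratch/
git commit -q -m "Move experiments into scratch/"

# Then extract scratch/ exactly like any other project directory.
git clone -q "$work/everything" "$work/scratch"
cd "$work/scratch"
git filter-branch --subdirectory-filter scratch -- --all
git remote rm origin

# The pooled directories are now at the top level of the new repo.
ls
```

One caveat, as far as I understand how the subdirectory filter works: the new repo only keeps commits that touched scratch/, so the pooled files start their history at the move commit. The older history is still in the original repo if it is ever needed.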

The whole process took a little over 2 hours with the third Pirates of the Caribbean movie playing at the same time. I’m considering doing the same thing with my non-code repo, though I’ll need to think out a different organization structure for that. Things were made a lot easier and faster by the two guides I found, and now that I have a good idea of what needs to be done, I’ll probably have an easier time next time around. I’ve come to learn a little more about the power and flexibility of Git. I still think I’m a Git newbie, but at least I know one more trick now. If any of you Git power users have any suggestions or alternate ways to get the same effect, do let me know.

A week with Git

It’s been almost a week since I moved to Git for my version control instead of Subversion. It’s been an interesting experience with some amount of learning to do. There are a number of extra things to keep in mind when using Git, but as time goes by, I’m quickly becoming used to them and they don’t seem bothersome at all.

The first major difference is going from centralized to distributed. I use version control to keep my files in sync across multiple computers as well. With Subversion a commit would save my changes and also push them to my remote server. With Git, that is now two separate processes. I would have liked to be able to have them both as a single step, but honestly, it really isn’t much of a problem. I’ve become used to doing a git push just before I leave my computer.

Once I got past that, the next thing that I had to deal with was the fact that a commit even to the local repo is actually two separate operations. You first have to add any changes made to something called the Index. It’s easiest to think of the Index as a sort of staging ground for your changes. You can selectively add files to the Index, building up a coherent set of changes, and then commit them together. The Index can take some getting used to, but it is easily one of Git’s killer features. Often enough, you end up editing a bunch of completely different files. They could be files in different projects, or files in the same project that do totally different things. When it comes time to commit, you’d like to be able to commit the different changes separately so that you can track exactly what changes you made by project. The Index lets you do exactly that. In fact, Git will actually let you select which set of changes to a particular file you want to add to the Index in a given add operation if you do a git add --patch filename. This gives you an unprecedented level of fine-grained control over how the change history of your files is saved. I haven’t had a chance to use this level of power yet, but as the semester progresses and I get involved in a large project, I don’t think it’ll be long before Git saves the day. See this post for a more detailed discussion of how useful Git’s Index is.
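Here is a small sketch of what that looks like in practice: two unrelated edits made in one sitting, committed separately by staging one file at a time. The repo, file names and placeholder commit identity are made up. I’ve left git add --patch out because it is interactive and doesn’t fit in a script.

```shell
# Placeholder identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"

echo "parser v1" > parser.txt
echo "notes v1" > notes.txt
git add . && git commit -q -m "Initial commit"

# Two unrelated edits made in the same sitting.
echo "parser v2" > parser.txt
echo "notes v2" > notes.txt

# Stage only the parser change and commit it on its own...
git add parser.txt
git commit -q -m "Update parser"

# ...then stage and commit the notes change separately.
git add notes.txt
git commit -q -m "Update notes"

# Each change now has its own entry in the history.
git log --oneline
```

Running git status between the add and the commit is a handy way to see what is sitting in the Index and what is still only in the working directory.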

Even if all the changes to the change history could be made only in the Index, that would be an incredible amount of power. But Git’s power stretches beyond the Index to the commit operation itself. The git commit --amend operation folds the changes in the Index into the previous commit, which comes in handy when you forget a change or realize that you need to change something else. And if that isn’t enough, the git rebase --interactive command basically hands you a nuclear-powered Swiss Army chainsaw with which you can go to town on your repository. Needless to say, this kind of power isn’t to be used without some amount of care and consideration.
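A minimal sketch of the amend case, using a throwaway repo with an invented file name and a placeholder commit identity: a commit is made, a forgotten change is staged, and --amend folds it in so the history still shows a single commit.

```shell
# Placeholder identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"

echo "step one" > todo.txt
git add todo.txt
git commit -q -m "Add todo list"

# Oops, forgot a line. Stage the fix and fold it into the previous commit.
echo "step two" >> todo.txt
git add todo.txt
git commit -q --amend --no-edit   # --no-edit reuses the existing message

# Still a single commit, now containing both lines.
git log --oneline
git show HEAD:todo.txt
```

Under the hood the amended commit is a new commit with a new hash that replaces the old one, which is why it is best reserved for commits that haven’t been pushed anywhere yet.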

Git’s control over committing is impressive, but equally impressive are Git’s branching capabilities. In Subversion, a branch means copying the current files to another subdirectory in the same repository. This means that all branches are visible at all times and that anyone checking out the full repository will get all branches. This isn’t always what you want. Sometimes you don’t want everyone to see your changes until they’re ready for prime time, and you might want new developers to get the production branch and then later pull additional branches as needed. Since Git is in many ways a filesystem that supports tracking changes, the branch command works differently. A branch doesn’t require you to create a subdirectory in the same repo. Instead it creates a new information tree in Git’s internal filesystem. So you can have separate branches with very different directory structures, but without cluttering your visible directory hierarchy. git checkout branchname will move you to another branch and change the visible filesystem in place to match the branch’s. So you can traverse your file structure as normal without even thinking about the fact that there are other branches. This comes in very handy for automated testing or building. You don’t have to repoint your testing infrastructure to a new directory to test multiple branches. If your branches have the same layout, you can just switch branches and run the same testing scripts.
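To make the in-place switching concrete, here is a small sketch with a throwaway repo. The branch name experimental, the file names and the placeholder commit identity are invented for the example; the first branch is explicitly named master so the demo behaves the same regardless of local Git defaults.

```shell
# Placeholder identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master

echo "stable" > app.txt
git add . && git commit -q -m "Stable version"

# Create a branch and give it a different directory layout.
git checkout -q -b experimental
mkdir lab
echo "wild idea" > lab/idea.txt
git add . && git commit -q -m "Try an idea"
ls        # the working directory now contains lab/ as well

# Switch back: same directory on disk, but lab/ is gone from view.
git checkout -q master
ls

# The experimental files are still safely stored in the repo.
git show experimental:lab/idea.txt
```

Nothing is copied to a second directory at any point; the same path on disk simply shows whichever branch is checked out.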

Merging is also made simple. Suppose you have an experimental branch that you want to merge into the master branch. First check out the master branch (Git will tell you if you’re already there) and just pull in the experimental branch like so:

git checkout master
git pull . experimental

You don’t have to worry about specifying revision numbers, and if some changes in experimental had already found their way into master (via a previous pull maybe), git will simply ignore them (or let you manually choose which edits to commit if there are conflicts). Git merging by example is an excellent article that explores branching and merging with a series of simple examples.
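For a version you can run end to end, here is a sketch of the same merge on a throwaway repo. It uses git merge experimental, which performs the merge directly; git pull . experimental gets there by first fetching from the current repo. File names, commit messages and the placeholder commit identity are made up.

```shell
# Placeholder identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master

echo "base" > base.txt
git add . && git commit -q -m "Base"

# Work on the experimental branch...
git checkout -q -b experimental
echo "feature" > feature.txt
git add . && git commit -q -m "Add feature"

# ...while master moves on independently.
git checkout -q master
echo "other" > other.txt
git add . && git commit -q -m "Unrelated change on master"

# Merge experimental into master; no revision numbers needed.
git merge -q -m "Merge experimental" experimental

# Merging again is harmless: everything is already in master.
git merge experimental

git log --oneline --graph
```

The second merge adds nothing to the history, which is the behaviour described above for changes that have already found their way into master.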

All of Git’s features make non-linear development easy. You don’t have to think too hard about branching off the main trunk or making changes to another part of your current branch. The Index and merging features make it easy to organize and alter changes after you’ve made them. You spend less time and effort managing your commits and you can devote more energy to actually dealing with your code base, safe in the knowledge that your changes will be recorded just as you want them to be. Whether you’re a one-man powerhouse hacking on a number of different projects in your spare time or a member of a committed team working to a tight schedule, Git can be a very powerful tool in your arsenal of code-management utilities. Use it well.

Moving from Subversion to Git

I just finished moving my files from Subversion to Git. Git is a distributed version control system first built by Linus Torvalds, and it has matured a lot since its creation. It is currently used by a number of important open source projects including the Linux kernel, Perl, Wine and Ruby on Rails. I chose to move to Git for a number of reasons:

  1. Distributed: So I can make commits even if I’m not online and have a complete history of changes.
  2. Easy branching and merging: I found myself keeping a ‘scratch’ folder in Subversion and only transferring changes to a working copy once I had finished all my changes. I feel that this defeats the purpose of having version control. Git supports easy branching so I can make experimental branches (which have their own histories) and then merge them back into the main branch when I’m ready.
  3. Normal commands like mv, cp, rm can be used as Git doesn’t track files individually.
  4. Interacts with SVN: I’ll need to use Subversion for my school projects, but I can have a personal Git repository where I make regular commits and only push to the team’s Subversion repo when nothing is broken.
  5. All the cool kids are using it.

I have two Subversion repos: one for my code and one for other documents. The move from Subversion to Git was actually quite smooth. My repositories currently live on my old G4 Powermac, so I decided to do the transition on that machine itself (though I didn’t need to, due to Git’s distributed nature). I had found a quick tutorial which I followed to do the actual move from a Subversion repo to a Git repo. I then did a quick git clone as follows on my laptop and desktop:

git clone ssh://domain.com/path/to/repo/

I could have used the simple Git server instead of SSH, but since I would be doing regular pulls and pushes (updates and upstream commits), I decided to just use SSH uniformly. Since then I’ve made changes to my Source repo and it has synced properly to my desktop. I ran into a problem where one of the git commands could not be executed on the remote machine. This turned out to be a problem with SSH on OS X, where the non-interactive shell started by Git didn’t have the Git binaries on its path. I didn’t research it very much because it turned out that adding the following line to my ~/.bashrc solved the problem.

export PATH="$PATH:/usr/local/git/bin"

This adds the Git directory to the user’s PATH and lets the clone run properly.

The only disadvantage is that I have to do a commit to my local repo and then push it to the repo on my server. However it’s still a simple process. Whenever I make a change I want to save I do a simple

git commit -a

and enter an appropriate message. Then before I leave my computer for a long time I do a

git push origin

which synchronizes my local branches with those on the server (exactly which branches get pushed depends on Git’s push.default setting). A simple git pull suffices to update the local repo.
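The whole round trip can be tried out locally with the sketch below. A bare repo in a temporary directory stands in for the repo on my server (the real one is reached over SSH), and the laptop and desktop directories, file name and placeholder commit identity are invented for the example. I push with an explicit origin HEAD so the demo doesn’t depend on any local push configuration.

```shell
# Placeholder identity so the demo commits work on any machine (assumption).
export GIT_AUTHOR_NAME="Example" GIT_AUTHOR_EMAIL="example@example.com"
export GIT_COMMITTER_NAME="$GIT_AUTHOR_NAME" GIT_COMMITTER_EMAIL="$GIT_AUTHOR_EMAIL"

work=$(mktemp -d)

# Stand-in for the server: a bare repo (no working copy).
git init -q --bare "$work/server.git"

# "Laptop": clone, commit locally, then push.
git clone -q "$work/server.git" "$work/laptop"
cd "$work/laptop"
echo "draft" > paper.txt
git add paper.txt
git commit -q -m "Start paper"     # saved locally only
git push -q origin HEAD            # now it is on the server too

# "Desktop": a second clone picks up the pushed work.
git clone -q "$work/server.git" "$work/desktop"

# More work on the laptop, committed and pushed before walking away.
cd "$work/laptop"
echo "more text" >> paper.txt
git commit -q -a -m "Extend paper"
git push -q origin HEAD

# Later, on the desktop, a pull brings everything up to date.
cd "$work/desktop"
git pull -q
cat paper.txt
```

The commit that is never pushed stays on the machine it was made on, which is exactly the extra step to remember compared to Subversion.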

I haven’t had a chance to use this system fully yet as I haven’t done much moving about. But classes start tomorrow so I hope to have a chance to use my new system properly. In particular I plan to make use of the easy branching and the SVN integration. My college’s lab machines don’t currently have Git installed on them, but I’m going to request to have Git installed. From what I’ve seen of the faculty, this shouldn’t be hard to accomplish. Here’s looking forward to the rest of the year using Git to boost my programming productivity.