Web interfaces for source control

I came across two articles about web-based interfaces for source control. The first is a critique of GitHub’s UI. The second is an explanation of some of the design choices for Sourcehut, a new 100% free and open source software forge.

If you’re interested in interfaces or software engineering tools, I highly recommend reading both. They are short, will only take a few minutes of your time, and may make you think about functionality you take for granted, or issues you’ve learned to ignore and live with.

Personally, I like GitHub’s general prettiness, but I agree that there are a lot of unnecessary UI elements, and not enough of (what I would consider) key features for effectively browsing source. The above-linked article mentions the difficulty of switching between individual files, history and branches, while links to Enterprise pricing or starring repos are on every page. Part of that can be chalked up to GitHub’s position in between a software forge and a social network (because we’re still in the phase where we think everything needs to be a social network).

To be fair, Sourcehut is a bit too spartan for my tastes. If nothing else, I like good use of whitespace and nice fonts. (Aside: consider using Fira Code or Triplicate for displaying source code.) And I can’t tell how to easily move between code and history views on Sourcehut either. But at least its motivations are clearer, the appearance issues can probably be solved with user style sheets, and if you’re really peeved about its choices, you can fork it (though it’s almost certainly not worth the effort).

I haven’t really used similar tools (except for a pretty barebones code diff and review tool at a company I worked at briefly), so I wonder if there are other examples that can provide interesting lessons.

Refactoring my personal Git repository

People usually talk about refactoring when they’re talking about code. Refactoring generally involves reorganizing and restructuring code so that it keeps the same external functionality but is better in some non-functional way. In most cases refactoring results in code that is better structured, easier to read and understand, and on the whole easier to work with. Now that my exams are over, I decided to undertake a refactoring of my own. But instead of refactoring code, I refactored my entire personal source code repository.

About a year ago I started keeping all my files under version control. I had two Subversion repositories, one for my code and another for non-code-related files (mostly school papers). A few months ago I moved from Subversion to Git, but my workflow and repository setup were essentially the same. When I moved to Git, I had a chance to change my repo structure, but I decided to keep it. The single repo would serve as one storage unit for all my code. Development for each project would take place on separate branches which would be routinely merged back into the main branch. The files in the repo were divided into directories based on the programming language they were written in. Probably not the most scientific classification scheme, but it worked well enough.

Fast forward a few months and things aren’t quite as rosy. It turns out that having everything in one repo isn’t really a good idea after all. The single most significant reason is that the history is a complete mess. Looking back at the log, I have changes to my main Python project mixed in with minor Emacs configuration changes as well as any random little experiment that I did and happened to commit. Not very tidy. Secondly, using a separate branch for each project didn’t quite work. I’d often forget which branch I had checked out and start hacking on some random thing. If I was lucky I could switch branches before committing and put the changes where they belonged. If I was unlucky, I was faced with the prospect of moving changes between branches and cleaning up the history. Not something I enjoyed doing. Finally, organization by language wasn’t a good scheme, especially since I took a course in programming languages and wanted to save the textbook exercises for each language. The result is that I now have a number of folders with just 2-3 files in them, for languages I won’t be using for a while. More importantly, getting to my important project folders meant digging 3-4 levels down from my home directory.

I decided last week that things had to change. I needed a new organization system that satisfies the following requirements:

  1. Histories of my main projects are untangled.
  2. My main projects stand out clearly from smaller projects and random experiments.
  3. If I start a small project and it gets bigger, it should be easy to give it main project status.
  4. Older projects that I won’t be touching again (or at least not for the foreseeable future) go into an archive.
  5. My schoolwork is kept separate from the other stuff.
  6. Everything is version controlled and I should be able to keep old history.

I’ve used a combination of Git’s repository splitting functionality and good old directory organization to make things cleaner. Everything is still tucked into a top-level src directory, but that’s where the similarities with my old system end. Each major project is moved to its own repo. Since I already had each major project in its own subdirectory, I could use Git’s filter-branch command to cleanly pull them out while retaining history. Every active project gets its own subdirectory under ~/src which holds a working copy of its repo. There is a separate archive subdirectory which contains the most recent copy of the projects that I’ve decided to file away. I technically don’t need this since the repositories are stored on my personal server, but I like having all my code handy.
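
For each project the extraction boiled down to a couple of commands, roughly like this (the repo and directory names here are placeholders, not my actual layout):

# clone the old combined repo into the new project’s directory
git clone ~/src/old-everything ~/src/myproject
cd ~/src/myproject
# rewrite history so only the project’s subdirectory (and its history) remains
git filter-branch --subdirectory-filter python/myproject -- --all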

I pooled together all my experimentation into a single repo called scratch. This also gets its own subdirectory under src. It currently holds a few simple classes I wrote while trying out Scala, some assembly code and a few Prolog files. My schoolwork also gets a separate repo and subdirectory. This contains code for labs in each class as well as textbook exercises (with subdirectories for each class and book). Large projects get their own repo and aren’t part of this schoolwork repo. Since I’m on break, they’re all stashed under archive.

The process of actually splitting the repo into the new structure was smooth for the most part. I followed the steps outlined by this Stack Overflow answer to extract each of my main projects into its own repo. I cloned my local repo to create the individual repos, but I still had to set up remotes for each of them on my server. I followed a really good guide to set up the remotes, but first I had to remove the existing remotes (which pointed to the local repo I had cloned from). A simple git remote rm origin took care of that.
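
Swapping out the remotes was just a couple of commands per repo (the server path below is made up for illustration):

# drop the remote pointing at the local repo the clone came from
git remote rm origin
# point origin at the bare repo on my server instead, then push
git remote add origin me@myserver:repos/myproject.git
git push origin master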

Things started to get a little more complicated when it came to extracting things that were spread out (and going into scratch). I wasn’t sure if filter-branch could let me do the kind of fine-tuned extraction and pooling I wanted to do. So I decided instead to create a scratch directory in my existing repo and then make that into a separate repo. I used the same process for extracting code that would go into my schoolwork repo.
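
In other words, the pooling went roughly like this (directory names are placeholders again):

# in the old combined repo: gather the experiments into one directory
cd ~/src/old-everything
mkdir scratch
git mv scala-experiments asm prolog scratch/
git commit -m "Collect experiments under scratch/"
# then split that directory out the same way as the main projects
git clone ~/src/old-everything ~/src/scratch
cd ~/src/scratch
git filter-branch --subdirectory-filter scratch -- --all
# note: with this approach only the history from the move onwards survives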

The whole process took a little over 2 hours with the third Pirates of the Caribbean movie playing at the same time. I’m considering doing the same thing with my non-code repo, though I’ll need to think out a different organization structure for that. Things were made a lot easier and faster by the two guides I found, and now that I have a good idea of what needs to be done, I’ll probably have an easier time next time around. I’ve come to learn a little more about the power and flexibility of Git. I still think I’m a Git newbie, but at least I know one more trick now. If any of you Git power users have any suggestions or alternate ways to get the same effect, do let me know.

A week with Git

It’s been almost a week since I moved to Git for my version control instead of Subversion. It’s been an interesting experience with some amount of learning to do. There are a number of extra things to keep in mind when using Git, but as time goes by, I’m quickly becoming used to them and they don’t seem bothersome at all.

The first major difference is going from centralized to distributed. I use version control to keep my files in sync across multiple computers as well. With Subversion, a commit would save my changes and also push them to my remote server. With Git, that is now two separate operations. I would have liked to have them happen as a single step, but honestly, it really isn’t much of a problem. I’ve become used to doing a git push just before I leave my computer.
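
In practice, the old one-step svn commit just becomes two commands (the message and remote name here are only examples):

# record the change in the local repository
git commit -am "Update class notes"
# then publish it to the remote server
git push origin master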

Once I got past that, the next thing that I had to deal with was the fact that a commit, even to the local repo, is actually two separate operations. You first have to add any changes made to something called the Index. It’s easiest to think of the Index as a sort of staging ground for your changes. You can selectively add files to the Index, building up a coherent set of changes, and then commit them together. The Index can take some getting used to, but it is easily one of Git’s killer features. Often enough, you end up editing a bunch of completely different files. They could be files in different projects, or files in the same project that do totally different things. When it comes time to commit, you’d like to be able to commit the different changes separately so that you can track exactly what changes you made by project. The Index lets you do exactly that. In fact, Git will actually let you select which set of changes to a particular file you want to add to the Index in a given add operation if you do a git add --patch filename. This gives you an unprecedented level of fine-grained command over how the change history of your files is saved. I haven’t had a chance to use this level of power yet, but as the semester progresses and I get involved in larger projects, I don’t think it’ll be long before Git saves the day. See this post for a more detailed discussion of how useful Git’s Index is.
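
As a small illustration of what the staging step looks like (the file names are made up):

# stage only one file out of several that changed, and commit it on its own
git add parser.py
git commit -m "Fix off-by-one in the parser"
# stage just some of the edits within a single file, hunk by hunk
git add --patch notes.tex
git commit -m "Expand the section on closures"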

Even if all the changes to the change history could be made only in the Index, that would be an incredible amount of power. But Git’s power stretches beyond the Index to the commit operation itself. The git commit --amend operation pushes the changes in the Index into the previous commit, which comes in handy when you forget a change or realize that you need to change something else. And if that isn’t enough, the git rebase --interactive command basically hands you a nuclear-powered swiss army chainsaw with which you can go to town on your repository. Needless to say, this kind of power isn’t to be used without some amount of care and consideration.
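
For example (the file name and commit count are arbitrary):

# fold a forgotten change into the most recent commit
git add forgotten-file.txt
git commit --amend
# open the last three commits for reordering, squashing or rewording
git rebase --interactive HEAD~3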

Git’s control over committing is impressive, but equally impressive are Git’s branching capabilities. In Subversion, a branch means copying the current files to another subdirectory in the same repository. This means that all branches are visible at all times and that anyone checking out the full repository will get all branches. This isn’t always what you want. Sometimes you don’t want everyone to see your changes until they’re ready for prime time, and you might want new developers to get the production branch and then later pull additional branches as needed. Since Git is in many ways a filesystem that supports tracking changes, the branch command works differently. A branch doesn’t require you to create a subdirectory in the same repo. Instead it creates a new information tree in Git’s internal filesystem. So you can have separate branches with very different directory structures, but without cluttering your visible directory hierarchy. git checkout branchname will move you to a new branch and change the visible filesystem in place to match the branch’s. So you can traverse your file structure as normal without even thinking about the fact that there are other branches. This comes in very handy for automated testing or building. You don’t have to repoint your testing infrastructure to a new directory to test multiple branches. If your branches have the same layout, you can just switch branches and run the same testing scripts.
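
Creating and hopping between branches is as simple as (the branch name is arbitrary):

# create a new branch from the current commit and switch to it
git branch experimental
git checkout experimental
# ...hack, commit...
# switch back; the working directory updates in place
git checkout master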

Merging is also made simple. Suppose you have an experimental branch that you want to merge into the master branch. First check out the master branch (Git will tell you if you’re already there) and just pull in the experimental branch like so:

git checkout master
git pull . experimental

You don’t have to worry about specifying revision numbers, and if some changes in experimental had already found their way into master (via a previous pull maybe), git will simply ignore them (or let you manually choose which edits to commit if there are conflicts). Git merging by example is an excellent article that explores branching and merging with a series of simple examples.
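
One nice touch: if you run the pull again when there is nothing new, Git simply tells you so, and the log can show how the branches came together (the exact output wording depends on your Git version):

git pull . experimental
# Already up-to-date.
git log --oneline --graph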

All of Git’s features make non-linear development easy. You don’t have to think too hard about branching off the main trunk or making changes to another part of your current branch. The Index and merging features make it easy to organize and alter changes after you’ve made them. You spend less time and effort managing your commits and you can devote more energy to actually dealing with your code base, safe in the knowledge that your changes will be recorded just as you want them to be. Whether you’re a one-man powerhouse hacking on a number of different projects in your spare time or a member of a committed team working to a tight schedule, Git can be a very powerful tool in your arsenal of code-management utilities. Use it well.

Save everything

The hard drive in my first computer was a mere 20GB. When we had to replace it three years later, we got one with a 40GB capacity. My current laptop (which was the first computer I bought on my own) has a 160GB hard drive. Even my pocket-sized iPod classic sports 80GB. Today you can get terabyte hard drives for a few hundred dollars. So the question is: with all this massive storage easily available, why is the user still prompted to save their work? Why not just save everything?

In a way, even though we have 21st century storage capacity, we still have 20th century storage techniques. Today’s filesystems, while certainly efficient at doing their job, don’t quite do the right job. After all, the bulk of today’s personal data isn’t in the form of documents that can be neatly sorted into hierarchical categories. Instead most of our data is in the form of pictures, music and video, for which the easiest interface is some sort of multi-property search rather than directory-style storage. As our data grows, metadata will become increasingly important.

But even if we have rich metadata, that’s still only going to take up a small amount of space. What I would really like to see is ubiquitous versioning. Any changes that I make to any files (including moving and renaming) should be easily viewable and I should be able to roll back to previous versions without any difficulty. Software developers have already been using robust versioning systems for decades, but I would like to see it become an inherent property of the file-storage system itself. Versioning goes hand in hand with backups, and while Apple’s Time Machine is a step in the right direction, it’s still got a while to go.

Another twist in the storage tale is that though local storage in the form of hard drives and flash drives is becoming dirt cheap, online bulk storage is cheaper still (and in some cases free). Unfortunately there is often quite some work to be done to get reliable online storage working seamlessly with your local machine. Like versioning, the technology is already out there, it just needs to be packaged into a convenient, always-available form.

So where do we start? I think Google Docs has shown a good starting point: instead of making the user explicitly save something, applications should just go ahead and save it anyway. If the user decides to actually keep it, she can then rename it to something meaningful and move it somewhere else. Perhaps there should be some sort of garbage collection where files that were autosaved and then untouched are deleted after a certain amount of time (after asking the user, of course). Or you could just save everything forever and only run garbage collection if disk space gets dangerously low.
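
Just to make the garbage collection idea concrete, here’s a minimal sketch, assuming autosaved files land in a hypothetical ~/Autosaved directory:

# list autosaved files that haven’t been touched in 30 days
find ~/Autosaved -type f -mtime +30 -print
# after the user confirms, the same expression could clean them up
find ~/Autosaved -type f -mtime +30 -delete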

Once you have a basic save-everything system, you could add versioning on top of that. I was hoping to find a versioning filesystem already in existence, but the closest things to a fully operational one that I could find were Wayback and CopyFS, which aren’t quite what I’m talking about yet. ZFS shows some promise with its copy-on-write and snapshot features. Hopefully it will only be a few more years before one of the major OS makers (or an open-source initiative) decides to bake version control into the filesystem (or at least tie them together closely). Once we have the capability to store such massive amounts of versioned data seamlessly, we need a way to find it all. WinFS would have gone a long way toward solving this problem, if it had ever gotten finished. I’ve personally come to see the shelving of WinFS as one of the greatest tragedies our industry has faced in recent times. The hierarchical file structure is being pushed to the limit and WinFS would have offered a good way forward. However, as personal data gets into the terabyte range, we will absolutely need filesystems that can work with rich metadata. Hopefully WinFS will be pulled out of mothballs, or Apple will come up with a working solution post Snow Leopard.
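
For what it’s worth, the ZFS snapshot workflow already gives a taste of filesystem-level versioning (the pool and dataset names here are made up):

# take a cheap copy-on-write snapshot of a dataset
zfs snapshot tank/home@before-reorg
# list the versions that exist
zfs list -t snapshot
# roll the dataset back to that point in time
zfs rollback tank/home@before-reorg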

Right now I’m stuck with my large hard drives hopelessly underutilized. I’ve started trying some home-grown solutions such as putting all my documents under version control. Over the next semester at college I’ll also be experimenting with S3 and trying to run a personal backup server. Hopefully I’ll be able to put all those gigabytes to work.