Sunday Selection 2017-06-25

Around the Web

The Largest Git Repo on the Planet

I’m always a fan of case studies describing real-world software engineering, especially when they cover deploying engineering tools and include charts and data. This article describes Microsoft’s efforts to deploy the Git version control system at a scale large enough to support all of Windows development.

Why our attention spans are shot

While it’s no secret that the rise of pocket-sized computers and ubiquitous Internet connections has precipitated a corresponding decrease in attention span, this is one of the most in-depth and well-researched articles I’ve seen on the issue. It references and summarizes a wide range of distraction-related issues and points to the relevant research if you’re interested in digging deeper.

Aside: Nautilus has been doing a great job publishing interesting, deeply researched, and well-written longform articles, and they’re currently having a summer sale. The prices are very reasonable, and a subscription would be a great way to support good fact-based journalism in the current era of fake news.

How Anker is beating Apple and Samsung at their own accessory game

I own a number of Anker devices — a battery pack, a multi-port USB charger, a smaller travel charger. The best thing I can say about them is that by and large, I don’t notice them. They’re clean, do their job and get out of my way, just as they should. It’s good to see more companies enter the realm of affordable, well-designed products.

From the Bookshelf

Man’s Search for Meaning

I read this book on a cross-country flight to California a couple of months ago, at a time when I was busy, disorganized, stressed and feeling like I was barely holding on. It is based on the author’s experience in Nazi concentration camps during World War II. The first part focuses on how the average person survives and reacts to life amid the brutality and extreme cruelty of a concentration camp. The second part introduces Frankl’s theory of meaning as expressed in his approach to psychology: logotherapy. In essence, the meaning of life is found in every moment of living, even in the midst of suffering and death.


Black Panther Trailer

I’m a big fan of Ta-Nehisi Coates’ run of Black Panther and really enjoyed the Black Panther’s brief appearance in Captain America: Civil War. This trailer makes me really excited to see the movie when it comes out, and hopeful that it will be done well. If you’re new to the world of Wakanda in which Black Panther will be set, Rolling Stone has a good primer.

Show Git information in your prompt

I’ve been a sworn fan of version control for a good few years now. After a brief flirtation with Subversion I am currently in a long term and very committed relationship with the Git version control system. I use Git to store all my code and writing and to keep everything in sync between my machines. Almost everything I do goes into a repository.

When I’m working I spend most of my time in three applications: a text editor (generally Emacs), a terminal (either iTerm2 or Gnome Terminal) and a browser (Firefox or Safari). When in Emacs I use the excellent Magit mode to keep track of the status of my current project repository. However my interaction with Git is generally split between Emacs and the terminal. There’s no real pattern, just whatever’s easiest and open at the moment. Unfortunately when I’m in the terminal there’s no visible cue as to what the status of the repo is. I have to be careful to run git status regularly to see what’s going on. I need to manually make sure that I’ve committed everything and pushed to the remote server. Though this isn’t usually a problem, every now and then I’ll forget to commit and push something on one of my machines, go to another and then realize I’ve left behind all my work. It’s annoying and kills productivity.

Over the last few days I decided to sit down and give my terminal a regular indicator of the state of the current repository. So without further ado, here’s how I altered my Bash prompt to show relevant Git information.

Extracting Git information

There are generally three things I’m concerned about when it comes to the Git repo I’m currently working on:

  1. What is the current branch I’m on?
  2. Are there any changes that haven’t been committed?
  3. Are there local commits that haven’t been pushed upstream?

Git provides a number of tools that give you a lot of very detailed information about the state of the repo. Those tools are just a few commands away, but I don’t want to see everything there is to see at every step. I just want the minimum information needed to answer the above questions.

Since the bash prompt is always visible (and updated after each command) I can put a small amount of text in the prompt to give me the information I want. In particular my prompt should show:

  1. The name of the current branch
  2. A “dirty” indicator if there are files that have been changed but not committed
  3. The number of local commits that haven’t been pushed

What is the current branch?

The symbolic-ref command shows the branch that the given reference points to. Since HEAD is the symbolic reference for the current state of the working tree, we can use git symbolic-ref HEAD to get the full branch reference. If we were on the master branch we would get back something like refs/heads/master. We use a little Awk magic to get rid of everything but the part after the last /. Wrapping this into a little function we get:

function git-branch-name {
    git symbolic-ref HEAD 2>/dev/null | awk -F/ '{print $NF}'
}
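The Awk step can be checked on its own by feeding it a sample ref (the ref name here is just an illustration, not taken from a real repo):

```shell
# Split the ref on "/" and print the last field ($NF).
echo "refs/heads/master" | awk -F/ '{print $NF}'   # prints: master
```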

Has everything been committed?

Next we want to know if the branch is dirty, i.e. if there are uncommitted changes. The git status command gives us a detailed listing of the state of the repo. For our purposes the important part is the very last line of the output. If there are no outstanding changes it says “nothing to commit (working directory clean)”. We can isolate the last line using the Unix tail utility and if it doesn’t match the above message we print a small asterisk (*). This is just enough to tell us that there is something we need to know about the repo and that we should run the full git status command.

Again, wrapping this all up into a little function we have:

function git-dirty {
    st=$(git status 2>/dev/null | tail -n 1)
    # Note: newer versions of Git word this line differently
    # ("nothing to commit, working tree clean"), so adjust to match.
    if [[ $st != "nothing to commit (working directory clean)" ]]; then
        echo "*"
    fi
}
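The tail step can likewise be tried without a repo; the sample text below is a stand-in for what git status prints on a clean checkout:

```shell
# Keep only the last line of multi-line output.
printf 'On branch master\nnothing to commit (working directory clean)\n' | tail -n 1
# prints: nothing to commit (working directory clean)
```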

Have all commits been pushed?

Finally we want to know if all commits have been pushed to the respective remote branch. We can use the git branch -v command to get a verbose listing of all the local branches. Since we already know the name of the branch we’re on, we use grep to isolate the line that tells us about our branch of interest. If we have local commits that haven’t been pushed the status line will say something like “[ahead X]”, where X is the number of commits not pushed. We want to get that number.

Since what we’re looking for is a very well-defined pattern I decided to use BASH’s built-in regular expressions. I provide a pattern that matches “[ahead X]” where X is a number. The matching number is stored in the BASH_REMATCH array. I can then print the number, or nothing if no such match is present in the status line. The function we get is this:

function git-unpushed {
    brinfo=$(git branch -v | grep "$(git-branch-name)")
    if [[ $brinfo =~ ("[ahead "([[:digit:]]*)]) ]]; then
        echo "(${BASH_REMATCH[2]})"
    fi
}

The =~ is the BASH regex match operator and the pattern used follows it.
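Here’s a standalone sketch of the operator in action; the sample status line is made up for illustration:

```shell
# Match "[ahead N]" and capture the digits into BASH_REMATCH.
line="* master abc1234 [ahead 3] Fix typo"
if [[ $line =~ \[ahead\ ([[:digit:]]+)\] ]]; then
    echo "${BASH_REMATCH[1]}"   # prints: 3
fi
```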

Assembling the prompt

All that’s left is to tie together the functions and have them show up in the BASH prompt. I used a little function to check if the current directory is actually part of a repo. If the git status command returns only an error and nothing else then I’m not in a Git repo and the functions I made would only give nonsense results. This function checks the git status output and then calls the other functions or does nothing.

function gitify {
    status=$(git status 2>/dev/null | tail -n 1)
    if [[ $status == "" ]]; then
        echo ""
    else
        echo "$(git-branch-name)$(git-dirty)$(git-unpushed)"
    fi
}

Finally we can put together the prompt. BASH allows for some common system information to be displayed in the prompt. I like to see the current hostname (to know which machine I’m on if I’m working over SSH) and the path to the directory I’m in. That’s what the \h and the \w are for. The Git information comes after that (if there is any) followed by a >. I also like to make use of BASH’s color support.

function make-prompt {
    local RED="\[\033[0;31m\]"
    local GREEN="\[\033[0;32m\]"
    local LIGHT_GRAY="\[\033[0;37m\]"
    local CYAN="\[\033[0;36m\]"

    PS1="${CYAN}\h\
${GREEN} \w\
${RED} \$(gitify)\
${GREEN} >\
${LIGHT_GRAY} "
}

make-prompt
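To make this permanent, the functions and a call to make-prompt can live in your BASH startup file. A minimal sketch, assuming you save the functions in a file called ~/.bash_prompt (that filename is my own convention, not anything standard):

```shell
# In ~/.bashrc:
if [ -f ~/.bash_prompt ]; then
    source ~/.bash_prompt   # defines git-branch-name, gitify, make-prompt, etc.
    make-prompt             # sets PS1 once at shell startup
fi
```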



I like this prompt because it gives me just enough information at a glance. I know where I am, if any changes have been made and how much I’ve diverged from the remote copy of my work. When I’m not in a Git repo the Git information is gone. It’s clean, simple and informative.

I’ve borrowed heavily from both Jon Maddox and Zach Holman for some of the functionality. I didn’t come across anyone showing the commit count, but I wouldn’t be surprised if lots of other people have it too. There are probably other ways to get the same effect, this is just what I’ve found and settled on. The whole setup is available as a gist so feel free to use or fork it.

Thinking about Documentation

My friend Tycho Garen recently wrote a post about appreciating technical documentation. As he rightly points out, technical documentation is very important and also very hard to get right. As someone who writes code, I find myself in the uncomfortable position of having my documentation spread out in at least two places.

A large part of my documentation is right in my code in the form of comments and docstrings. I call this programmer-facing documentation. It is documentation that will probably only be seen by other programmers (including myself). However, just because it might only be seen by programmers who are using (or changing) the code doesn’t mean that it should live only in the code. More often than not, it’s advisable to have this documentation exported to some easier-to-read format (generally hyperlinked HTML or PDF). Of course I don’t want everyone who wants to use my software to go digging through the source code to figure out how things work. A user manual is generally a good idea for your software no matter how simple or complex it might be. At the very least there should be a webpage describing how to get up and running.

One of the major issues of documentation is that it’s either non-existent or hopelessly out of date. A large part of the solution is simply effort and discipline. Writing good comments and later writing a howto are habits that you can cultivate over time. That being said, I’d like to think that we can use technology to our benefit to make our job easier (and make writing and updating documentation easier).

Personally I would love to see programming languages grow better support for in-source comments. Documentation tools like Javadoc and Epydoc certainly help in generating documentation and give you a consistent, easy-to-understand format, but the language itself has no idea about what the comments say. They are essentially completely separate from the code even though they exist side by side in the same file. I would love it if languages could work together with the documentation, say by autogenerating parts of it, or doing analyses to detect inconsistencies.

As for documentation that lives outside of the code, I’m glad to see that there is a good deal of really good work being done in this area. Github recently updated their wiki system so that each wiki is essentially a git repo of human-readable text files that are automatically rendered to HTML. Github’s support for Git commit notes and their excellent (and recently revised) pull requests systems provide really good systems for maintaining a conversation around your code. The folks over at Github understand that code doesn’t exist by itself and often requires a support structure of both documentation and discussion surrounding it to produce a good product.

So what’s my personal take on the issue? As I’ve said before, I’m starting work on my own programming language and I intend to make documentation an equal partner to the code. I plan on making use of Github Pages to host the documentation in readable form right next to my source code. At the same time, I’m going to give some thought to making documentation a first-class construct in the language. That means that the documentation you write is actually part of the code instead of being an inert block of text that needs to be processed externally. The Scribble documentation system built on top of Scheme has some really interesting ideas that I would love to look into and perhaps adapt. Documentation has always been recognized as an important companion to coding. I’m hoping that we’re getting to the stage where we actually pay attention to that nugget of common wisdom.

Release schedules and version numbers

I just finished a major rewrite and overhaul of my long-term research project and pushed it out to other students who are working with me on it. In the process I rewrote large parts of it to be simpler code, cleaned up the code organization (everything is neatly divided into directories instead of being spread throughout the toplevel), added comments and rewrote the documentation to actually describe what the program did and how to use it. But it wasn’t just a pure rewrite and refactoring. I added at least one important new feature, added a completely new user interaction mode and changed the code architecture to explicitly support multiple interfaces. But the thing is that even though I’ve “shipped” it, it’s still not quite done.

There are significant parts missing. The unit testing is very, very scant. There is almost no error handling. The previous version had a GUI which I need to port to the new API/architecture. I also want to write one more interaction mode as a proof of concept that it can support multiple, different modes. The documentation needs to be converted to HTML mode and there are some utility functions that would be helpful to have. In short, there’s a lot that needs to be done. So my question is, what version of my code is this?

I started a rewrite of this last summer as well but never finished, a casualty of the classic second-system effect. For a while I considered calling this version 3.0, counting the unfinished copy as 2.0. But I decided that was rather silly and so I’ve actually called it 2.0. Though it’s certainly a major change from the last version, in some ways it’s still broken and unfinished. Is it a beta? Or a release candidate? I suppose that’s a better description. Except the additions I want to make amount to more than just moving it from a beta to a full release. The GUI would definitely be a point release.

In many ways the debate is purely academic and kinda pointless. As I’ve written before, software is always beta. However, releasing major and minor “versions” of software is a popular activity. In some ways it’s helpful to the user. You can tell when something’s changed significantly and when you need to upgrade. In an age where you had to physically sell software, that was a good thing to know. However, the rise of web-based software has changed that to a large extent. If you’ve been using Gmail for a while, you’ll know that it has a history of small, regular atomic improvements over time. And it’s not just Gmail, it’s most of Google’s online services. Sometimes there are major facelifts (like Google Reader a few years ago) but by and large this gradual improvement works well. Google Chrome also uses this model. Chrome is officially past version 5 now. But thanks to its built-in auto-update mechanism you don’t need to care (and I suspect most people don’t). Rolling releases are clearly acceptable and may just be the way software updates are going to go in the future. Of course, if you’re charging for your code you’re going to have some sort of paywall, so no, manual software updates probably won’t go away forever.

Coming back to my original question, what version did I just release? 2.0? 2.0 beta 1? 1.9.5? Honestly I don’t really care. Part of my disinterest stems from the fact that Git makes branching and merging so easy. It’s hard to care about version numbers and releases when your code is in the hands of a system that makes it so easy to spin off feature branches and then merge them back in when they’re ready. If I worked in a fully Git-based team I’d just have everyone running daily merges so that everyone automatically got the new features. In that case I wouldn’t have waited to release. The big new feature would have been pushed a week ago, the reorganization and cleanup after that and then the documentation yesterday. I’d also send out the later updates and additions one at a time as they were done. Everyone else uses SVN, but there might still be a way to do something similar.
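For what it’s worth, the branch-and-merge dance I keep praising is only a handful of commands. A minimal sketch in a throwaway repository (the directory, file and branch names are all made up):

```shell
set -e
dir=$(mktemp -d)                        # throwaway playground repo
cd "$dir"
git init -q
git config user.email demo@example.com  # commit identity for the demo
git config user.name Demo
echo "core" > core.txt
git add core.txt
git commit -qm "initial work"
main=$(git symbolic-ref --short HEAD)   # default branch name varies by Git version
git checkout -qb feature/gui            # spin off a feature branch
echo "gui" > gui.txt
git add gui.txt
git commit -qm "add the GUI"
git checkout -q "$main"                 # back to the main line
git merge -q feature/gui                # merge it back in when it's ready
git branch -d feature/gui >/dev/null    # clean up the merged branch
```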

In conclusion: rolling releases are awesome. Users don’t have to worry about manually updating and automagically get new features when they’re available. Developers using a good version control system can be up-to-date with everyone else’s code. This is especially important if you’re writing developer tools (which I am): the faster and easier you can get your updates to the people making the end product the faster the end product gets developed.

PS. If you’re wondering what exactly it is I’m making, more on that later. There’s a possibility of a source release after I talk to my professor about it.

My confession: I’m a data hoarder

There’s this new show on A&E TV called Hoarders. From the show’s website, each episode “is a fascinating look inside the lives of two different people whose inability to part with their belongings is so out of control that they are on the verge of a personal crisis”. It’s an interesting show about people who, quite simply, have too much stuff. I’ve watched a few episodes; it’s somewhat repetitive, and strangely addictive in the way that only these things can be. Though I never gave the show much thought after I finished watching an episode, a few days ago I had a strange epiphany: I might be a data hoarder.

Here’s the gist of it: I’m afraid of losing data. It’s not that I have a ton of important stuff which I use regularly; in fact much of what I have on my hard drive (besides my music and pictures) is stuff I will probably never actively use again. What I’m actually afraid of is that someday I’m going to want some file (or some specific version of some file) and I won’t be able to find it. Now even if I do have the file, I might not find it due to poor organization and data retrieval systems, but that’s a matter for another blog post. What I’m afraid of is pure, simple data loss: I start working on a project, which I only have one copy of, and something happens to that one copy, whether it be a hard drive crash or just human error and accidental deletion. And then I have to start all over again, with no real idea of what I did the first time.

Now, thanks to technology I’ve been able to deal with my hoarding instincts, without having dozens of different versions littering my hard drive and doing manual backups every week. At the heart of my system is Git, which lets me keep everything that’s important to me in strict version control. It also lets me easily keep files in sync between different machines, which is a problem I still haven’t completely solved (especially for public machines). By keeping things in sync between three different machines, I have backups in three completely different (as in physically separate) places.

The second thing that keeps my data under control is Amazon’s S3, with JungleDisk. Once a week, this ships all my Git repositories, music, pictures and various software installers to Amazon’s massively distributed storage servers for less than $5 a month. The choice was either this, paying as I go, or buying a terabyte hard disk. Personally I think I made the right choice: my backups are not only safe and secure in a faraway place, but I also didn’t have to shell out a lump sum in one go.

Now all this was fine, but lately I’ve been having this urge to record everything. And I mean everything. There are all my tweets and dents which go out into the ether of cyberspace, which I might someday want to have on record. There are all the websites I visit and most recently all the music I listen to and the movies I see. In a perfect world, I would have all my tweets saved to a Git repository, and all the DVDs I watch and music I listen to would be instantly ripped and placed in cold storage in an Amazon bucket (or a terabyte disk). And this may not be a good thing, since I wouldn’t watch most of the DVDs a second time and I have no idea why I would want to save my tweets (or ever look them up).

In the past week I’ve been sorely tempted to actually buy a terabyte hard drive and start manually ripping all the DVDs that I watch. I even went so far as to install Handbrake on my Mac Mini. I’ve been trying very hard to override the temptation with my logic (and laziness). It’s been hard but I’ve been successful so far. Underneath this is perhaps a more important issue: how much data is enough and how safe is safe enough? Keeping my own created data completely backed up in multiple places I think is perfectly acceptable, but I think that ripping all the DVDs is borderline obsessive. It would be an interesting thing too, and might be worth something in terms of street geek cred, but it’s not something that I can seriously see myself doing (and it’s possibly illegal too).

So there you have it: I’m a data hoarder, or at least I have data hoarding tendencies. No, I don’t need an intervention yet, and I don’t need treatment. In fact, I think I’m at the point where I’m reliably saving and backing up everything that I create (that’s more than 140 characters) but not randomly saving everything that I come into contact with. Maybe in another place and time I will actually be saving all my movies as well, but that will probably mean actually buying DVDs and having a properly organized collection, instead of borrowing them from the library. For the time being, I trust my digital jewels to Git, three computers and Amazon S3.