Amazon S3 as a personal backup service

I’ve been interested in online backups of my data for a long time. At first I simply kept my data on a free FTP server. With the rise of easy-to-use Web 2.0 online storage, I tried a number of solutions, including Box.net. I also made a backup to my Gmail account once using Gspace. When I tried the web office suites from Zoho and Google, I also uploaded most of my documents. But time went by, and I never really managed to stick to a single solution. I’ve also made DVD backups of my data occasionally, but never with any regularity. A few months ago, I decided to drop third-party solutions and instead set up a personal Subversion server on an older Mac that I had. This system has been pretty effective at keeping my day-to-day files backed up and synced across multiple machines.

While the home Subversion system does a good job of keeping backups of my most important documents, I don’t put everything in my SVN repository. In particular, my music and my photos live mostly on a single machine (my Mac Mini). I’ve recently started looking for an efficient way to create large-scale backups of all my files at less regular intervals than my SVN commits. I considered using an external hard drive, but even if I backed up all my files, I would really be using only a small amount of the space available. Plus, being a college student, I didn’t want another piece of hardware to lug around. Also, I wanted a backup that I could access no matter where I was: something online.

I looked at the Web 2.0 solutions, but none of them are cost-effective for the large amounts of data I plan on backing up. My music weighs in at around 16GB and will keep growing. My photos are about 2GB and will also keep growing. I would also like to back up copies of software discs I have paid for. There aren’t many of them, but still a few gigabytes’ worth. Add to that all my regular documents and other files. All told, I’m looking at something in the range of 25GB for a first-time backup, growing over time. Considering that I’m a starving college student, the cheaper the better.

Enter Amazon Simple Storage Service

Amazon S3 is an industry-standard storage solution that can easily handle many terabytes of storage and bandwidth. What’s also important is that it is very cheap. For the amount of storage I’ll be using, I’ll be spending 15 cents per GB per month for storage and 10 cents to upload each gigabyte. Here’s a rough calculation of what my costs will be like:

Initial Backup: ($0.15 per GB * 25GB) + ($0.10 per GB * 25GB) = $3.75 + $2.50 = $6.25

Monthly Running Cost: ($0.15 per GB * 25GB) + ($0.10 per GB * 2GB) + ($0.17 per GB * 1GB) = $3.75 + $0.20 + $0.17 = $4.12

For the running cost I estimated an upload of 2GB a month and a download of 1GB, the latter at 17 cents per GB. Though this is probably more than I will actually be using (especially the download), it should be a fairly good estimate since the amount stored will be gradually going up. So for less than the price of a regular lunch I’ll be able to keep all my important files safely backed up online.
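The arithmetic above can be sketched as a small Python function, which makes it easy to re-run the estimate as the backup grows. The prices are the ones quoted in this post; the function and constant names are made up for illustration, and Amazon's actual billing may differ in its details.

```python
# Prices quoted in the post, in US dollars per GB.
STORAGE_PER_GB_MONTH = 0.15
UPLOAD_PER_GB = 0.10
DOWNLOAD_PER_GB = 0.17


def monthly_cost(stored_gb, uploaded_gb, downloaded_gb=0.0):
    """Return the estimated bill for one month, rounded to cents."""
    total = (
        stored_gb * STORAGE_PER_GB_MONTH
        + uploaded_gb * UPLOAD_PER_GB
        + downloaded_gb * DOWNLOAD_PER_GB
    )
    return round(total, 2)


if __name__ == "__main__":
    # First month: store 25GB and upload all 25GB.
    print("Initial backup:", monthly_cost(25, 25))
    # Later months: store 25GB, upload 2GB, download 1GB.
    print("Typical month:", monthly_cost(25, 2, 1))
```

Changing the first argument shows how the bill would scale as the music and photo collections grow.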

The Catch

The catch is that Amazon S3 is not meant to be a storage solution for general users. It’s an enterprise-quality system made to plug directly into a high-performance online service. As a result, S3 offers a fully functional API to write programs around, but there’s no easy-to-use interface for users to manage their uploads. Luckily, there are a number of third-party tools available that fill the need. Here’s a somewhat outdated list of some available tools. The client that I’ll be using is called JungleDisk. It’s a wonderful cross-platform tool that maps your S3 storage as a storage drive on your computer. This means that you can use it as you would any storage disk attached to your machine, and you can also run scripts that automatically back up data to S3 from other parts of your computer. JungleDisk also provides its own automation facilities to back up your data regularly. No more having to remember to back up once a month.
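Because the S3 storage shows up as an ordinary drive, a backup script only needs to copy files into a folder. Here is a minimal sketch of such a script in Python, assuming the drive is mounted somewhere on the filesystem. The function name is hypothetical, and the "changed" check (size plus modification time) is a simple heuristic, not how JungleDisk itself decides what to upload.

```python
import shutil
from pathlib import Path


def backup_changed_files(source, destination):
    """Copy files from source to destination that are new or modified.

    Returns the list of relative paths that were copied.
    """
    source = Path(source)
    destination = Path(destination)
    copied = []
    for src_file in sorted(source.rglob("*")):
        if not src_file.is_file():
            continue
        relative = src_file.relative_to(source)
        dst_file = destination / relative
        if dst_file.exists():
            src_stat = src_file.stat()
            dst_stat = dst_file.stat()
            # Treat a file as unchanged if the size matches and the
            # backup copy is at least as recent as the original.
            unchanged = (
                src_stat.st_size == dst_stat.st_size
                and src_stat.st_mtime <= dst_stat.st_mtime
            )
            if unchanged:
                continue
        dst_file.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src_file, dst_file)  # copy2 preserves the modification time
        copied.append(relative)
    return copied
```

Calling `backup_changed_files` with a photo folder as the source and a folder on the mounted drive as the destination would upload only what is new since the last run, which also keeps the upload charges down.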

JungleDisk costs $20.00, but I think that’s an acceptable price, considering that you can install it on as many machines as you like (including Mac, Windows and Linux systems) and you get free lifetime updates, meaning you never pay for the software again. For a dollar a month you can get the JungleDisk Plus service, which lets you access your files via a web interface, resumes uploads if they are interrupted and uploads only the changed parts of large files (hence saving upload costs). At this point, I don’t think I’ll have a need for Plus, but it’s a good choice if you travel a lot and plan on using S3 as your primary syncing mechanism.

Starting next month…

I’ll be backing up to S3 regularly via JungleDisk. I plan on making the initial transfer this weekend (while recovering from Halloween parties). Before that I need to get my files organized and decide what I will back up and what I won’t. I’ll post a follow-up once I’ve been using the service for a while to see if it really is worth the cost.

Google and Wikipedia as the gatekeepers of the Internet

In January of last year, I read a post on Coding Horror about how Google was gradually becoming the starting point of the Internet. Though Jeff Atwood’s points were certainly well made and valid, I really didn’t think much of it at the time. In the year and a half since then, some things have changed. Google has been moving away from pure search while still using its position as a focal point on the Internet. Chrome and Android both increase Google’s prominence in the computing world. However, I hadn’t quite realized that riding on Google’s pre-eminence is another Internet powerhouse: Wikipedia.

My story begins like this. I was doing research on IBM’s original Personal Computer and how its BIOS was reverse engineered by Compaq to produce clones. I was focusing on the ethical issues of reverse engineering and of course, I turned to Google to find online information sources. Here are some of the search queries I typed in:

  • IBM PC
  • Reverse Engineering
  • Utilitarianism
  • Kant
  • Categorical Imperative
  • Social Contract
  • Compaq

In all but the last search, the first search result was the corresponding Wikipedia article. I found it interesting and somewhat unnerving that even IBM’s own website is the second result when searching for IBM’s most famous product.

The duopoly that is beginning to form is quite interesting. Google is gradually positioning itself as the chief filter and navigator for the web. Competitors like Yahoo! and MSN would require a massive, perhaps combined, effort to take on Google Search in any serious way. Newcomers like the much-hyped Cuil are simply not good enough. With Google’s multipronged effort to cut into both business computing (Google Apps) and the common man’s net experience (Chrome and Android), it’s unlikely that this trend will reverse itself anytime soon.

Wikipedia is becoming the most easily accessible content provider for an increasing range of not-too-specific information (and a fair amount of specific information too). How many school or college projects today don’t involve Wikipedia in some way? Even though professors and teachers might wholeheartedly (and maybe with good reason) insist that Wikipedia cannot be cited as a reputable source, the fact remains that for many students (and for many other people) research about a topic begins (and in many cases ends) with Wikipedia. Wikipedia is becoming the Wal-Mart of the Internet. Its information may not be great, but it’s good enough and it’s cheap in terms of money, time and effort.

So What?

Well, perhaps nothing at all. After all, Google hasn’t shown any signs of taking over yet. I trust Google enough to give Gmail all my email. Even my college email is routed through Gmail because it’s the most efficient solution out there. I use Google Reader to gather information from around the web. I’m a fairly regular user of Google Calendar and Google Docs. I trust that no human eyes are viewing my information or reading my documents and that the software running on Google’s server farms does its job well. Their ads are discreet and unobtrusive.

Google makes money. Lots of it. Billions of dollars every year. And not just for itself. There are thousands of businesses that make millions off Google Ads, and many popular businesses get some 70% of their business via Google search. A lot of Google’s money goes to supporting open source projects in a number of ways, as well as funding university research. I would rather have Google in control of that cash flow than not have it at all.

However, it cannot be denied that Google is quickly and surely becoming the Internet. As Jeff Atwood tells us, if your website is not on Google, it might as well not exist. Rich Skrenta, possibly the creator of the first personal computer virus, is not exaggerating by much when he tells us that the Internet is essentially a single point marked G connected to 10 billion destination pages. If we were to follow the more conventional analysis of the Internet as a weighted graph, the weights given to Google’s outgoing links far outweigh those given to any other (with the possible exception, in some cases, of outgoing links from Wikipedia).

And into this, Wikipedia fits perfectly. In the free world of the Internet, it’s hard for businesses to make money by selling pure information. But pure information is the lifeblood of the Internet, its first cause for existence. And it’s this need for information that Wikipedia serves. What you can’t buy, you can get for free on Wikipedia (mostly). Wikipedia cannot survive alone. It needs efficient search to make proper use of all its information. In exchange, it acts as a sort of secondary filter: after Google’s search filters away the cruft and deadwood that litters the Internet in the form of spam, porn and obsolete webpages, Wikipedia steps in to provide a mostly reliable core from which to branch off to other points or from which to draw inspiration for more searches. Sure, you can decide not to use Wikipedia, and pay the consequent price of having to sort through the mass of knowledge by yourself (though perhaps with the help of Google search). But would you really want to?

If you want to fight Goliath …

You’ll need more than a slingshot. Google and Wikipedia are both fairly well entrenched in their respective areas. And the tasks you’ll have to accomplish to shake them are certainly Herculean, if not harder. To beat Google, you’ll have to start with an equally good algorithm. New competitors like Cuil aren’t all that bad, but they’re not good enough. With the rise of rich media, your search engine will have to be able to get to pictures, videos, perhaps even Internet radio stations. Of course, now that Google has branched away from search, you’ll have to take that into account, or at least team up with someone who can. People are more comfortable using a unified interface and a single way of doing things than a bunch of smaller ones. Google still has some work to do in that area. After that, you need a way to make people money. Breaking into Google’s ad empire might be harder than making a dent in the search market.

As for Wikipedia, you would need to find a way to collect a vast amount of information on a variety of topics and also keep it up to date. That’s hard and expensive to do with a proprietary model, and with an open system there is the problem of people abusing it. Then there’s the question of actually getting people to use your resource. Giving it away for free is no longer good enough. You’ll have to offer something that Wikipedia doesn’t. And Wikipedia offers a lot.

Neither task is for small players. So who could do it? Someone with deep pockets, for one thing. The battle for the Internet isn’t going to be over in a flash; it’ll be a long, protracted war lasting years (if it’s ever fought, that is). Speaking of Flash, Adobe and the Flash platform are another strong player in the arena, though in a slightly different way. Adobe and Google have mostly non-overlapping interests. However, a partnership between Adobe and another strong player, such as Yahoo!, Microsoft or Amazon, might just tip the balance. A combination of online software built with Flash running on backends from Yahoo! or Microsoft would be a serious contender to Google’s AJAX web platforms. At the same time, it might be more beneficial for Adobe to join hands with Google, especially since Flash is already YouTube’s backbone. Tightly integrating Flash with Chrome might cement Flash’s position as the rich content platform of the Internet (with Google Ads thrown into the mix).

Of course, I’m probably getting far ahead of myself. Any serious competition to Google would involve a concerted effort by a number of interests over an extended period. That seems unlikely to happen with the current mess of competing interests, standards and technologies. For the time being at least, the Internet is still a point labeled G. The 10 billion connections are purely coincidental.

Using programmable tools

Once upon a time, if you wanted to run software on your computer, you had to write it yourself. Any self-respecting home computer came with a simple programming language (generally some version of BASIC) which could be used to write business applications, games and a fair variety of other programs of differing complexity. But at some point in the last few decades, programming dropped off the radar of most computer users. Nowadays, if you say that you use your computer mostly to write programs, people will give you a weird look.

I’ve recently started learning to use GNU Emacs for my programming work, and I’ve fallen in love with it. Emacs is essentially a Lisp interpreter running on top of a small C core. What this means is that programmability is at the very core of Emacs’ design, not an add-on bolted on later. The result of this decision is easy to see in the wide variety of Emacs Lisp applications available on the web. There are file managers, web browsers, email clients and a whole host of other things that help make sure you can do most of your computing inside Emacs. It’s not for nothing that Emacs is often referred to as an operating system squeezed into a text editor.

However, the importance of Emacs isn’t that there is so much add-on software written for it; it’s that it is so easy to customize and bend to your will. For a new Emacs user like me, the first taste of the power underlying Emacs comes in simple things like being able to change key bindings easily and load specific modes for certain file types (in my case, Standard ML, HTML and Python). The real benefits come later as you make more and more customizations. Emacs changes from a general-purpose text editor into a specialized tool that is gradually molded to your particular way of working. Using the custom hooks that I’ve made for my interpreters, I can already see a definite speedup in the edit-test-debug cycle that I go through all the time. I still write fewer than 200 lines of code a week on average, and at this rate the improvements aren’t very noticeable. However, for someone who spends 20 to 40 hours a week writing code, the speed advantage can be tremendous, even counting the time taken to add customizations. It also helps that the custom code is in Emacs Lisp, a language that requires surprisingly few lines of code to get useful work done.

But Emacs isn’t the only piece of customizable software out there. Stumpwm is to window managers what Emacs is to text editors. It’s a tiling window manager written in Common Lisp, and it can be customized on the fly without the need for a restart. Using a lean, tiling window manager like Stumpwm can take some getting used to. But once again, the time spent learning is quickly made up for. You no longer have to spend time spreading out windows so that you can see everything, and moving applications across virtual desktops is a snap. The Common Lisp programmability means that you can quickly add keybindings for commonly used functions and programs. The modeline, which takes the place of the panel in most other window managers, is also programmable, and you can pipe output from command-line programs directly to it.

Using programmable tools takes a fair amount of commitment. It’s very tempting to just be lazy and accept the defaults, even though in the long run it might cost more effort. But if you spend a substantial part of your day writing code, then there really is no reason why you can’t take a few minutes out and adapt your tools to your workflow. A fair amount of programming work involves repetition: running compile jobs, tests, version control commits, file uploads and so on. Often there are a bunch of different programs you need to get the job done. If you can gain even a small advantage by automating away some of these tasks, or at least combining them into a single interface, then it is to your benefit to do so. IDEs try to do some of this work for you. Unfortunately, IDEs are always made by other people, so they may not gel well with your individual style and how you like things to be. Programmable tools might not have the beautiful interface of an IDE, but they are more powerful in that you have the freedom to build up your own IDE just as you want it.
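As one small illustration of combining repetitive steps into a single command, here is a sketch in Python that runs a list of commands in order and stops at the first one that fails. The step names and the placeholder commands are assumptions made for the example; in practice you would substitute your own compile, test and commit commands.

```python
import subprocess
import sys


def run_steps(steps):
    """Run each (name, command) pair in order; stop at the first failure.

    Returns the list of step names that completed successfully.
    """
    completed = []
    for name, command in steps:
        print(f"==> {name}")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"{name} failed with exit code {result.returncode}")
            break
        completed.append(name)
    return completed


if __name__ == "__main__":
    # Placeholder commands; swap in your own compile, test and commit steps.
    py = sys.executable
    run_steps([
        ("compile", [py, "-c", "print('compiling')"]),
        ("test", [py, "-c", "print('testing')"]),
        ("commit", [py, "-c", "print('committing')"]),
    ])
```

Stopping at the first failure is a deliberate choice: there is little point in committing code whose tests have just failed.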

At this point some will object: what happens when you need to work on another computer? Seriously, how much real work do you do on another computer that you can’t customize as you need? Even if your work does involve switching computers regularly, in this age of fast Internet connections you could very easily have your custom tools (Emacs configs, etc.) stored on a central server and then pulled to a computer as you need them. All my code, including my Elisp code, is stored on a Subversion server and I pull it onto computer lab machines as I need it.

If you’re a UNIX user, there are many other opportunities lying around you, including shell scripts for the command line and init scripts to speed up boot time (or start up programs that you know you’ll need). Once you start programming your work environment, you won’t want to go back. I definitely plan to continue programming my text editor and window manager. I’ll try to keep track of how much easier or more efficient my programmable tools make my work. But I’ll really know that I’ve benefited when I become so used to my custom tools that I barely notice their presence and can devote my full attention to my work.

Quick Tip: if you’re a new Emacs user, google for “effective emacs”

Coder’s block

A good friend of mine, who is currently teaching himself Python, confronted me yesterday with a rather familiar problem. He complained that he was experiencing coder’s block: he didn’t know what he should program. This is a problem that comes up rather frequently, especially for new programmers who are teaching themselves. Learning how to program gives one a newfound sense of power, and it’s understandable that someone who has just discovered this would be itching to try it out. The problem is: how?

No matter how high-level our languages may be or how many libraries we might have at our disposal, programming is still a considerably hard job. Writing large, meaningful programs on your own, especially as a beginning student, is not an easy thing to do. The two answers that I’ve heard the most, on IRC and on various forums, are:

  1. Find a textbook or something equivalent and do the exercises.
  2. Write a program to automate some common task that you perform repeatedly.

Though both of these suggestions have merit, I feel that they are not really the best choice. Let’s look at textbooks and tutorials. Most of the exercises that are presented aren’t really geared toward doing something useful, but rather toward learning some specific concept or feature. These exercises are great if you’re looking to build your skill set (I’ve been going through the SICP exercises in my spare time over the past few months), but they don’t give you the same satisfaction you would get from building something useful.

Writing automation programs seems to be a more popular suggestion, and I think it’s a great one, very in tune with the whole hacker ethos. However, it’s easier said than done in this day and age. If you’re an experienced command-line user, then there are probably a lot of things that you find could use some automation: setting command options, file moves and renames, and so on. But let’s face it, most of today’s beginning programmers are not command-line aficionados and may never be. Automating things programmatically in the world of the point-and-click GUI is not easy, for a number of reasons. Firstly, there’s no general-purpose ‘glue’ language to program the GUI in: nothing analogous to a simple Bash or Perl script. If you’re on Windows (for example), how do you write a simple throwaway program to quickly rename your whole MP3 collection without having a good knowledge of Windows internals? The very fact that there is software out there that costs money to do things like bulk renames shows that it’s not something that an end user can easily automate on their own. Secondly, the graphical paradigm by itself does not easily lend itself to automation. It’s not clear how you would automate something like setting the buttons in a configuration GUI, unless the GUI is just a front end for a config file, in which case you would just bypass the GUI anyway.
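For a sense of scale, here is a rough sketch in Python of the kind of throwaway bulk-rename script meant here. It works on the filesystem directly rather than through any GUI, so it sidesteps the second problem rather than solving it. The naming rule (underscores become spaces, the extension is lowercased) and the function names are assumptions chosen for the example, not a standard.

```python
from pathlib import Path


def tidy_name(filename):
    """Turn 'My_Song_Title.MP3' into 'My Song Title.mp3'."""
    path = Path(filename)
    stem = " ".join(path.stem.replace("_", " ").split())
    return stem + path.suffix.lower()


def rename_mp3s(folder, dry_run=True):
    """Rename every MP3 under folder; return (old, new) path pairs.

    With dry_run=True nothing on disk is changed.
    """
    changes = []
    for old in sorted(Path(folder).rglob("*")):
        if not old.is_file() or old.suffix.lower() != ".mp3":
            continue
        new = old.with_name(tidy_name(old.name))
        if new == old or new.exists():
            continue  # nothing to do, or the target name is already taken
        changes.append((old, new))
        if not dry_run:
            old.rename(new)
    return changes
```

Running it first with the default `dry_run=True` and printing the returned pairs gives a preview of the renames before anything is touched.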

So this still leaves us with our original question: what should a novice programmer pick up as a small project? I would say that instead of looking for something to do, try reading. Read up on some moderately advanced computer science topic that you’re interested in. Then do whatever you need to do to pursue that interest and make your learning experience complete. I’m currently doing an independent study in programming languages, which sprang from the fact that I was bored last semester and wanted a substantial task to pass my time.

The benefit of this is twofold: you’ll get to learn more about an advanced topic and see what you’re interested in (which is always a good thing), and it will also solve our initial problem of finding something interesting and challenging to work on. It might also lead to more things that are interesting and more interesting code to write. In my case, I’m doing a lot of functional programming in Lisp-like languages. Since I wanted to streamline my workflow (and get more experience with Lisp outside academic problems), I took up the task of learning Emacs. Emacs is almost infinitely customizable and programmable, and I currently have a small list of things that I could do but that aren’t essential. Things that I’m putting off for a lazy afternoon.

Perhaps the gist of this post is this: scratch an itch, and lacking one, find something interesting to study. My final recommendation to my friend was to write an emulator for an old processor for which he found the manuals. He’s interested in operating systems and close-to-the-hardware code, and I think this will give him a fun project to keep him busy for a while. There’s sure to be something similar that catches your interest. The computer industry started on the back of hobbyists and people just looking to do something fun. Keep that spirit alive.
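To give a flavor of what an emulator project involves, here is a minimal sketch in Python of the fetch-decode-execute loop at the heart of one. The one-register machine and its five instructions are entirely made up for illustration; a real processor's manual would dictate the actual registers, opcodes and memory layout.

```python
def run(program, max_steps=1000):
    """Execute a list of (opcode, operand) pairs on a toy one-register machine.

    Returns the final accumulator value.
    """
    acc = 0  # the single accumulator register
    pc = 0   # program counter: index of the next instruction
    steps = 0
    while pc < len(program):
        steps += 1
        if steps > max_steps:
            raise RuntimeError("step limit exceeded; possible infinite loop")
        opcode, operand = program[pc]  # fetch
        pc += 1
        if opcode == "LOAD":           # decode and execute
            acc = operand
        elif opcode == "ADD":
            acc += operand
        elif opcode == "SUB":
            acc -= operand
        elif opcode == "JNZ":          # jump to operand if acc is non-zero
            if acc != 0:
                pc = operand
        elif opcode == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return acc
```

A program is just a list of pairs, for example `[("LOAD", 3), ("SUB", 1), ("JNZ", 1), ("HALT", 0)]`, which counts down from 3 to 0. Growing this into an emulator for a real chip means replacing the made-up opcodes with the ones from the manual and adding memory, flags and more registers.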