My confession: I’m a data hoarder

There’s this new show on A&E TV called Hoarders. From the show’s website, each episode “is a fascinating look inside the lives of two different people whose inability to part with their belongings is so out of control that they are on the verge of a personal crisis”. It’s an interesting show about people who, quite simply, have too much stuff. I’ve watched a few episodes; it’s somewhat repetitive, and strangely addictive in the way that only these things can be. Though I never gave the show much thought after I finished watching an episode, a few days ago I had a strange epiphany: I might be a data hoarder.

Here’s the gist of it: I’m afraid of losing data. It’s not that I have a ton of important stuff that I use regularly; in fact, much of what I have on my hard drive (besides my music and pictures) is stuff I will probably never actively use again. What I’m actually afraid of is that someday I’m going to want some file (or some specific version of some file) and I won’t be able to find it. Now, even if I do have the file, I might not find it due to poor organization and data retrieval systems, but that’s a matter for another blog post. What I’m afraid of is pure, simple data loss: I start working on a project of which I have only one copy, and something happens to that copy, whether it’s a hard drive crash or just human error and accidental deletion. And then I have to start all over again, with no real idea of what I did the first time.

Now, thanks to technology, I’ve been able to deal with my hoarding instincts without dozens of different versions littering my hard drive or manual backups every week. At the heart of my system is Git, which keeps everything that’s important to me under strict version control. It also lets me easily keep files in sync between different machines, which is a problem I still haven’t completely solved (especially for public machines). By keeping things in sync between three different machines, I have backups in three completely different (as in physically separate) places.
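For the curious, the syncing part is nothing fancy: each machine is just a Git remote for the others, so “backing up” means pushing everywhere. Here’s a minimal sketch of the kind of script I could run, assuming all my repositories live under a single folder (the ~/repos layout and the remote setup are illustrative, not a requirement):

    import os
    import subprocess

    # Where all my working repositories live (an illustrative path).
    REPO_ROOT = os.path.expanduser("~/repos")

    for name in sorted(os.listdir(REPO_ROOT)):
        repo = os.path.join(REPO_ROOT, name)
        if not os.path.isdir(os.path.join(repo, ".git")):
            continue  # not a Git repository, skip it

        # Ask Git which remotes are configured (the other machines).
        remotes = subprocess.run(
            ["git", "remote"], cwd=repo,
            capture_output=True, text=True, check=True,
        ).stdout.split()

        # Push every branch to every remote.
        for remote in remotes:
            print("pushing %s -> %s" % (name, remote))
            subprocess.run(["git", "push", "--all", remote], cwd=repo, check=True)

Run from cron (or by hand before walking away from a machine), something like this keeps all three copies close to identical.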

The second thing that keeps my data under control is Amazon’s S3, with JungleDisk. Once a week, this ships all my Git repositories, music, pictures and various software installers to Amazon’s massively distributed storage servers for less than $5 a month. The choice was either this, paying as I go, or buying a terabyte hard disk. Personally, I think I made the right choice: my backups are safe and secure in a faraway place, and I didn’t have to shell out a lump sum in one go.

Now all this was fine, but lately I’ve been having this urge to record everything. And I mean everything. There are all my tweets and dents that go out into the ether of cyberspace, which I might someday want to have on record. There are all the websites I visit, and most recently all the music I listen to and the movies I see. In a perfect world, I would have all my tweets saved to a Git repository, and every DVD I watch and every album I listen to would be instantly ripped and placed in cold storage in an Amazon bucket (or on a terabyte disk). And this may not be a good thing: I wouldn’t watch most of those DVDs a second time, and I have no idea why I would want to save my tweets (or ever look them up).
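For what it’s worth, the tweet-hoarding half of that fantasy would be trivial to build. A hypothetical sketch: append each batch of tweets to a log file inside a Git repository and commit, so every snapshot is kept forever. The archive path, file layout and tweet shape here are all made up for illustration; the tweets themselves could come from an API call or an exported file:

    import json
    import subprocess
    from pathlib import Path

    # Hypothetical archive: a Git repository holding one JSON-lines file.
    ARCHIVE = Path.home() / "tweet-archive"
    LOG = ARCHIVE / "tweets.jsonl"

    def archive_tweets(tweets):
        """Append a batch of tweets to the log and commit the snapshot."""
        with LOG.open("a") as f:
            for tweet in tweets:
                f.write(json.dumps(tweet) + "\n")
        subprocess.run(["git", "add", LOG.name], cwd=ARCHIVE, check=True)
        subprocess.run(
            ["git", "commit", "-m", "archive %d tweets" % len(tweets)],
            cwd=ARCHIVE, check=True,
        )

    archive_tweets([{"id": 1, "text": "I might be a data hoarder."}])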

In the past week I’ve been sorely tempted to actually buy a terabyte hard drive and start manually ripping all the DVDs that I watch. I even went so far as to install Handbrake on my Mac Mini. I’ve been trying very hard to override the temptation with logic (and laziness), and so far I’ve been successful. Underneath this is perhaps a more important issue: how much data is enough, and how safe is safe enough? Keeping the data I create completely backed up in multiple places is, I think, perfectly acceptable, but ripping every DVD is borderline obsessive. It would be an interesting project, and might be worth something in terms of geek street cred, but it’s not something I can seriously see myself doing (and it’s possibly illegal too).

So there you have it: I’m a data hoarder, or at least I have data hoarding tendencies. No, I don’t need an intervention yet, and I don’t need treatment. In fact, I think I’m at the point where I’m reliably saving and backing up everything that I create (that’s more than 140 characters) but not randomly saving everything that I come into contact with. Maybe in another place and time I will actually be saving all my movies as well, but that will probably mean actually buying DVDs and keeping a properly organized collection, instead of borrowing them from the library. For the time being, I trust my digital jewels to Git, three computers and Amazon S3.

Amazon S3 as a personal backup service

I’ve been interested in online backups of my data for a long time. At first I started by simply keeping my data on a free FTP server. With the rise of easy-to-use Web 2.0 online storage, I tried a number of solutions, including Box.net. I also made a backup to my Gmail account once, using Gspace. When I tried the web office suites from Zoho and Google, I uploaded most of my documents as well. But time went by, and I never really managed to stick to a single solution. I’ve also made DVD backups of my data occasionally, but never with any regularity. A few months ago, I decided to drop third-party solutions and instead set up a personal Subversion server on an older Mac that I had. This system has been pretty effective at keeping my day-to-day files backed up and synced across multiple machines.

While the home Subversion system does a good job of keeping backups of my most important documents, I don’t put everything in my SVN repository. In particular, my music and my photos live mostly on a single machine (my Mac Mini). I’ve recently started looking for an efficient way to create large-scale backups of all my files at less regular intervals than my SVN commits. I considered using an external hard drive, but even if I backed up all my files, I would be using only a small fraction of the space available. Plus, being a college student, I didn’t want another piece of hardware to lug around. I also wanted a backup that I could access no matter where I was: something online.

I did look at the Web 2.0 solutions, but none of them are cost-effective for the amount of data I plan on backing up. My music weighs in at around 16GB and will keep growing. My photos are about 2GB and will also keep growing. I would also like to back up copies of software discs I have paid for; there aren’t many of them, but they still add up to a few gigabytes. Add to that all my regular documents and other files. All told, I’m looking at something in the range of 25GB for a first-time backup, growing over time. Considering that I’m a starving college student, the cheaper the better.

Enter Amazon Simple Storage Service

Amazon S3 is an industry-standard storage solution that can easily handle many terabytes of storage and bandwidth. What’s also important is that it’s very cheap. For the amount of storage I’ll be using, I’ll be spending 15 cents per GB per month for storage and 10 cents for each gigabyte uploaded. Here’s a rough calculation of what my costs will look like:

Initial Backup: ($0.15/GB-month × 25GB stored) + ($0.10/GB × 25GB uploaded) = $3.75 + $2.50 = $6.25

Monthly Running Cost: ($0.15/GB-month × 25GB stored) + ($0.10/GB × 2GB uploaded) + ($0.17/GB × 1GB downloaded) = $3.75 + $0.20 + $0.17 = $4.12

For the running cost I estimated an upload of 2GB a month and a download of 1GB. Though this is probably more than what I will actually use (especially the download), it should be a fairly good estimate, since the amount stored will gradually go up. So for less than the price of a regular lunch, I’ll be able to keep all my important files safely backed up in a secure online location.
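For anyone who wants to plug in their own numbers, here is the same arithmetic as a few lines of Python (the rates are the ones quoted above, and will of course drift as Amazon changes its pricing):

    # S3 rates quoted above, in dollars.
    STORAGE_RATE = 0.15   # per GB stored, per month
    UPLOAD_RATE = 0.10    # per GB transferred in
    DOWNLOAD_RATE = 0.17  # per GB transferred out

    def monthly_cost(stored_gb, uploaded_gb, downloaded_gb=0.0):
        """Estimated S3 bill for one month, rounded to cents."""
        return round(stored_gb * STORAGE_RATE
                     + uploaded_gb * UPLOAD_RATE
                     + downloaded_gb * DOWNLOAD_RATE, 2)

    print(monthly_cost(25, 25))    # initial backup month: 6.25
    print(monthly_cost(25, 2, 1))  # typical month after that: 4.12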

The Catch

The catch is that Amazon S3 is not meant to be a storage solution for general users. It’s an enterprise-quality system made to plug directly into high-performance online services. As a result, S3 offers a fully functional API to write programs around, but there’s no easy-to-use interface for users to manage their uploads. Luckily, there are a number of third-party tools available that fill the need. Here’s a somewhat outdated list of some available tools. The client that I’ll be using is called JungleDisk. It’s a wonderful cross-platform tool that maps your S3 storage as a drive on your computer. This means that you can use it as you would any storage disk attached to your machine, and you can also run scripts that automatically back up data from other parts of your computer to S3 (a sketch of one follows below). JungleDisk also provides its own automation facilities to back up your data regularly. No more having to remember to back up once a month.
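Because the S3 storage shows up as just another volume, a backup script doesn’t need to know anything about S3 at all; it can simply copy new or changed files onto the mounted drive. A minimal sketch, assuming the drive mounts at /Volumes/JungleDisk (the mount point and folder names are my guesses, check your own setup):

    import filecmp
    import shutil
    from pathlib import Path

    # Illustrative paths: a folder to protect, and the spot where
    # JungleDisk mounts the S3-backed drive on this machine.
    SOURCE = Path.home() / "Pictures"
    DEST = Path("/Volumes/JungleDisk/Pictures")

    def backup(src, dst):
        """Recursively copy files that are new or changed since the last run."""
        dst.mkdir(parents=True, exist_ok=True)
        for item in src.iterdir():
            target = dst / item.name
            if item.is_dir():
                backup(item, target)
            elif not target.exists() or not filecmp.cmp(item, target, shallow=True):
                shutil.copy2(item, target)  # copy2 preserves timestamps

    backup(SOURCE, DEST)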

JungleDisk costs $20.00, but I think that’s an acceptable price, considering that you can install it on as many machines as you like (including Mac, Windows and Linux systems) and you get free lifetime updates, meaning you never pay for anything again. For a dollar a month you can get the JungleDisk Plus service, which lets you access your files via a web interface, resume uploads if they are interrupted, and upload only the changed parts of large files (hence saving upload costs). At this point I don’t think I’ll have a need for Plus, but it’s a good choice if you travel a lot and plan on using S3 as your primary syncing mechanism.

Starting next month…

I’ll be backing up to S3 regularly via JungleDisk. I plan on making the initial transfer over this weekend (while recovering from Halloween parties). Before that I need to get my files organized and decide on what I will back up and what I won’t. I’ll post a follow-up once I’ve been using the service for a while, to see if it really is worth the cost.

CherryPal: Is cloud computing finally here?

Slashdot just linked to an article about the CherryPal, a lightweight computer designed to run applications off the Internet rather than locally installed ones. The CherryPal is light both in terms of size and power: it weighs only 10.5 ounces, uses just 2 watts of power, and is driven by a low-power 400MHz Freescale processor with just 4GB of storage and 256MB of RAM. That’s incredibly low-powered by today’s standards. However, unlike other lightweight machines such as the OLPC or the Eee PC, the CherryPal isn’t meant to be a stripped-down generic computer.

The CherryPal runs an embedded version of Debian Linux, but the operating system is effectively hidden from the user. While there are a few installed applications like OpenOffice and Firefox, along with multimedia support, the bulk of the CherryPal’s applications run online and are augmented by 50GB of online storage. Essentially, the CherryPal is meant to be just a gateway to the cloud. I’ve talked about cloud computing before, and I’m gradually coming to believe that cloud computing is going to be one of the core infrastructures of 21st-century computer technology. While I am a bit skeptical about the CherryPal’s ultra-light specs, I think it is certainly a good concept. The question is, will it all actually work?

Two days ago Infoworld ran an article comparing four cloud computing providers. The article was interesting for showing that while cloud computing is already a feasible option for those who want to try it, there is still no lowest common denominator for the industry as a whole. The four services are quite different, and if you should ever be in the position of moving from one to another, I don’t think the experience would be entirely painless (or even possible, in some cases).

When cloud computing becomes an everyday reality (which won’t be too long, looking at how things are going), it seems like there will be a three-tier structure. The first tier will be the user: you, running a low-cost, low-power computer, perhaps not very different from the CherryPal. It will have just enough power to run a browser with full Javascript and Flash, and maybe a very minimal set of on-site software. You’ll mainly use this machine to reach online services, such as Zoho’s suite for office work, Gmail for email, MP3tunes to store your music, and so on. Of course, all your data will be stored online behind layers of encryption, with multiple backups. Some of these providers will have their own servers with their own back-end infrastructures. But a lot of them, especially smaller, newer startups, will simply be leasing computing power and storage from the even larger clouds provided by companies such as Amazon and Google (S3 and App Engine, for example).

Of course, cloud computing won’t completely replace normal computers. I’d hate to play a 3D FPS over the cloud, and intensive graphics applications will still need powerful on-site hardware for acceptable performance (though Photoshop Express does show promise). There’s also the fact that many people will simply not trust their data to another party. For these people, however, it would be possible to run their own cloud. With the falling price of hardware and storage, you can even today easily put together a powerful home server and connect to it remotely, thus not having to lug your data with you everywhere. In fact, this will be one of my projects for next semester at school.

So is cloud computing finally here? Not quite. The industry is still in a growing stage and needs to become much more mature before it goes mainstream. We will need to see standardization, in the same way that the web has (mostly) standardized around HTML and CSS. Different cloud computing providers will have to work together so that it is possible to pull features from different clouds and change providers if necessary. Hopefully this time around we’ll see a more civilized growth of a powerful new computing paradigm, and not a rerun of the browser wars of the 90s.