MapReduce to the rescue

Searching and sorting are common problems given to computer science students. They are also very interesting problems, which have a number of different approaches some of which are better than others (depending on circumstances). Most things can be searched: integers, strings, complex data objects, pretty much anything that can be compared can be searched and sorted. Searching and sorting string data is especially important since it has wide applications in areas such as natural language processing. So here’s a question: how do you search something that is very large (say thousands of gigabytes) and how do you do it so fast that the person doing the search doesn’t even have time to think about the next query before the results are found?

It would be utterly ludicrous to do this with just a single computer. As most people who have used desktop search know, the process can be frustratingly slow. But even if you add dozens, or hundreds of computers, searching can still be a delicate problem. The question then becomes how do you properly utilize your computing resources? Using the old technique of divide and conquer might be a good idea, splitting up the search among numerous CPUs, having them each do a small part and then combining the results. Google’s MapReduce does just that. Each Google search query requires the search of Google’s huge web index which is many terabytes in size. Google has thousands of processors lined up for doing such a job. MapReduce provides the infrastructure for breaking up a single search over terabytes to thousands of much smaller (and hence, much faster) tasks. The results are then combined and displayed to the user in record time.

MapReduce takes its name from two concepts in functional programming. Map takes a function and a list of inputs and then applies that function to each of the inputs in turn, producing another list with all the results. Reduce works by taking a list of inputs (or a data structure of some sort) and then combining the inputs in a specified manner, returning the final combined value. It’s easy to see how this paradigm can be applied to searching or sorting. Map would search small chunks of the given data (in parallel) and Reduce would then combine the various results back together into a single output for the user.

But it’s not just searching or sorting that can use the map-reduce approach. Whenever you have to apply an operation over and over again to multiple independent inputs and then combine the results into a single unified result, you can use some implementation of map-reduce. Things get difficult if you have dependencies between the inputs, and it’s these dependencies that make parallel programming difficult.

So now that I’ve told you about the wonders of MapReduce, you want to go play. That’s understandable, but you’re probably not in charge of one of Google’s data centers and so don’t have direct access to Google’s powerful MapReduce infrastructure. So what do you do? Well let me introduce Hadoop: Hadoop is an open-source implementation of MapReduce in Java designed to make it easy to break down large scale data processing into multiple small chunks for parallel processing. It implements a distributed file system to spread out data over multiple nodes (with a certain amount of redundancy) and processing data chunks in place before combining them back. It’s currently been run on clusters with up to 2000 nodes. Hadoop might not be as powerful as Google’s MapReduce, (after all Google has deep control over both the hardware and software and can fine tune for maximum performance) but it will get the job done. So what are you waiting for? Go find some reasonable gigantic data processing problem and MapReduce it into insignificance.

Don’t implement a bad algorithm

That’s a lesson I learned yesterday, the hard way. We’re currently learning about tree-like data structures in my computer science class and one of our four semester projects was writing a balancing method for an AVL-tree. Though the concept is very simple and fairly efficient (the balancing is logarithmic in the size of the tree on average), it’s very easy to come up with algorithms that seem to do the right thing, but really don’t. And that’s true of a lot of algorithms that programmers come up with. If that were the only thing wrong it wouldn’t be too much of a problem: simply discard the algorithm and create a new one. Unfortunately, the tendency for most student programmers is to simply implement the first algorithm that comes to mind and then start bug-hunting. Though I try hard to avoid doing this, I let down my guard this time. The result as you can imagine was not pretty.

If the algorithm is in itself a good one, which will give the correct result if applied right, ironing out the implementation details is not particularly hard. But in this case, I did not care to check if the algorithm was right. When trying to balance an AVL tree, there are a number of cases which need to be considered and dealt with. There are not a lot of them, but each of them needs a slightly different procedure to be taken care of properly. If I had started by breaking up the problem into each of the sub-problems and then writing code to deal with each case, the problem would have been simple (I finally did that and it took me less than an hour to get everything working). But instead, I tried to come up with a generic algorithm which could be altered to fit each of the cases. Now the idea of building a generic abstraction which can act in a number of different ways based on input and other conditions is a very powerful idea and can greatly reduce the amount of redundant code in a large system (closures anyone?). However, it should be kept in mind that a number of times, its actually easier to implement a slightly cluttered redundant system than spend time and effort trying to create a conceptually more complex system to reduce structural complexity. I made the mistake of not keeping this mind. And my algorithm did not work.

Things would still have worked out if I realized at that point that it would be easier to just follow the book and code solutions for the cases separately (abstracting only the most obvious parts into generic solutions). But I didn’t realize that. I tried to hammer and mutilate my broken algorithm to react as I wanted to the different cases and I kept piling on more and more code, desperately hoping that somehow it would also start working. I wasted a good 8 hours trying to get my abysmal mess of a balancing method to work right and I ended in utter and abject failure. In many ways, those have been my darkest hours since I first started programming about 5 years ago.

But yes, I do have a happy ending. 6 hours of sleep and a physics class later, I finally decided that enough was enough and that I was going to start from scratch, this time following the book’s recommendations and keeping excess complexity to a bare minimum. I was done in under an hour. So what’s the moral of this story? Spend more time thinking than you do coding and don’t implement an algorithm until you are fairly sure that it works the way you want it to to.

Sure, some bugs won’t be revealed until you actually have some code to compile and test, but there’s no way that you are going to code your way out of a broken algorithm. It’s very alluring to just sit down at a computer and start typing away. It makes you feel very productivity because there’s something happening on screen. In contrast, simply sitting down and working out an algorithm in your head often makes you feel that you are not really doing anything. It’s unfortunate that we call what we do “programming” when much of our time and effort is actually spent program planning (or should be). As the preface to Structure and Implementation of Computer Programs says, programs should be written primarily for people to read and understand and only incidentally for machines to execute. It follows from that if you can’t understand your own program and prove its correctness, you shouldn’t expect a computer to do the same. That’s a lesson I learned the hard way and I hope I never have to relearn it.