Searching and sorting are common problems given to computer science students. They are also very interesting problems, with a number of different approaches, some of which are better than others depending on the circumstances. Most things can be searched: integers, strings, complex data objects. Pretty much anything that can be compared can be searched and sorted. Searching and sorting string data is especially important, since it has wide applications in areas such as natural language processing. So here’s a question: how do you search something that is very large (say, thousands of gigabytes), and how do you do it so fast that the person doing the search doesn’t even have time to think about the next query before the results are found?
It would be utterly ludicrous to do this with just a single computer. As most people who have used desktop search know, the process can be frustratingly slow. But even if you add dozens or hundreds of computers, searching can still be a delicate problem. The question then becomes: how do you properly utilize your computing resources? The old technique of divide and conquer is a good place to start: split the work among numerous CPUs, have each one do a small part, and then combine the results. Google’s MapReduce does just that. Every Google search depends on Google’s huge web index, which is many terabytes in size, and Google has thousands of processors lined up for working on data of that scale. MapReduce provides the infrastructure for breaking up a single job over terabytes of data into thousands of much smaller (and hence much faster) tasks, whose results are then combined into a single output. MapReduce itself is a batch system, so it is better suited to jobs such as building or scanning the index than to answering each query live.
MapReduce takes its name from two concepts in functional programming. Map takes a function and a list of inputs and then applies that function to each of the inputs in turn, producing another list with all the results. Reduce works by taking a list of inputs (or a data structure of some sort) and then combining the inputs in a specified manner, returning the final combined value. It’s easy to see how this paradigm can be applied to searching or sorting. Map would search small chunks of the given data (in parallel) and Reduce would then combine the various results back together into a single output for the user.
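To make the two ideas concrete, here is a minimal sketch in plain Python, using the built-in `map` and `functools.reduce`. The `count_matches` function, the search term and the three text chunks are all made up for illustration; nothing here runs in parallel yet, it only shows the shape of the computation.

```python
from functools import reduce


def count_matches(chunk, term="map"):
    """Map step: count how often `term` appears as a word in one chunk."""
    return sum(1 for word in chunk.split() if word.lower() == term)


# Hypothetical data, already split into small independent chunks.
chunks = [
    "map applies a function to each input",
    "reduce combines the results",
    "Map and reduce come from functional programming",
]

# Map: apply the same function to every chunk, producing a list of results.
per_chunk = list(map(count_matches, chunks))

# Reduce: combine the per-chunk results into one final value.
total = reduce(lambda left, right: left + right, per_chunk, 0)

print(per_chunk, total)
```

Each call to `count_matches` looks at one chunk only, which is exactly what would let the map step be handed out to many machines.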
But it’s not just searching or sorting that can use the map-reduce approach. Whenever you have to apply the same operation over and over to multiple independent inputs and then combine the results into a single unified result, you can use some implementation of map-reduce. Things get harder when the inputs depend on one another, because a task then has to wait for, or coordinate with, other tasks, and it’s these dependencies that make parallel programming difficult.
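As a rough sketch of that pattern outside of searching, here is a word count over a few independent documents. It assumes a thread pool from Python's standard `concurrent.futures` module as a small stand-in for a cluster of machines; the function names and the sample documents are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_word_count(document):
    """Map step: count the words in one document, independently of the rest."""
    return Counter(document.lower().split())


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def word_count(documents, workers=4):
    # The pool stands in for many machines; each document is its own task.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_word_count, documents))
    # Fold all of the partial results into a single unified result.
    return reduce(reduce_counts, partials, Counter())


documents = [
    "the quick brown fox",
    "the lazy dog",
    "the fox jumps over the dog",
]
totals = word_count(documents)
print(totals.most_common(3))
```

This works without any locking because no document's count depends on another's. If counting one document needed the result from a different one, the tasks could no longer be handed out freely.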
So now that I’ve told you about the wonders of MapReduce, you want to go play. That’s understandable, but you’re probably not in charge of one of Google’s data centers and so don’t have direct access to Google’s powerful MapReduce infrastructure. So what do you do? Well, let me introduce Hadoop. Hadoop is an open-source implementation of MapReduce in Java, designed to make it easy to break large-scale data processing down into multiple small chunks for parallel processing. It implements a distributed file system that spreads data out over multiple nodes (with a certain amount of redundancy) and processes the data chunks in place before combining the results. So far it has been run on clusters with up to 2000 nodes. Hadoop might not be as powerful as Google’s MapReduce (after all, Google has deep control over both the hardware and the software and can fine-tune for maximum performance), but it will get the job done. So what are you waiting for? Go find some reasonably gigantic data processing problem and MapReduce it into insignificance.