Meditations on Moore’s Law

As part of my study on parallel programming I’ve been thinking a lot about how processor architectures have evolved and are going to evolve in the not-too-distant future. No discussion of processor architectures is complete without some talk about Moore’s Law. The original formulation of Moore’s Law is:

The complexity for minimum component costs has increased at a rate of roughly a factor of two per year… Certainly over the short term this rate can be expected to continue, if not to increase.

The original statement is a bit more nuanced than the common paraphrase that “the number of transistors on a chip doubles every two years.” Firstly, the original states that the doubling occurs every year. That turned out to be a bit too optimistic: the historic rate has been a doubling roughly every 1.5 to 2 years. But the more subtle (and generally missed) point that Moore made was that what increases is the density at minimum cost per transistor. It’s not just the density of transistors, but rather the density at which the cost per transistor is lowest. We can always put more transistors on a chip, but as we do so the chance that a defect will render the chip unusable increases, and with it the effective cost of each working transistor.

There are a number of corollaries and by-laws that go along with Moore’s Law. For one, the cost per transistor decreases over time, but the manufacturing cost per area of silicon increases as ever more components are crammed onto a chip (along with the materials, energy and supporting technology required to create it). But the consequence that has hit the chip industry hardest in recent years is that the power consumption of chips keeps climbing with each generation as well. And that is something that directly affects the bottom line: no one is going to buy a 6 GHz processor if it needs an industrial-strength cooling solution. Meanwhile, even though transistor densities (and as a result processor speeds) have been progressing steadily over the past few decades, memory speeds simply haven’t kept up. A modern processor can execute instructions far faster than it can pull in the data it needs to operate on.

These two factors, increased power consumption and lagging memory bandwidth, have prompted a significant course change for chip manufacturers. We have the technology to pack a lot of processing power onto a single chip, but we quickly hit diminishing returns if we keep aiming for raw speed. So instead of making the fastest, meanest processors, the industry is turning to leaner, slower, parallel processors. Computing power is still increasing, but it’s increasing in width, not depth. Multi-core CPUs and their cousins, the GPUs, exploit Moore’s Law by bundling multiple processing units (cores) onto a single piece of silicon. Top-of-the-line GPUs boast hundreds of individual cores capable of running thousands of concurrent execution threads. Intel’s newest Core i7-980X Extreme has clock speeds of up to 3.6 GHz, but also boasts 6 cores capable of running 12 threads. Parallel computation is here to stay and it’s only going to keep increasing. Moore’s Law should be good for another decade at least (and maybe more, until we hit the limits imposed by quantum mechanics), and it’s a safe bet that all those extra transistors are going to find their place in more cores.
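To get a feel for what computing in width looks like, here’s a minimal Python sketch (the work function and task sizes are just placeholders I made up) that spreads independent tasks across however many cores the machine has instead of running them one after another:

    import concurrent.futures
    import os

    def work(n):
        # Stand-in for any CPU-bound task; each call is independent of the others.
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        tasks = [10_000_000] * os.cpu_count()
        # One process per core: the speedup comes from width, not a faster clock.
        with concurrent.futures.ProcessPoolExecutor() as pool:
            results = list(pool.map(work, tasks))
        print(len(results), "tasks finished across", os.cpu_count(), "cores")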

Having dozens or hundreds of cores is one thing, but knowing how to use them is quite another. Software makers still don’t really know how to use all the computational width that Moore’s Law will continue to deliver. There are a number of different ideas on how to run programs across multiple cores (including shared-state threads and message passing), but there doesn’t seem to be a consensus on how best to go about it. Then there is the problem that we have billions of lines of serial code that will probably not benefit from multiple cores unless they are rebuilt to exploit parallelism. Anyone who’s tried to re-engineer an existing piece of software knows that it is not an easy task. It’s also expensive, which is not a good state of affairs for a multi-trillion dollar industry.
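To make those two approaches a little more concrete, here’s a small Python sketch that contrasts the shared-state style (threads updating one counter behind a lock) with a message-passing style (workers sending their results over a queue). The counting task is just a toy stand-in:

    import threading
    import queue

    # Shared-state style: several threads mutate one counter, guarded by a lock.
    counter = 0
    lock = threading.Lock()

    def add_shared(n):
        global counter
        for _ in range(n):
            with lock:  # every touch of shared state has to be synchronized
                counter += 1

    # Message-passing style: workers own their data and just send a message.
    def add_message(n, outbox):
        outbox.put(n)

    if __name__ == "__main__":
        outbox = queue.Queue()
        workers = [threading.Thread(target=add_shared, args=(1000,)) for _ in range(4)]
        workers += [threading.Thread(target=add_message, args=(1000, outbox)) for _ in range(4)]
        for w in workers:
            w.start()
        for w in workers:
            w.join()
        total = counter + sum(outbox.get() for _ in range(4))
        print(total)  # 8000 either way; the difference is in who owns the state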

Luckily there are a lot of really smart people working on the matter from a number of different angles. The next few years are going to be an exciting time as operating systems, programming languages, compilers, network technologies and all the people working on them try to answer the question of what to do with all the cheap computing cores that are lying around. The downside is that software won’t run any faster for a while as clock speeds stagnate and we figure out how to work around that. For the next 2 months I’ll continue reading up and thinking about the different ways in which we can keep using Moore’s Law to our benefit. The free lunch may be over, but that doesn’t mean that we shouldn’t eat well.

MapReduce to the rescue

Searching and sorting are common problems given to computer science students. They are also very interesting problems, with a number of different approaches, some better than others depending on the circumstances. Most things can be searched and sorted: integers, strings, complex data objects, pretty much anything that can be compared. Searching and sorting string data is especially important since it has wide applications in areas such as natural language processing. So here’s a question: how do you search something that is very large (say thousands of gigabytes), and how do you do it so fast that the person doing the search doesn’t even have time to think about the next query before the results are found?

It would be utterly ludicrous to do this with just a single computer. As most people who have used desktop search know, the process can be frustratingly slow. But even if you add dozens or hundreds of computers, searching can still be a delicate problem. The question then becomes: how do you properly utilize your computing resources? The old technique of divide and conquer is a good starting point: split the search among numerous CPUs, have each one do a small part and then combine the results. Google’s MapReduce does just that. Each Google search query requires searching Google’s huge web index, which is many terabytes in size. Google has thousands of processors lined up for doing such a job. MapReduce provides the infrastructure for breaking a single search over terabytes into thousands of much smaller (and hence much faster) tasks. The results are then combined and displayed to the user in record time.

MapReduce takes its name from two concepts in functional programming. Map takes a function and a list of inputs and then applies that function to each of the inputs in turn, producing another list with all the results. Reduce works by taking a list of inputs (or a data structure of some sort) and then combining the inputs in a specified manner, returning the final combined value. It’s easy to see how this paradigm can be applied to searching or sorting. Map would search small chunks of the given data (in parallel) and Reduce would then combine the various results back together into a single output for the user.
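Here is roughly what those two building blocks look like in plain Python (the word list is just an illustration):

    from functools import reduce

    words = ["parallel", "programming", "is", "here", "to", "stay"]

    # Map: apply a function to every input independently, producing a new list.
    lengths = list(map(len, words))                   # [8, 11, 2, 4, 2, 4]

    # Reduce: fold the mapped results down into a single combined value.
    total = reduce(lambda a, b: a + b, lengths, 0)    # 31

    print(lengths, total)

Because the map step treats every input independently, nothing stops those calls from running on different cores, or different machines, at the same time.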

But it’s not just searching or sorting that can use the map-reduce approach. Whenever you have to apply an operation over and over again to multiple independent inputs and then combine the results into a single unified result, you can use some implementation of map-reduce. Things get tricky when there are dependencies between the inputs, and it’s exactly those dependencies that make parallel programming difficult.
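As a toy example of the pattern, here’s a Python sketch of a map-reduce style word count: the chunks are independent, so the map step can run on separate cores, and the reduce step merges the partial counts (the text chunks are made up for the example):

    from collections import Counter
    from functools import reduce
    from multiprocessing import Pool

    def count_words(chunk):
        # Map step: each chunk is counted on its own, so it can run on any core.
        return Counter(chunk.split())

    def merge(a, b):
        # Reduce step: fold two partial counts into one.
        a.update(b)
        return a

    if __name__ == "__main__":
        chunks = ["the free lunch is over",
                  "the lunch was never free",
                  "the law is not over yet"]
        with Pool() as pool:
            partial = pool.map(count_words, chunks)   # parallel map
        totals = reduce(merge, partial, Counter())    # serial reduce
        print(totals.most_common(3))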

So now that I’ve told you about the wonders of MapReduce, you want to go play. That’s understandable, but you’re probably not in charge of one of Google’s data centers and so don’t have direct access to Google’s powerful MapReduce infrastructure. So what do you do? Well, let me introduce Hadoop: Hadoop is an open-source implementation of MapReduce in Java designed to make it easy to break large-scale data processing into many small chunks for parallel processing. It implements a distributed file system to spread data over multiple nodes (with a certain amount of redundancy) and processes data chunks in place before combining the results. It has been run on clusters with up to 2000 nodes. Hadoop might not be as powerful as Google’s MapReduce (after all, Google has deep control over both the hardware and the software and can fine-tune for maximum performance), but it will get the job done. So what are you waiting for? Go find some reasonably gigantic data processing problem and MapReduce it into insignificance.
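And you don’t even have to write Java to try it: Hadoop ships with a Streaming mode that runs the map and reduce steps as ordinary scripts reading from standard input and writing tab-separated key/value lines to standard output. Here’s a rough sketch of a word-count mapper and reducer in Python; the file names are my own and the surrounding job configuration is left out:

    # mapper.py: emit one "word<TAB>1" line for every word on standard input.
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    # reducer.py: Hadoop delivers the mapper output sorted by key, so equal
    # words arrive on consecutive lines and can be summed with a running total.
    import sys

    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")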