Archive for the 'Productivity' Category

Blog posts or essays?

Between my work, traveling back home and frequent power cuts, my blogging hasn’t been very regular recently. I haven’t been suffering from any sort of writer’s block; in fact I have a list of about 7-8 topics that I’d like to write about. However there has been one thing that has kept bothering me for quite some time: the size of my postings. I’ve been trying to use this blog as a way to tell the world about the things that I learn and discover as I pursue my career as a computer science student. However many of the things that I deal with daily and which I think about are quite complex and take a long discussion to get everything together. At the same time I would like to be able to post new things every day or at least every alternate day. Often these two things don’t really go together and, being an avid reader myself, I understand that it can be very trying to read something long on a topic like computer science. Hence the question: do I write small compact blog posts on a regular basis, or do I write longer essay-style posts where I can talk at length about the topic?

I’ve been looking at some of my favorite technology oriented bloggers for possible solutions to my dilemma. My favorite blogger is Steve Yegge, who without fail writes long, sometimes rambling, but always interesting essays on an approximately bimonthly basis. While I find his essays thoroughly entertaining, they are a bit too big for something that I would want to write. More importantly, I certainly want to post more regularly than twice a month. Paul Graham’s essays are somewhat shorter, but are also published at a similar frequency. Again, brilliant, but not quite what I’m aiming for.

Perhaps the closest to what I’m trying to achieve would be Jeff Atwood’s Coding Horror. Atwood posts regularly (almost every day) and his posts are of a good length, long enough to make you feel a sense of actually reading something worthwhile, while being short enough that you don’t need to set apart an entire hour to go through them. Though occasionally he does err on the side of excess, they are nowhere near as long as Yegge’s posts.

Of course, length isn’t a separate consideration in itself. It’s closely tied to the content of what I write. Recently I’ve been writing more from a software engineering standpoint, though I would like to write posts of a slightly more theoretical nature (especially since I’m getting increasingly interested in compilers and programming languages). While I’m willing to accept that such topics might require slightly lengthier posts, I really don’t want to turn my posts into mini-theses.

Long blog posts also mean longer time investments on my part, something that is a very important consideration because of the heavy course load I plan on taking. Perhaps the best way for me to decide the issue is to think about how much time I would be willing to invest on a daily basis. On average a blog post right now takes me about 40 minutes to one hour to write. I think that it is a good amount of time for me to spend right now. Considering my typing speed, that translates to about 1000 words, even allowing for looking up pertinent links and confirming information. 1000 words might be pushing it a bit (that’s about one average sized college paper), but much less would probably be too small for me to clearly say everything. 800 to 1000 words seems like a decent size from what I’ve been reading. I think a good idea would be to have a number of sections which are more or less self-contained in terms of content.

I’m going to work on controlling the size and structure of my posts. However at the same time, my primary concern will be content, so if I need to write longer or shorter posts to give a coherent, well paced account of everything I need to say, so be it. I’m sure my readers read other tech blogs, so any comments as to what you prefer would be very much appreciated.

Book Review: Beginning Ubuntu Linux Third Edition

Ubuntu and desktop Linux have come a long way in the past few years. Ubuntu is currently one of the most popular, if not the most popular distro for desktop Linux users. It was my first distro and though I no longer use it, I’ve always acknowledged it to be a well-polished piece of work and I always recommend it to people who are just starting on their personal Linux journey. Like most other things in computers, getting used to a new operating system is made easier if there is a good source of documentation available. Beginning Ubuntu Linux, published by Apress, is a particularly good example of documentation geared towards the new user. I’ve reviewed the previous versions of the book and I find that the books have kept improving, just like Ubuntu itself.

One of the things that makes this book particularly appealing for me is that it starts out with a brief but informative review of the philosophy and history surrounding Linux and Ubuntu. I personally believe (and I think that many other Linux users share this) that there is much more to Linux and open source software than simple technical excellence. It is a way of thinking and acting that I find very appealing and which I wish others to understand. This book does its part in helping new users understand the culture that gave rise to the software that they will soon be using.

The book continues the practice of understanding that most of the people reading it will be Windows users. As a result the chapters dealing with installation also tell users how to properly back up their data and how to smoothen the transition. The guide through the actual installation process is also very in-depth and well written. Partitioning is often the most confusing part of the installation for a new user. I’m glad to see that partitioning has been dealt with very well with all the options in the install process carefully explained and the pros and cons weighed carefully. The chapter dealing with common installation problems is as good as before but now includes information on more than just installation problems. I particularly liked the section on how to deal with resolution and other common graphics problems since these can be very frustrating if not dealt with properly.

Once installation is complete the book goes on to describe with an equal amount of care how to perform various day to day tasks and how to customize your system. The section that deals with Linux equivalents also comes in very handy for new users who just want to be pointed quickly in the right direction. The book goes beyond describing simply the core operating system and the user interfaces. Of particular note are the sections devoted to how to use multimedia systems. You’d be hard-pressed to find a computer user who doesn’t have a substantial collection of various music and video files and this book helps newbies get up and running with minimal effort. This new edition keeps the sections on using OpenOffice and the BASH shell but adds substantial material regarding the new automatic multimedia setup, the 3D graphics effects that have made the Ubuntu desktop much more visually appealing and also on security and encryption. There is also a mini-tutorial on using the GIMP for basic image manipulation which I think shutterbugs will find handy.

The last part of the book is devoted to slightly more advanced topics such as package management, backups and automation and remote access. Personally I feel that package management deserves a more central place, right alongside installation, but the book’s modular structure means that this isn’t much of a problem. Overall the last few chapters act as a springboard from where newbies can start another journey to the level of power user and beyond.

The book as a whole is well laid out and material is clearly separated. The use of sidebars and small tips and warning sections means that a good amount of extra information is presented without interrupting the main flow of the text. There are also lots of links to other information sources where the interested reader can go to for more in-depth information. Most of these are freely available online sources and the full URL is often provided resulting in minimum effort for the reader. For Windows and Mac OS X users there are also pointers to third party tools that might make migration easier. The book is replete with high-quality black-and-white screenshots which add to the complete guide experience that the book provides. The third edition updates everything to be in sync with Ubuntu 8.04 and comes with a double-sided DVD containing a ready-to-install image on one side and various ISO images of the Ubuntu derivatives on the other. In essence the book has everything that a user would need to get up and running with Ubuntu.

The book is at a reasonable price of $39.99 and I think it’s a good investment for anyone looking to jump into the world of Linux. Even though you’ll get the most from this book in the first few weeks after installing Ubuntu for the first time, the later parts of the book will serve as a handy quick reference for those times you find yourself needing to dig under the hood. There is certainly a large amount of information online which means that books of this type are not strictly necessary, but at the same time it can make things a lot easier to have a quick reference close at hand. My litmus test for this sort of product is generally: would I give it to my mom? This time the answer is yes.

Should we make heavy software?

When I showed off Firefox 3’s smart address bar to a friend of mine a few days ago, his first reaction was: “Wow, that’s cool” and half a smile. A second later the smile was gone and he let out a disappointed “Oh but that’s going to take up so many resources”. Let’s leave for the moment that my friend was not the tech-savviest person in the world and that he would probably not be able to list exactly what those “resources” were supposed to be. What his remark got me thinking was simply: “Is it really a bad thing if my software takes up many resources?”

I suppose the core issue here is that computer users don’t really care about resources. Most of my friends wouldn’t be able to tell me the specs of their computers, much less what those specs mean. What they are interested in is, quite simply, speed. They want their software to be zippy and fast; so do I, so does everyone (I hope). When my friend said that Firefox 3 would take up more resources, what he really meant was “My computer will run slower.” He quite innocently equated fewer features with lightweight and hence faster. It’s an honest mistake, but that doesn’t change the fact that it is not really correct. Tweaks and customizations to programs can often increase actual program size and complexity but give better performance. Apple has been doing this with OS X for years now. Many users and benchmarks will vouch for the fact that though OS X has been gaining features, it’s also utilizing computer resources more efficiently.

Of course, it’s undeniable that the general trend is towards programs that use more resources. After all, more resources do mean more maneuvering room, more stuff to build with. And resources keep increasing. Our computers today are vastly more powerful than those available at even the start of the decade. And we’ve been reaping the benefits with more sophisticated software: better visuals, increasingly powerful desktop search and increasingly higher resolution data formats.

At the same time, many people, both programmers and not, are becoming increasingly worried that modern software is bloated and unwieldy. Just as Moore’s Law has been giving us faster processors, our software acts as a Moore’s Law Compensator. Our computers still take a long time to boot up and become usable, and most programs don’t start in the blink of an eye; in essence, somehow the user doesn’t really see all the power that’s tucked under the hood. This has led to a growing trend in lightweight software, especially in the more tech-savvy community.

Among Linux users this dual nature of modern software is very evident. On the one hand Linux systems can now sport powerful 3D window managers and task switchers good enough to rival Vista or Leopard. On the other hand, there has been a wave of new minimalistic window managers lacking any graphic splendor whatsoever.

I feel myself personally affected by this double trend: I myself use a tiling window manager called Awesome. I shun IDEs, preferring to use the very lightweight Vim from a simple terminal. Firefox is just about the only graphical program that I run. At the same time, I love OS X Leopard. I think the UI is quite beautiful and I’ve become an avid user of both Expose and Spaces. Not to mention the fact that somehow fonts seem to look much better on OS X than on any other operating system I’ve ever used. Performance-wise OS X is much better than Vista which takes much longer to start up and respond on a far superior system.

Being a programmer myself, this issue is particularly important because I think about performance almost all the time. Whether it’s deciding whether to use a high-level language like Python, or slug it out with something closer to the metal like C, or which graphics toolkit to use, I have to factor in resource use. I’m still not sure which way is better. Software bloat is bad. Very bad. At the same time it doesn’t make much sense to let all that computing power sit there unused, and the benefits very often outweigh the price to be paid. For the time being, I’m calling it a truce and letting the decider be something less tangible than performance benchmarks: user experience. If you can use heavy resources but deliver a solid user experience, then go for it. An incomplete user experience for the sake of a lighter program might be occasionally justified, but not all the time. However a bad user experience along with heavier system requirements is definitely a bad thing, to be avoided at all costs. After all, the best software is not the one that uses the least RAM or has the prettiest interface; it’s the one that gets the job done without getting in the user’s way.

The role of software engineering in research

As my research project continues our software is gradually growing more and more features, and we’re adding new functions almost every day. While it’s certainly a very good feeling to be adding functionality, it can also get ungainly very quickly. About two weeks ago, Slashdot posted a very interesting query by a person interested in doing a PhD on software design principles. Since then, I’ve been giving more and more thought to the role that industry-style software engineering principles play (or should play) in computer science research.

Research in science has been mostly held to be separate from engineering. Chemistry research is not performed in quite the same way that chemical engineering is. Even in cases where significant amounts of high technology equipment are used (the LHC is a good recent example), the engineering teams are separate from the scientists. However, computer science is not really a typical science at all. Research in computer science faces many of the same problems that software development faces:

  • Multiple people working simultaneously on large and growing codebases (team management).
  • Adding or removing features without requiring major rewrites (extensibility).
  • A way to collect large amounts of data and detect failures (testing and error logging).
  • Having multiple copies of the experiment running, with different parameters (scalability and version control).
  • Last, but not least, the experiment should actually work and produce results (in corporate terms, the software should ship).

Keeping the similarities in mind, what lessons can we as researchers learn from industry techniques and practices (by industry, I mean open source as well)? I’m not sure what other researchers do, or if there is some sort of standard procedure, so I’ll just talk about what my small team has done to make our work easier.

Divide and conquer:

The first thing we did was to carve up the various parts of the project. While my two teammates took up the core engine and the output system, I chose to deal with the human interaction part. At the same time, we had regular meetings where we planned out how our parts should interact as well as making high-level descriptions of what each part of the code should do. This meant we could each develop our code largely independently of the others while still having a fair idea of where the whole project was going.

Throwing away version 1:

Though our first version was a working program, we made it keeping in mind that we would probably have to change a lot of things, possibly even rewrite it. As a result, we developed it quickly, and focused on making it “just work” without adding too many features. This allowed us to get a good feel for what the rest of the project would be like and let us make important decisions before we had invested a lot of time into it. In research you often don’t know what you’re looking for until you start looking, and this was the case for us. Our throwaway version gave us a better idea of which direction our research was heading, what concepts had been explored before, what was new and how difficult various aspects were.

Strict Version control:

Right after we started working on our post-throwaway system, we moved all our code to a Subversion repository on our college’s research cluster. We do our own work in our working copies and perform daily commits and updates. Our repository is also divided into trunk, branch and tag directories to keep separate the most recent copy, older releases and other milestones. This way we can see the evolution of our project at a glance more easily than having to check out older versions. This comes in particularly handy, since being a research project we tend to be looking at what we did before and what results we got rather regularly.

Coding conventions:

Another post-throwaway decision was regarding coding styles. We fixed general naming and commenting styles, decided on tabs vs. spaces (4-space indent, in case you’re interested) as well as a number of minor things that weren’t project-threatening, but could have led to unnecessary confusion later on. We also keep detailed records of how we’ve changed our code and any bugs we’ve introduced/fixed on a common wiki, meaning that everyone is always up-to-date on changes.

Personal freedom:

By standardizing some aspects of our project, we’ve been able to have greater freedom as individual researchers. All our code is in platform-independent Python. Besides that our only other common tool is Subversion, which is also freely available and easy to use. By choosing open source and platform-independence, we have been able to have three developers work on three different platforms using different editors, IDEs and SVN clients. We can work equally well on our own machines sitting in our rooms, on the computers in our CS department’s labs, or even from the Mac Pro computers in the Arts buildings via SSH. It’s hard to quantify how much easier this independence has made our work. This has meant that we didn’t have to rush to the CS labs whenever we had a thought or problem and we could work whenever we wanted, however we wanted, as long as we stuck to some simple rules. Scientists and programmers both need a certain amount of freedom and we’ve been able to secure this freedom by a proper choice of tools and techniques.

Things to do:

It’s been almost four weeks since we started using software engineering practices for our research work. We don’t know how things would have turned out had we not implemented the practices that we have. However we have come to rely heavily on the framework and guidelines that we have built for ourselves. At the same time, we’ve come to realize that there are parts of our project that could benefit from adopting some more industry-standard practices. I’ll end with a brief list of the things we plan to implement in the next few weeks:

Robust error handling: We’re currently using some very ad-hoc methods to handle errors and track down bugs. But we’re in the process of building a flexible, yet robust error handling and reporting system into our project. Since we intend our tool to be used by non-programmers, we need to implement a system that won’t mean looking at the code every time something goes wrong.
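One way such a system might look in Python is sketched below. This is an illustration under assumptions, not the project’s actual code: `UserFacingError`, `run_step`, `parse_count` and the logger name are all hypothetical. The idea is that full tracebacks go to the log for developers, while the person running the tool sees only a short, readable message.

```python
import logging

logger = logging.getLogger("experiment")  # hypothetical logger name


class UserFacingError(Exception):
    """Hypothetical error type whose message is safe to show a non-programmer."""


def run_step(step, *args, **kwargs):
    """Run one step and return (ok, result_or_message) instead of crashing."""
    try:
        return True, step(*args, **kwargs)
    except UserFacingError as exc:
        # Expected problem (bad input, missing file): show the message as-is.
        logger.warning("user error in %s: %s", step.__name__, exc)
        return False, str(exc)
    except Exception:
        # Unexpected bug: keep the full traceback in the log for developers.
        logger.exception("unexpected failure in %s", step.__name__)
        return False, "Something went wrong; details were written to the log."


def parse_count(text):
    """Example step: turn user input into a non-negative whole number."""
    if not text.strip().isdigit():
        raise UserFacingError(f"'{text}' is not a whole number.")
    return int(text)
```

Calling `run_step(parse_count, "abc")` gives back a failure flag and a plain-English message rather than a stack trace, which is the behaviour a non-programmer needs.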

Flexibility via a strong framework: Our work has turned out to have more applications than what we originally intended for it. As a result, we were considering forking our code into two separate branches. Though we still haven’t reached a proper decision regarding this, I’ve personally been trying to redesign our project into a “framework + dynamic modules” system. This is purely a software engineering effort, since it will mean a lot of refactoring and restructuring, but will not have much of an external effect other than making our job easier.
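A “framework + dynamic modules” design can be sketched with a small registry. Everything here is hypothetical and chosen only for illustration (the `OUTPUT_MODULES` dictionary, the `output_module` decorator and the two toy modules); the real design had not been decided at the time of writing. The framework only knows the registry, and each module adds itself under a name.

```python
# Hypothetical plugin registry: the framework looks modules up by name
# and never imports a specific module directly.
OUTPUT_MODULES = {}


def output_module(name):
    """Decorator that registers a function as an output module under `name`."""
    def register(func):
        OUTPUT_MODULES[name] = func
        return func
    return register


@output_module("text")
def as_text(results):
    """Toy module: one 'key: value' line per result."""
    return "\n".join(f"{key}: {value}" for key, value in sorted(results.items()))


@output_module("csv")
def as_csv(results):
    """Toy module: one 'key,value' line per result."""
    return "\n".join(f"{key},{value}" for key, value in sorted(results.items()))


def render(results, kind):
    """Framework entry point: pick the module at run time."""
    try:
        module = OUTPUT_MODULES[kind]
    except KeyError:
        raise ValueError(f"no output module named {kind!r}") from None
    return module(results)
```

Adding a new application then means writing one more decorated function rather than forking the code base.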

Unit testing and other automation: Till now, an important part of our work has been getting user feedback, partly because our system is very visually intensive and also because we weren’t quite sure what we were looking for. Now that we have a better idea of where our project should be headed and what we should be looking for, we can begin to automate some parts of our work. However we won’t just be checking for bugs, but rather generating a large number of results and then further improving our system to get what we want.
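Generating a large number of results automatically usually comes down to running the same experiment over every combination of parameters. A minimal sketch follows, assuming a hypothetical `sweep` helper and a stand-in `toy_experiment` (the real experiment and its parameters are not described here):

```python
import itertools


def sweep(experiment, **param_values):
    """Run `experiment` once for every combination of the given parameter values."""
    names = list(param_values)
    results = []
    for combo in itertools.product(*(param_values[n] for n in names)):
        params = dict(zip(names, combo))
        results.append((params, experiment(**params)))
    return results


def toy_experiment(size, rate):
    """Stand-in for a real run; returns a single number to compare."""
    return size * rate


# Four runs: every pairing of the two sizes with the two rates.
runs = sweep(toy_experiment, size=[10, 20], rate=[0.5, 2])
best_params, best_score = max(runs, key=lambda run: run[1])
```

Each entry in `runs` pairs the parameters with the outcome, so results can be compared or archived later without rerunning anything by hand.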

I’d love to hear from anyone who’s been involved in computer science research projects and has opinions on how to use software engineering principles to make things easier/more efficient.

Use your own software

Also known as “eat your own dog food”, this is the concept behind one of the most successful software engineering projects of modern times: the Windows NT kernel. The Windows NT kernel was written by a highly-talented team led by a man who is arguably one of the best software engineers of all time: Dave Cutler. Dave Cutler was also the lead developer for another groundbreaking operating system: Digital’s VMS. However there was more to this project than talented developers: the whole team actually started using Windows NT every day as soon as possible. This meant that the developers exposed themselves to problems that would be encountered by the average user and could then fix those problems before actually shipping the finished product.

Let’s face it: software is buggy. We still have no clue about how to reliably write bug-free software. So we’re stuck with the situation of writing buggy software and then wrangling the bugs out of it. A lot of bugs are removed through the process of just writing a working piece of software. Automated testing also gets rid of a fair number of bugs. However, there are some things that no amount of debugging or automated testing can get rid of. Modern software systems are large and complicated and it’s hard to tell exactly how all the different parts will interact until you actually start using it all.

Even if you’re certain that your code is relatively bug-free, it’s still important to use software that you’ve written. There are a lot of things about software like look-and-feel, intuitiveness, ease of use, which can’t be determined automatically. The only way to see if your program has an elegant and smooth interface or is powerful, but clunky is to use it repeatedly. When you start using software that you’ve written on a regular basis, you start to think about how your software can be improved, what are the bottlenecks and hard-to-use features, what features are missing and what are unnecessary. This constant evaluation is a key ingredient of making better software.

Unfortunately, most of the software that is being created is made by developers who really don’t know how that piece of software is going to be used in the real world. After all, Photoshop wasn’t made by a team of artists and most users of Microsoft Word have never written a line of working code. So how do programmers go about creating software that they might never use? Enter the beta tester. Beta testers are given pre-release versions of software to evaluate and their suggestions are then folded back into the next release. The best beta testers are the very people for whom you’re writing the software in the first place. If you’re writing a software package for a specific client, then it is essential to have a continuous dialog open. Test versions should routinely be given out to get feedback and then that feedback should be incorporated into the next edition. If your software is for a mass market, then your user community will be your pool of beta testers: encourage them to give feedback and then take those opinions into account. Eating your own dog food is a good idea, but it doesn’t hurt to give it to a bunch of other dogs and see what they think of it.

Throw away version 1

One of the best books on software engineering is Fred Brooks’ The Mythical Man Month. Even though it’s over thirty years old, most of the principles and ideas laid down are still applicable, some even more so now than thirty years ago. One of the ideas that I feel deserves special mention is that of the pilot system.

Like many great ideas, this one is very simple in essence. The first working system created as part of a software project is one that will be thrown away (intentionally or not), in the same way that chemical engineers will first build a pilot plant to test out their process before investing in a full scale plant. The reason for this is that it’s hard to get a grasp for what the real problems of a software system will be before some of it is actually implemented. Keeping this in mind, it’s best to start projects with the idea that version 1 will be thrown away and schedule the project accordingly.

The benefits of starting with a “throw away version 1” philosophy are many and can help the development of any software project, large or small. Knowing that the first working version doesn’t need to be perfect lets the development team quickly create a prototype to explore the different aspects of the project. It allows experiments to try out various solutions to a problem before deciding which one to use. If you’re making software for customers (as most people are) having a working model early in the project’s schedule allows you to get valuable user feedback and make changes accordingly. It’s often said that software users don’t really know what they want, but it makes everyone’s life a little bit easier if there is something tangible that can be pointed to and specific aspects of it praised or criticized.

Of course, knowing that you’re going to throw away the first version doesn’t mean that this version should be made sloppily or inconsistently. The same technical rigor should be applied to the prototype as to the final delivered product. In fact, making version 1 in a half-hearted fashion defeats the entire purpose of having such a version in the first place. Problems in the design of the product will only become apparent if version 1 reflects fairly accurately what the final is supposed to be. Engineering models are made to the same specifications and requirements as what the final should be (albeit on a smaller scale) and there is no reason why software prototypes should be any different.

I’ve had the opportunity to see the benefit of this approach first hand. We recently completed version 1 of our summer research project. Though we had known that the final version would probably be very different, we worked under the assumption that version 1 would be the basis for the later version and that if we screwed up now, we would have to pay for it later. The result is that version 1 really was a proper working system, which works (from the user perspective) in a manner very close to how we want the final to work. Of course, we are still throwing away version 1. Even though it works fine, we’ve seen that there are a number of fundamental flaws to our approach. The ad-hoc parser that I wrote for our configuration language works, but is ugly and inflexible. I’ve already replaced it with a more robust and flexible recursive descent parser. The output mechanism is currently hard-coded into the rest of the system, but it should be pluggable; we now have an idea of what sort of an architecture we want. In fact, the very core of our system will probably have to be replaced or at least changed substantially, because we just realized that what we had written wasn’t really what we had set out to do. All this might sound like a disaster, and it would have been if version 1 wasn’t designed with the idea of eventually throwing it away. We used it as a test bed and learned valuable lessons from it which we’ll now apply to the creation of version 2.
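To make the parser change concrete, here is a small recursive descent parser in Python. The configuration language shown is invented for illustration (statements of the form `name = value;`, where a value is a number, a bare word, or a bracketed list that may nest); the project’s real grammar is not given here. The point is the shape: one method per grammar rule, with nested lists handled by `value` calling itself.

```python
import re

# A token is a run of digits, a word, or any single other character.
TOKEN = re.compile(r"\s*(?:(\d+)|(\w+)|(.))")


def tokenize(text):
    """Split text into ("num", n), ("name", s) and ("sym", c) tokens."""
    tokens = []
    for number, name, symbol in TOKEN.findall(text.strip()):
        if number:
            tokens.append(("num", int(number)))
        elif name:
            tokens.append(("name", name))
        else:
            tokens.append(("sym", symbol))
    return tokens


class Parser:
    def __init__(self, text):
        self.tokens = tokenize(text)
        self.pos = 0

    def peek(self):
        """Return the next token without consuming it."""
        if self.pos < len(self.tokens):
            return self.tokens[self.pos]
        return ("end", None)

    def expect(self, kind, value=None):
        """Consume the next token, or fail if it is not what the grammar needs."""
        token = self.peek()
        if token[0] != kind or (value is not None and token[1] != value):
            raise SyntaxError(f"expected {value or kind}, got {token[1]!r}")
        self.pos += 1
        return token[1]

    def config(self):
        """config := (NAME '=' value ';')*"""
        settings = {}
        while self.peek()[0] != "end":
            name = self.expect("name")
            self.expect("sym", "=")
            settings[name] = self.value()
            self.expect("sym", ";")
        return settings

    def value(self):
        """value := NUMBER | NAME | '[' (value (',' value)*)? ']'"""
        kind, text = self.peek()
        if kind in ("num", "name"):
            self.pos += 1
            return text
        self.expect("sym", "[")
        items = []
        while self.peek() != ("sym", "]"):
            if items:
                self.expect("sym", ",")
            items.append(self.value())  # recursion handles nested lists
        self.expect("sym", "]")
        return items
```

For example, `Parser("size = 10; grid = [1, [2, 3], x];").config()` yields a dictionary with an integer under `size` and a nested list under `grid`. Extending the language means adding a rule and a matching method, which is where this approach is more flexible than an ad-hoc parser.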

Unless your project is something very trivial, it’s best to start with the idea of throwing away version 1 and learning from the lessons that it will inevitably teach. There is another software engineering principle that runs complementary to “throw away version 1” that can be best described as “eat your own dog food”, but that’s something for another post, maybe tomorrow. Till then, have fun with version 1.

How many programming languages should I learn?

OSNews has started a series called A-Z of programming languages where they’ve been posting interviews with the creators of well-known programming languages. Till now they’ve done AWK, Ada and BASH. Of those three the only one I’ve had any experience with is BASH, and not too much of that. But considering that there are literally hundreds of programming languages (and many more dialects or implementations) which ones should one learn to be a good programmer?

I know that many programmers out there simply learn just one or two languages and then use them throughout their careers (or at least until it becomes impossible to find a job). Certainly that works, to some point at least, and so the question is: do we really need to learn languages that are not the “industry standard” (i.e. whatever has the most jobs on offer)? If all you’re interested in is a job, then no, you don’t. One or two languages will probably be enough. However, if you want to keep learning and keep developing as a programmer, then the answer is most certainly yes. Some programming languages are quite similar in terms of syntax and power, but some are very different and teach you to think in different ways. It’s these different languages that are going to make you better as a programmer.

So we come back to our initial question: How many languages to learn and perhaps more importantly which ones? I think that there are two types of languages that are worth learning: those that make you think differently and those that have been used to write a large amount of high quality code. Languages like Smalltalk and LISP and to some extent Java are strongly paradigm-oriented: they emphasize a specific style of programming. In the case of Smalltalk it’s object-oriented and in the case of LISP and its derivatives, it’s functional. Such languages will teach you important lessons which you can apply even when you’re using some other languages.
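As a small illustration of carrying a lesson across languages, here is the same computation written twice in Python: once in a step-by-step imperative style and once in the functional style that a LISP-family language encourages. The word list and variable names are made up for the example.

```python
from functools import reduce

words = ["lisp", "smalltalk", "c", "python"]

# Imperative style: update an accumulator one step at a time.
total = 0
for word in words:
    if len(word) > 1:
        total += len(word)

# Functional style: compose filter, map and reduce, with no mutation.
functional_total = reduce(
    lambda acc, n: acc + n,
    map(len, filter(lambda w: len(w) > 1, words)),
    0,
)
```

Both versions compute the combined length of the words longer than one character; the second simply expresses it as a pipeline of transformations instead of a sequence of updates.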

One of the languages that has been widely embraced by the hacker community is C. There is an incredible amount of really good code written in C, the most famous of which is probably the Linux kernel. Unless you’re a systems programmer, you probably won’t have to use C or C++ much, but you can benefit a lot from reading well written code. C and C++ can be used to write powerful code, but sometimes the power doesn’t quite justify taking the trouble of all the low-level work that you have to do. In that case, it’s good to have a general purpose high-level language lying around. I would strongly recommend Python, but you might find another language easier for everyday use. But once you have made a choice, learn it well and use it to its maximum.

It might be worthwhile learning a language that has powerful text processing abilities, like AWK or Perl. Knowing one or both might make your work easier, and there is an awful lot of code written in Perl for the purpose of gluing larger programs together. However, Perl has been falling from grace for a good few years and many people are now using Python and Ruby to do the same things they used Perl for. I don’t have a concrete opinion of Perl at the moment, but I think it’s something you can put off learning until you have an actual need for it.

You should also learn Java. I personally consider Java to be a decent language, but not a good one and I probably wouldn’t use it if I had a choice. However, it is a very popular one and if you get a job as a programmer, chances are you’ll encounter a substantial amount of Java code which you have to deal with. And you won’t be much of a programmer if you can’t deal with other people’s code. So learn Java.

Knowing basic HTML and CSS is also a good idea. You might not be a full-fledged web artist, but you should be able to throw together a decent web page without much trouble. Considering the growing importance of the web, learning a web programming language is becoming important. It’s not quite a necessity yet, but I think in less than 5 years it will be. I can’t recommend one now, because I have no experience, but I think that Ruby might be a good idea, because it’s a decent general-purpose language as well.

I should say that I don’t know all of the above well, but I have had some experience with each of them. I think each of them has contributed to making me a better programmer, and that I will continue to improve the more I delve into them and use them for harder problems. In conclusion, I would like to say that if you are committed enough to learn multiple languages well, you should also invest some time in learning a powerful text editor such as Vi or Emacs. Though you can certainly write great code with nothing more than Notepad, using a more powerful tool can make your job quite a bit easier (and considerably faster). Once you start fine-tuning these editors to suit your style and habits, you won’t want to use anything else. If you’re seriously out to become the best programmer you can be, you’re going to want the best possible tools at your disposal.

I’ll be happy to hear your comments on what programming languages you might recommend to anyone looking to improve their programming.

MapReduce to the rescue

Searching and sorting are common problems given to computer science students. They are also very interesting problems, which have a number of different approaches some of which are better than others (depending on circumstances). Most things can be searched: integers, strings, complex data objects, pretty much anything that can be compared can be searched and sorted. Searching and sorting string data is especially important since it has wide applications in areas such as natural language processing. So here’s a question: how do you search something that is very large (say thousands of gigabytes) and how do you do it so fast that the person doing the search doesn’t even have time to think about the next query before the results are found?

It would be utterly ludicrous to do this with just a single computer. As most people who have used desktop search know, the process can be frustratingly slow. But even if you add dozens or hundreds of computers, searching can still be a delicate problem. The question then becomes: how do you properly utilize your computing resources? The old technique of divide and conquer is a good place to start: split the work among numerous CPUs, have each of them do a small part and then combine the results. Google’s MapReduce does just that. Google’s web index is many terabytes in size, and Google has thousands of processors lined up to work on it. MapReduce provides the infrastructure for breaking up a single job over terabytes of data into thousands of much smaller (and hence much faster) tasks, whose results are then combined. Strictly speaking, MapReduce is a batch system, used for jobs such as building the index rather than for answering each query as it arrives, but the same divide-and-conquer idea is what allows results to reach the user in record time.

MapReduce takes its name from two concepts in functional programming. Map takes a function and a list of inputs and then applies that function to each of the inputs in turn, producing another list with all the results. Reduce works by taking a list of inputs (or a data structure of some sort) and then combining the inputs in a specified manner, returning the final combined value. It’s easy to see how this paradigm can be applied to searching or sorting. Map would search small chunks of the given data (in parallel) and Reduce would then combine the various results back together into a single output for the user.
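To make the two halves concrete, here is a minimal sketch in Python of a search written as a map followed by a reduce. It runs on one machine and has nothing to do with Google’s actual implementation; the function names (`search_chunk`, `merge_hits`, `search`) and the sample data are my own, made up for illustration.

```python
from functools import reduce

def search_chunk(chunk, term):
    """Map step: return the lines in one chunk that contain the term."""
    return [line for line in chunk if term in line]

def merge_hits(left, right):
    """Reduce step: combine two partial result lists into one."""
    return left + right

def search(chunks, term):
    # map applies search_chunk to every chunk independently...
    partial_results = map(lambda chunk: search_chunk(chunk, term), chunks)
    # ...and reduce folds the partial results into a single list.
    return reduce(merge_hits, partial_results, [])

# Hypothetical data: each inner list stands for one chunk of a larger file.
chunks = [
    ["the quick brown fox", "jumps over"],
    ["the lazy dog", "sleeps all day"],
    ["a fox in the henhouse"],
]
print(search(chunks, "fox"))  # ['the quick brown fox', 'a fox in the henhouse']
```

Because `search_chunk` never looks outside its own chunk, each call could just as well run on a different machine; only the reduce step needs to see more than one result.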

But it’s not just searching or sorting that can use the map-reduce approach. Whenever you have to apply an operation over and over again to multiple independent inputs and then combine the results into a single unified result, you can use some implementation of map-reduce. Things get difficult if you have dependencies between the inputs, and it’s these dependencies that make parallel programming difficult.
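As an example that is not a search, here is a rough sketch of a word count over several independent documents. A thread pool from Python’s standard library stands in for the cluster, so this only shows the shape of the approach, not real large-scale parallelism. The names `count_words`, `merge_counts` and `parallel_word_count` are assumptions of mine.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

def count_words(text):
    """Map step: count the words in one independent piece of text."""
    return Counter(text.lower().split())

def merge_counts(total, partial):
    """Reduce step: add one partial count into the running total."""
    total.update(partial)  # Counter.update adds counts rather than replacing them
    return total

def parallel_word_count(documents, workers=4):
    # Each document is handled on its own; no task depends on another.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_words, documents))
    # Only this final fold needs to see every partial result.
    return reduce(merge_counts, partial_counts, Counter())

documents = ["the cat sat", "the dog sat", "the end"]
totals = parallel_word_count(documents)
print(totals["the"], totals["sat"], totals["end"])  # 3 2 1
```

The map step works here only because counting one document tells you nothing you need for counting another. If the count for one document depended on the result for its neighbor, the tasks could no longer be handed out freely.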

So now that I’ve told you about the wonders of MapReduce, you want to go play. That’s understandable, but you’re probably not in charge of one of Google’s data centers and so don’t have direct access to Google’s powerful MapReduce infrastructure. So what do you do? Well, let me introduce Hadoop. Hadoop is an open-source implementation of MapReduce in Java, designed to make it easy to break down large-scale data processing into multiple small chunks for parallel processing. It implements a distributed file system to spread out data over multiple nodes (with a certain amount of redundancy) and processes data chunks in place before combining the results. It has been run on clusters with up to 2000 nodes. Hadoop might not be as powerful as Google’s MapReduce (after all, Google has deep control over both the hardware and software and can fine-tune for maximum performance), but it will get the job done. So what are you waiting for? Go find some reasonably gigantic data processing problem and MapReduce it into insignificance.

Design for unit-testing

I’ve written before about the role of testing in programming and as I’ve written more code (and unit tests) over the past few weeks, my conviction that unit-testing is useful for more than just determining program correctness has become even stronger. In my previous post I spent about a paragraph exploring the other benefits of testing; I think it’s time that I offered a more detailed view of the alternate advantages of unit-testing.

As I started working on my last project for the semester yesterday, one of the things at the top of my mind was that the professor would be running his own unit tests on our programs. Our other projects had clear descriptions of what methods we were to create and how they should behave. But this project, a simple word counter written in Java, was different in that we were only told how the program as a whole should respond. We could shape the internals as we wished. Though it was very tempting to create a monolithic system (especially since the problem was rather simple), I decided that I should design with automated unit-testing in mind. Rather than making the design complicated, this choice actually made things clearer.

It was obvious that the UI would have to be completely separate from the actual work portion of the code. Not only that, but the values returned from the methods doing the work would have to be in a form that could easily be compared to expected results. This meant that while the UI would deal with output formatting, there should not be very much formatting required in the first place. This guided me in the choice of the data structures that would eventually be returned.
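A small sketch of what that separation can look like follows. It is in Python rather than the Java of the actual project, and it is not the code I handed in; `word_frequencies` and `format_report` are hypothetical names chosen for the example.

```python
from collections import Counter

def word_frequencies(text):
    """Work portion: return plain data (a dict) that is easy to compare in a test."""
    words = [w.strip(".,;:!?\"'").lower() for w in text.split()]
    return dict(Counter(w for w in words if w))

def format_report(frequencies):
    """UI portion: turn the data into display text; no counting happens here."""
    lines = [f"{word}: {count}" for word, count in sorted(frequencies.items())]
    return "\n".join(lines)

# A unit test only needs to compare data, not parse formatted output.
assert word_frequencies("The cat. The dog!") == {"the": 2, "cat": 1, "dog": 1}

print(format_report(word_frequencies("The cat. The dog!")))
```

Returning a dictionary keeps the test to a single equality check. Had the counting function printed its own report, the test would have to pick the numbers back out of a string.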

Another area in which designing with testing in mind is useful is in determining whether methods should be private. JUnit 3.8 does not support direct testing of private methods. There is a certain amount of debate regarding whether or not this is a good thing, but this restriction does force a design methodology that can be beneficial. The only methods that should be private are the ones that need not be tested separately, that is, if the calling method passes testing, it automatically means that the private method is also working correctly. Though I didn’t need any private methods in this particular project, I did before, and keeping the limitations of private testing in mind resulted in what I believe to be cleaner, more readable code.
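Here is a rough illustration of that rule, again sketched in Python with its built-in `unittest` module rather than JUnit. Python has no true private methods, so the leading underscore on `_normalize` is only a convention standing in for Java’s `private`; the class and test names are invented for this example.

```python
import unittest

class WordCounter:
    def count(self, text):
        """Public method: the only entry point the tests exercise."""
        counts = {}
        for raw in text.split():
            word = self._normalize(raw)
            if word:
                counts[word] = counts.get(word, 0) + 1
        return counts

    def _normalize(self, raw):
        # "Private" by convention only; Java would declare this `private`.
        return raw.strip(".,;:!?").lower()

class WordCounterTest(unittest.TestCase):
    def test_count_ignores_case_and_punctuation(self):
        # If this passes, _normalize must be doing its job as well.
        self.assertEqual(WordCounter().count("Fox, fox. FOX!"), {"fox": 3})

    def test_count_skips_pure_punctuation(self):
        self.assertEqual(WordCounter().count("... fox"), {"fox": 1})

if __name__ == "__main__":
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(WordCounterTest)
    unittest.TextTestRunner().run(suite)
```

Neither test calls `_normalize` directly. If the helper ever grew complicated enough that these tests no longer gave me confidence in it, that would be a sign it deserves to be public and tested on its own.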

Of course, it is easy to go overboard with unit testing. Your program shouldn’t be endlessly subdivided into lots of tiny functions just so that you can test every tiny chunk of code. Unit testing can help you design a cleaner, simpler program, but only if your design is pragmatic to start with. Like I’ve said before, no amount of coding tricks and development methodologies will fix a fundamentally broken design.

Jeff Atwood of Coding Horror said some time ago that unit tests should be first-class language constructs. At the time I thought that this was a bit overboard, but I’m coming to realize that this might be a good thing. Any language which has a decent error-handling mechanism would be able to bake in support for unit testing, and having a unit-test mechanism directly in the language would probably encourage students to use it (and more importantly teachers to teach it). In my CS course we’ve been using unit tests from early on, which I feel was a very good decision on the part of the professor (even though many of my fellow students don’t quite seem to grasp its full importance right now).

Designing for unit testing encapsulates a lot of the ideas that are a part of good software design: UI separation, abstraction, code reuse and readability. Unit-testing is also a perfect example of abstraction: just worry about the big picture and the details will take care of themselves. So the next time you find yourself dealing with a big project, design for unit-testing and chances are you’ll be making better code than if you weren’t. Keep in mind though that no amount of unit-testing will replace actual user-testing…so make sure you get around to that at some point as well (hopefully as soon as you have something a user can actually use).

Don’t implement a bad algorithm

That’s a lesson I learned yesterday, the hard way. We’re currently learning about tree-like data structures in my computer science class and one of our four semester projects was writing a balancing method for an AVL tree. Though the concept is very simple and fairly efficient (an insertion, including rebalancing, takes time logarithmic in the size of the tree), it’s very easy to come up with algorithms that seem to do the right thing, but really don’t. And that’s true of a lot of algorithms that programmers come up with. If that were the only thing wrong it wouldn’t be too much of a problem: simply discard the algorithm and create a new one. Unfortunately, the tendency for most student programmers is to simply implement the first algorithm that comes to mind and then start bug-hunting. Though I try hard to avoid doing this, I let down my guard this time. The result, as you can imagine, was not pretty.

If the algorithm is in itself a good one, which will give the correct result if applied right, ironing out the implementation details is not particularly hard. But in this case, I did not care to check if the algorithm was right. When trying to balance an AVL tree, there are a number of cases which need to be considered and dealt with. There are not a lot of them, but each of them needs a slightly different procedure to be taken care of properly. If I had started by breaking up the problem into each of the sub-problems and then writing code to deal with each case, the problem would have been simple (I finally did that and it took me less than an hour to get everything working). But instead, I tried to come up with a generic algorithm which could be altered to fit each of the cases. Now the idea of building a generic abstraction which can act in a number of different ways based on input and other conditions is a very powerful idea and can greatly reduce the amount of redundant code in a large system (closures anyone?). However, it should be kept in mind that a number of times, it’s actually easier to implement a slightly cluttered redundant system than to spend time and effort trying to create a conceptually more complex system to reduce structural complexity. I made the mistake of not keeping this in mind. And my algorithm did not work.
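For reference, here is a minimal sketch of the case-by-case approach, written in Python. It is the standard textbook treatment of the four imbalance cases, not the code from my project, and all the names (`Node`, `rebalance`, `rotate_left` and so on) are my own choices for the example.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.height = 1  # a single node has height 1

def height(node):
    return node.height if node else 0

def update(node):
    node.height = 1 + max(height(node.left), height(node.right))

def balance_factor(node):
    return height(node.left) - height(node.right)

def rotate_right(y):
    x = y.left
    y.left = x.right
    x.right = y
    update(y)  # y is now the child, so fix its height first
    update(x)
    return x

def rotate_left(x):
    y = x.right
    x.right = y.left
    y.left = x
    update(x)
    update(y)
    return y

def rebalance(node):
    """Handle each of the four imbalance cases separately."""
    update(node)
    bf = balance_factor(node)
    if bf > 1:
        if balance_factor(node.left) >= 0:       # left-left case
            return rotate_right(node)
        node.left = rotate_left(node.left)       # left-right case
        return rotate_right(node)
    if bf < -1:
        if balance_factor(node.right) <= 0:      # right-right case
            return rotate_left(node)
        node.right = rotate_right(node.right)    # right-left case
        return rotate_left(node)
    return node                                  # already balanced

def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    elif key > node.key:
        node.right = insert(node.right, key)
    return rebalance(node)  # duplicates fall through unchanged

root = None
for key in range(1, 8):
    root = insert(root, key)
print(root.key, root.height)  # prints "4 3"
```

The four branches in `rebalance` are slightly repetitive, and that is the point: each one can be read and checked against the book on its own, with only the two rotations factored out as shared pieces.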

Things would still have worked out if I realized at that point that it would be easier to just follow the book and code solutions for the cases separately (abstracting only the most obvious parts into generic solutions). But I didn’t realize that. I tried to hammer and mutilate my broken algorithm to react as I wanted to the different cases and I kept piling on more and more code, desperately hoping that somehow it would also start working. I wasted a good 8 hours trying to get my abysmal mess of a balancing method to work right and I ended in utter and abject failure. In many ways, those have been my darkest hours since I first started programming about 5 years ago.

But yes, I do have a happy ending. 6 hours of sleep and a physics class later, I finally decided that enough was enough and that I was going to start from scratch, this time following the book’s recommendations and keeping excess complexity to a bare minimum. I was done in under an hour. So what’s the moral of this story? Spend more time thinking than you do coding and don’t implement an algorithm until you are fairly sure that it works the way you want it to.

Sure, some bugs won’t be revealed until you actually have some code to compile and test, but there’s no way that you are going to code your way out of a broken algorithm. It’s very alluring to just sit down at a computer and start typing away. It makes you feel very productive because there’s something happening on screen. In contrast, simply sitting down and working out an algorithm in your head often makes you feel that you are not really doing anything. It’s unfortunate that we call what we do “programming” when much of our time and effort is actually spent program planning (or should be). As the preface to Structure and Interpretation of Computer Programs says, programs should be written primarily for people to read and understand and only incidentally for machines to execute. It follows that if you can’t understand your own program and convince yourself of its correctness, you shouldn’t expect a computer to do it for you. That’s a lesson I learned the hard way and I hope I never have to relearn it.
