Archive for the 'Programming' Category

Should we make heavy software

When I showed off Firefox 3’s smart address bar to a friend of mine a few days ago, his first reaction was: “Wow, that’s cool” and half a smile. A second later the smile was gone and he let out a disappointed “Oh but that’s going to take up so many resources”. Let’s leave for the moment that my friend was not the tech-savviest person in the world and that he would probably not be able to list exactly what those “resources” were supposed to be. What his remark got me thinking was simply: “Is it really a bad thing if my software takes up many resources?”

I suppose the core issue here is that computer users don’t really care about resources. Most of my friends wouldn’t be able to tell me the specs of their computers and much less what those specs meant. What they are interested in is quite simply, speed. They want their software to be zippy and fast, so do I, so does everyone (I hope). When my friend said that Firefox 3 would take up more resources, what he really meant was “My computer will run slower.” He quite innocently equated less features to mean lightweight and hence faster. It’s an honest mistake, but that doesn’t change the fact that it is not really correct. Tweaks and customizations to programs can often increase actual program size and complexity but give better performance. Apple has been doing with OS X for years now. Many users and benchmarks will vouch for the fact that though OS X has been gaining features, it’s also utilizing computer resources more efficiently.

Of course, it’s undeniable that the general trend is towards programs that use more resources. After all, more resources does mean more maneuvering room, more stuff to build with. And resources to keep increasing. Our computers today are vastly more powerful than those available at even the start of the decade. And we’ve been reaping the benefits with more sophisticated software: better visuals, increasingly powerful desktop search and increasing higher resolution data formats.

At the same time, many people both programmers and not, are becoming increasingly worried that modern software is bloated an unwieldy. Just as Moore’s Law has been giving us faster processor, our software acts as a Moore’s Law Compensator. Our computers still take a long time to boot up and become usable, most programs don’t start in the blink of an eye, in essence, somehow the user doesn’t really see all the power that’s tucked under the hood. This has led to a growing trend in lightweight software, especially in the more tech-savvy community.

Among Linux users this dual nature of modern software is very evident. On the one hand Linux systems can now sport powerful 3D window managers and task switchers good enough to rival Vista or Leopard. On the other hand, there have been a wave of new minimalistic window managers lacking ant graphic splendor whatsoever.

I feel myself personally affected by this double trend: I myself use a tiling window manager called Awesome. I shun IDEs, preferring to use the very lightweight Vim from a simple terminal. Firefox is just about the only graphical program that I run. At the same time, I love OS X Leopard. I think the UI is quite beautiful and I’ve become an avid user of both Expose and Spaces. Not to mention the fact that somehow fonts seem to look much better on OS X than on any other operating system I’ve ever used. Performance-wise OS X is much better than Vista which takes much longer to start up and respond on a far superior system.

Being a programmer myself, this issue is particularly important because I think about performance almost all the time. Whether it’s deciding whether to use a high-level language like Python, or lug it out with something closer to the metal like C, or which graphics toolkit to use, I have to factor in resource-use. I’m still not sure about which way is better. Software bloat is bad. Very bad. At the same it doesn’t make much sense to let all that computing power sit there unused and the benefits very often outweigh the price to be paid. For the time being, I’m calling it a truce and letting the decider be something less tangible than performance benchmarks: user experience. If you can use heavy resources but deliver a solid user experience, then go for it. An incomplete user experience for the sake of a lighter program might be occasionally justified, but not all the time. However a bad user experience along with heavier system requirements is definitely a bad thing, to be avoided at all costs. After the best software is not the one that uses least RAM or has the prettiest interface, it’s the one that gets the job done without getting in the users way.

The role of software engineering in research

As my research project continues our software is gradually growing more and more features, and we’re adding new functions almost everyday. While it’s certainly a very good feeling to be adding functionality, it can also get ungainly very quickly. About two weeks ago, Slashdot posted a very interesting query by a person interested in doing a PhD on software design principles. Since then, I’ve been giving more and more thought to the role that industry-style software engineering principles play (or should play) in computer science research.

Research in science has been mostly held to be separate from engineering. Chemistry research is not performed in quite the same way that chemical engineering is. Even in cases where there is significant amounts of high technology equipment is used (the LHC is a good recent example), the engineering teams are separate from the scientists. However, computer science is not really a typical science at all. Research in computer science faces many of the same problems that software development faces:

  • Multiple people working simultaneously on large and growing codebases (team management).
  • Adding or removing features without requiring major rewrites (extensibility).
  • A way to collect large amounts of data and detect failures (testing and error logging)
  • Having multiple copies of the experiment running, with different parameters (scalability and version control).
  • Last, but not least, the experiment should actually work and produce results (in corporate terms, the software should ship).

Keeping the similarities in mind, what lessons can we as researchers learn from industry techniques and practices (by industry, I mean open source as well)? I’m not sure what other researchers do, or if there is some sort of standard procedure, so I’ll just talk about what my small team has done to make our work easier.

Divide and conquer:

The first thing we did was to carve up the various parts of the project. While my two teammates took up the core engine and the output system, I chose to deal with the human interaction part. At the same time. we had regular meetings where we planned out how our parts should interact as well making high-level descriptions of what each part of the code should do. This meant we could each develop our code largely independent of the others while still having a fair idea of where the whole project was going.

Throwing away version 1:

Though our first version was a working program, we made it keeping in mind that we would probably have to change a lot of things, possible even rewrite it. As a result, we developed it quickly, and focused on making it “just work” without adding too many features. This allowed to get a good feel for what the rest of the project would be like and let us make important decisions before we had invested a lot of time into it. In research you often don’t know what you’re looking for until you start looking, and this was the case for us. Our throwaway version gave us a better idea of which direction our research was heading, what concepts had been explored before, what was new and how difficult various aspects were.

Strict Version control:

Right after we started working on our post-throwaway system, we moved all our code to a Subversion repository on our college’s research cluster. We do are own work in our working commits and perform daily commits and updaes. Our repository is also divided into trunk, branch and tag directories to keep separate the most recent copy, older releases and other milestones. This way we can see the evolution of our project at a glance more easily than having to check out older versions. This comes in particularly handy, since being a research project we tend to be looking at what we did before and what results we got rather regularly.

Coding conventions:

Another post-throwaway decision was regarding coding styles. We fixed general naming and commenting styles, decided on tabs vs. spaces (4-space indent, in case you’re interested) as well as a number of minor things that weren’t project-threatening, but could have led to unnessecary confusion later on. We also keep detailed records of how we’ve changed our codes and any bugs we’ve introduced/fixed on a common wiki meaning that everyone is always up-to-date on changes.

Personal freedom:

By standardizing some aspects of our project, we’ve been able to have greater freedom as individual researchers. All our code is in platform-independent Python. Besides that our only other common tool is Subversion, which is also freely available and easy to use. By choosing open source and platform-independence, we have been able to have three developers work on three different platforms using different editors, IDEs and SVN clients. We can work equally well on our own machines sitting in our rooms, on the computers in our CS department’s labs, or even from the Mac Pro computers in the Arts buildings via SSH. It’s hard to quantify how much easier this independence has made our work. This has meant that we didn’t have to rush to the CS Lab’s whenever we had a thought or problem and we could whenever we wanted, however we wanted as long as we stuck to some simple rules. Scientists and programmers both need a certain amount of freedom and we’ve been able to secure this freedom by a proper choice of tools and techniques.

Things to do:

It’s been almost four weeks since we started using software engineering practices for our research work. We don’t know how things would have turned out had we not implemented the practices that we have. However we have come to rely heavily on the framework and guidelines that we have built for ourselves. At the same time, we’ve come to realize that there are parts of our project that could benefit from adopting some more industry-standard practices. I’ll end with a brief list of the things we plan to implement in the next few weeks:

Robust error handling: We’re currently using some very ad-hoc methods to handle errors and track down bugs. But we’re in the process of building a flexible, yet robust error handling and reporting system into our project. Since we intend our tool to be used by non-programmers, we need to implement a system that won’t mean looking at the code everytime something goes wrong.

Flexibility via a strong framework: Our work has turned out to have more applications than what we originally intended for it. As a result, we were considering forking our code into two separate branches. Though we still haven’t reached a proper decision regarding this, I’ve personally been trying to redesign our project into a “framework + dynamic modules” system. This is purely a software engineering effort, since it will mean a lot of refactoring and restructuring, but will not have much of an external effect rather than making our job easier.

Unit testing and other automation: Till now, an important part of our work has been getting user feedback, partly because our system is very visual intensive and also because we weren’t quite sure what we were looking for. Now that we have a better idea of where our project should be headed and what we should be looking for, we can begin to automate some parts of our work. However we won’t just be checking for bugs, but rather generating a large number of results and then further improve our system to get what we want.

I’d love to hear from anyone who’s been involved in computer science research projects and has opinions of how to use software engineering principles to make things easier/more effiicient.

Cocoa, Python and the quest for platform independence

For the past few days, I’ve been looking into Apple’s Cocoa infrastructure in some detail. The reason once again stems from my research projects. All our code is in platform-independent Python meaning that we can develop on any of the 3 popular operating systems and then run on any others. This has been a great advantage since the three members of our team each run different OSes. For most of the time our project was purely command-line and text input based, using graphics only for the final display. However we have now come to the point where we need to fork our system into two different systems: one for architects and one for artists. How each program will develop is something that we will work on over the next few weeks, but one thing is clear: for the artists at least, a purely textual interface will simply not do.

Till now our system had been accepting an instruction file as input. While this method of using a instruction was very powerful (we could essentially embed our own programming language), this not the direction we want to go with the artists. And besides, not every artist wants to sit down and right pages of instruction to draw a picture, we might as well ask them to write the program themselves. It would be best to create a GUI for them to use. Even before this had been clear, I had written a simple interface that was essentially a text editor containing a hook to our Python backend. It was written in Tk, was exceptionally ugly, but got the job done.

My Tk Interface running under Leopard

Now I think the Tk toolkit is capable of letting me create the entire GUI interface that I have in mind and it will be cross platform as well, but there’s one problem: it’s ugly, especially on the Mac. I think this is because the Python Tkinter version doesn’t use Aqua (like Tcl/Tk), but rather uses X11. Though I spend a lot of my time in Vim and at the command line, I’m still a fan of good looking user interfaces (see my last post). As a result I would feel rather guilty about making something that I know would look bad on the platform it would be used most. So like a good programmer off I went to find a solution.

I thought I would be able to use Cocoa to create a front end, especially since Xcode now has support for the PyObjC bridge. Unfortunately this has turned out to be much harder than it at first appeared. Cocoa is Apple’s very powerful object oriented API. It is almost everything you will ever need to create a Mac application. While it allows to create native applications by leveraging the powerful infrastructure Apple provides you, it also doesn’t allow you to perform the sort of rapid duct-tape styling programming that Python and UNIX encourages. In Python it’s very easy to write your core application in Python with any number of GUI front-ends that are easily interchangeable. Cocoa, on the other hand, is not another GUI toolkit that you can simply add to the mix. If you want to use, you’ll have to invest a certain amount of time in learning your way around.

Python can’t communicate with Cocoa directly. To  have your Python programs leverage Cocoa, you need to go through the PyObjC bridge. Though the bridge is certainly effective and I could use it, I’ve been having troubles finding decent documentation. Apple’s own tutorial is the sort of document I’m looking for, but it is dangerously out of date. Lacking a proper tutorial, I would have to figure out a lot of things on my own. Once again, a large time investment.

Considering that I have only two weeks left to make real progress on my project and that learning Cocoa or Objective-C was not my original goal, I’ve decided not to pursue that avenue at the moment. I would certainly love to learn Cocoa at some point in the future and PyObjC seems like it could become a very powerful tool. But I simply don’t have the time at the moment. I am going to start working on a proper GUI interface in Tkinter. Perhaps sometime later this year, I’ll get back to some Cocoa.

Back to the drawing board

I’ve just returned from meeting one of our clients for my summer projects: a professor in my college’s art department. What we have developed over the last three weeks is essentially a way to stitch together fixed images into probability-controlled tilings. However, it turns out that what our artist friends really need is more fine grained image control. They want to be able to create much more interesting patterns combining both images and geometric patterns. As I was explaining how our program worked, I realized that our instruction language model was rather complicated and not really suited for what the artists needed. So it seems, that our program is really not a very good match to what the artists need.

At this point, there are a number of things we could. We could go back to the drawing board, start from scratch and rewrite our system according to what the artists need. We could also try to build in top of our existing program and try to add and modify it to bring its functionality close to what we need. However, we need to keep in mind, that we had also planned to use this system to create a automated urban planner. We still haven’t met with the architect yet, but it seems like our system might be a better fit for what she needs. But that doesn’t answer the question: what do we do?

Now if this were a real software project with a delivery deadline, this state of affairs would probably be a very bad thing. Luckily this is still very much a research project so there’s quite a lot of leeway. Unfortunately, we’re still not quite sure of what exactly we have to do. It seems that our general architecture is still valid and will work, however the details will have to be remade. We’ll need a new way to generate the patterns from the instructions, simply stitching together images won’t do. It is possible that we will need multiple drawing systems working on top and parallel to each other. The instruction language also needs to be changed to make it more suitable to the artists’ need. We currently have a system that is geared towards placement of individual large scale elements, and nothing that deals with simple pattern generation. Ideally we would need a blend of both.

At this point, it’s easy to get disheartened. After all, we’ve spent almost 3 weeks creating something that we just learned is pretty close to useless. On the flip side, we are certainly not going to get bored. From a personal standpoint, this has still been a success, I’ve learned about parsing methods and language design. The experience that I’ve gained will make it easier for me to help develop a language that is powerful enough to express everything that the artists want to, while not requiring them to spend days learning it. It also means that the next version will also hopefully be developed much faster and more smoothly than the previous two versions. I am going to make it my personal project to see how fast and efficiently we can develop subsequent versions.

Writing software for non-programmers

is not very easy. As a programmer, you learn to use various pieces of software, like text-editors, compilers, debuggers, profilers, so on and so forth. All these tools are very useful, some are essential, and many of them have a learning curve, you have to invest time into learning the nuances of the particular tool if you want to use it to the maximum potential. Consequently when you are writing software that you will be used by other programmers (even if it is not a programming tool), making the software easy to use is not always the first thing that comes to mind. Let’s solve the problem and then we’ll worry about making it ‘user-friendly’. User interface gets pushed onto the backburner. The UNIX software user space is filled with various command line programs which are very effective in their own right, each with its own host of little options toggled by specifying combinations of letters and words on the command line. Not a problem for us coders, after all, text is what we live in. Not so easy for non-programmers who find the very idea of typing in commands some sort of deep black magic. Hence the rise of “graphical frontends”: point-and-click interfaces that put interaction with these text utilities in the background.

For a while i was of the idea that the GUI was for wimps. If you wanted to harness the real power inside your computer, you had to get down and dirty with the command line. However I’ve since come to realize that the real problem is not one of interface, but of thought style. I’m currently a program that (I hope) will be used by artists and architects. It is text-based and the user needs to type in configuration files. No point and click GUI (yet). I chose to take on the job of writing the front-end: the part of the program that the user would actually interact with. This meant creating a configuration language that would be easy for non-programmers to write. The language needed to describe certain elements and how those elements could be combined. My first thought was to create some sort of CSS clone for element description and something akin to Prolog for combination rules. Now, if I was writing this for programmers, I wouldn’t have given it a second thought. But I was writing software for programmers, I was writing for artists. Since then I’ve changed my language to more like Basic than any other programming language. I’ve tried to make it as close to plain English as possible, with no special symbols except for parentheses. The reason for this is simple: any configuration language, no matter how simple would mean that the artists would have to learn it’s uses. I would rather have them up and running and actually using the tool than learning my arbitrary syntax. By keeping the language as close to English as possible, I’ve tried to minimize the learning curve.

Error messages are another area where I’ve had to keep in mind that my users won’t be programmers. Compilers are notorious for having hard to read error messages. I couldn’t have my config file parser give equally esoteric error messages. Most of my parser is a simple recursive descent parser that just does simple syntactical analysis, however a part of it involves solving a system of linear equations. Once again it was very tempting to have the error messages speak in terms of the underlying logic and math, which to a programmer at least, isn’t very complicated. However, I made the conscious to have the error messages be plain English as well. The user gets a line numbers and a plain English description of what went wrong. By leaving out allusions to the math and logic, I may have made it harder for the user to fix an error in the configuration file, but that’s a risk I’m willing to take for the time being.

Programmers are instinctive tinkerers. We like playing around with things. Even when things are working right, we have a need to get in there and try to make it work faster, better, or just in a different fun way. We adore customization and love extensibility. Simply getting the job done is not enough, we want to find the most elegant, the fastest, the most efficient, the craziest way to do it. Living in our world of programmable text editors, modular IDEs and extensible browsers it’s easy to forget that often people just want software that gets the work done and stays out of their way. They don’t want unlimited power, they just want the damn thing to work.
Exercising this sort of restraint is not easy and it requires a lot of thought. I don’t want to give the user something that will overwhelm them, but at them same time, I don’t want to give them a crippled half-hearted piece of work either. Unfortunately there’s is no way to tell what a user wants or needs before you actually give them something to play with. Our software is at a tipping point: it has a small set of core features which work well, there is room to add much more, but we don’t know what will be essential and what will be extra cruft that no one ever needs. And so we are heading to meet our “clients” next week and hopefully at the end of the meeting we will have a new goal in mind.

Use your own software

Also known as “eat your own dog food”, this is the concept behind one of the most successful software engineering projects of modern times: the Windows NT kernel. The Windows NT kernel was written by a highly-talented team led by a man who is arguably one the best software engineers of all time: Dave Cutler. Dave Cutler was also the lead developer for another groundbreaking operating system: Digital’s VMS. However there was more to this project than talented developers: the whole team actually used Windows NT everyday as soon as possible. This meant that the developers exposed themselves to problems that would be encountered by the average user and could then fix those problems before actually shipping the finished product.

Let’s face it: software is buggy. We still have no clue about how to reliably write bug-free software. So we’re stuck with the situation of writing buggy software and then wrangling the bugs out of them. A lot of bugs are removed through the process of just writing a working piece of software. Automated testing also gets rid of a fair amount of bugs. However, there are some things that no amount of debugging or automated testing can get rid. Modern software systems are large and complicated and it’s hard to tell exactly how all the different parts will interact until you actually start using it all.

Even if you’re certain that your code is relatively bug-free, it’s still important to use software that you’ve written. There are a lot of things about software like look-and-feel, intuitiveness, ease of use, which can’t be determined automatically. The only way to see if your program has an elegant and smooth interface or is powerful, but clunky is to use it repeatedly. When you start using software that you’ve written on a regular basis, you start to think about how your software can be improved, what are the bottlenecks and hard-to-use features, what features are missing and what are unnecessary. This constant evaluation is a key ingredient of making better software.

Unfortunately, most of the software that is being created is made by developers who really don’t know how that piece of software is going to be used in the real world. After all, Photoshop wasn’t made my a team of artists and most users of Microsoft Word have never written a line of working code. So how do programmers go about creating software that they might never use? Enter the beta tester. Beta testers are given pre-release versions of software to evaluate and their suggestions are then folded back into the next release. The best beta testers are the very people for whom you’re writing the software in the first place. If you’re writing a software package for a specific client, then it is essential to have a continuous dialog open. Test versions should routinely be given out to get feedback and then that feedback should be incorporated into the next edition. If your software is for a mass market, then your user community will be your pool of beta testers, encourage them to give feedback and then take those opinions into account. Eating your own dog food is a good idea, but it’s not a disadvantage if you can give it to a bunch of other dogs and see what they think of it.

Throw away version 1

One of the best books on software engineering is Fred Brooks’ The Mythical Man Month. Even though it’s over thirty years old, most of the principles and ideas laid down are still applicable, some even more so now than thirty years ago. One of the ideas that I feel deserve special mention is that of the pilot system.

Like many great ideas, this one is very simple in essence. The first working system created as part of a software project is one that will be thrown away (intentionally or not), in the same way that chemical engineers will first build a pilot plant to test out their process before investing in a full scale plant. The reason for this is that it’s hard to get a grasp for what the real problems of a software system will be before some of it is actually implemented. Keeping this in mind, it’s best to start projects with the idea that version 1 will be thrown away and schedule the project accordingly.

The benefits of starting with a “throw away version 1″ philosophy are many and can help the development of any software project, large or small. Knowing that the first working version doesn’t need to be perfect lets the development team quickly create a prototype to explore the different aspects of the project. It allows experiments to try out various solutions to a problem before deciding which one to use. If you’re making software for customers (as most people are) having a working model early in the project’s schedule allows you get valuable user feedback and make changes accordingly. It’s often said that software users don’t really know what they want, but it makes everyone’s life a little bit easier if there is something tangible that can be pointed to and specific aspects of it praised or criticized.

Of course, knowing that you’re going to throw away the first version doesn’t mean that this version should be made sloppily or inconsistently. The same technical rigor should be applied to the prototype as to the final delivered product. In fact, making version 1 in a half-hearted fashion defeats the entire purpose of having such a version in the first place. Problems in the design of the product will only become apparent if version 1 reflects fairly accurately what the final is supposed to be. Engineering models are made to the same specifications and requirements as what the final should be (albeit on a smaller scale) and there is no reason why software prototypes should be any different.

I’ve had the opportunity to see the benefit of this approach first hand. We recently completed version 1 of our summer research project. Though we had known that the final version would probably be very different, we worked under the assumption that version 1 would be the basis for the later version and that if we screwed up now, we would have to pay for it later. The result is that version 1 really was a proper working system, which works (from the user perspective) in a manner very close to how we want the final to work. Of course, we are still throwing away version 1. Even though it works fine, we’ve seen that there are a number of fundamental flaws to our approach. The ad-hoc parser that I wrote for our configuration language works, but is ugly and inflexible. I’ve already replaced it with a more robust and flexible recursive descent parser. The output mechanism is currently hard-coded into the rest of the system, but it should be pluggable; we now have an idea of what sort of an architecture we want. In fact, the very core of our system will probably have to be replaced or at least changed substantially, because we just realized that what we had written wasn’t really what we had set out to do. All this might sound like a disaster, and it would have been if version 1 wasn’t designed with the idea of eventually throwing it away. We used it as a test bed and learned valuable lessons from it which we’ll now apply to the creation of version 2.

Unless your project is something very trivial, it’s best to start with the idea of throwing away version 1 and learning from the lessons that it will inevitably each. There is another software engineering principle that runs complimentary to “throw away version 1″ that can be best described as “eat your own dog food”, but that’s something for another post, maybe tomorrow. Till then, have fun with version 1.

How many programming languages should I learn?

OSNews has started a series called A-Z of programming languages where they’ve been posting interviews with the creators of well-known programming languages. Till now they’ve done AWK, Ada and BASH. Of those three the only one I’ve had any experience with is BASH, and not too much of that. But considering that there are literally hundreds of programming languages (and many more dialects or implementations) which ones should one learn to be a good programmer?

I know that many programmers out there simply learn just one or two languages and then use them throughout their careers (or at least until it becomes impossible to find a job). Certainly that works, to some point at least and so the question is, do we really need to learn languages that are not the “industry standard” (i.e. whatever has the most jobs on offer). If all you’re interested in is a job, then no, you don’t. One or two languages will probably be enough. However, if you want to keep learning and keep developing as a programmer, then the answer is most certainly yes. Some programming languages are quite similar in terms of syntax and power, but some are very different and teach you think in different ways. It’s these different languages that are going to make you better as a programmer.

So we come back to our initial question: How many languages to learn and perhaps more importantly which ones? I think that there are two types of languages that are worth learning: Those that make you think differently and those that have been used to write a large amount of high quality code. Languages like Smalltalk and LISP and to some extent Java are strongly paradigm-oriented, they emphasize a specific style of programming, in the case of Smalltalk it’s object-oriented and in the case of LISP and it’s derivatives, it’s pure functional. Such languages will teach you important lessons which you can apply even when you’re using some other languages.

One of the languages that has been widely embraced by the hacker community is C. There is a incredible amount of really good code written in C, the most famous of which is probably the Linux kernel. Unless you’re a systems programmer, you probably won’t have to use C or C++ much, but you can benefit a lot from reading well written code. C and C++ can be used to write powerful code, but sometimes the power doesn’t quite justify taking the trouble of all the low-level work that you have to do. In that case, it’s good to have a general purpose high-level language lying around. I would strongly recommend Python, but you might find another language easier for everyday use. But once you have made a choice, learn it well and use it to it’s maximum.

It might be worthwhile learning a language that has powerful text processing abilities, like AWK or Perl. It might make your work easier if you know one or both. And there is an awful lot of code written in Perl for the purpose of gluing larger programs together, so it might be worthwhile to learn it. However, Perl has been falling from grace for a good few years and many people are now using Python and Ruby to the same things they used Perl for. I don’t have a concrete opinion of Perl at the moment, but I think it’s something you can put of learning until you have an actual need for it.

You should also learn Java. I personally consider Java to be a decent language, but not a good one and I probably wouldn’t use it if I had a choice. However, it is a very popular one and if you get a job as a programmer, chances are you’ll encounter a substantial amount of Java code which you have to deal with. And you won’t be much of a programmer if you can’t deal with other people’s code. So learn Java.

Knowing basic HTML or CSS is also a good idea. You might not be a full fledged web artist, but you should be able to throw together a decent web page without much trouble. Considering the growing importance of the web, learning a web programming language is becoming important. It’s not quite a necessity yet, but I think in less than 5 years it will be. I can’t recommend one now, because I have no experience, but I think that Ruby might be a good idea, because it’s a decent general purpose language as well.

I should say, that I don’t know all of the above, but I have had some experience with each of them. I think each of them have contributed to making me a better programmer, and that the more I delve into them and use them for harder problems, I will continue to improve. In conclusion, I would like to say that if you are committed enough to learn multiple languages well, you should also invest some time in learning a powerful text editor such as Vi or Emacs. Though you can certainly write great code with nothing more than Notepad, using a more powerful tool can make your job quite a bit easier (and considerably faster). Once you turn fine-tuning these editors to suit your style and habits, you won’t want to use anything else. If you’re seriously out to become the best programmer you can be, you’re going to want the best possible tools at your disposal.

I’ll be happy to here your comments on what programming languages you might recommend to anyone looking to improve their programming.

MapReduce to the rescue

Searching and sorting are common problems given to computer science students. They are also very interesting problems, which have a number of different approaches some of which are better than others (depending on circumstances). Most things can be searched: integers, strings, complex data objects, pretty much anything that can be compared can be searched and sorted. Searching and sorting string data is especially important since it has wide applications in areas such as natural language processing. So here’s a question: how do you search something that is very large (say thousands of gigabytes) and how do you do it so fast that the person doing the search doesn’t even have time to think about the next query before the results are found?

It would be utterly ludicrous to do this with just a single computer. As most people who have used desktop search know, the process can be frustratingly slow. But even if you add dozens, or hundreds of computers, searching can still be a delicate problem. The question then becomes how do you properly utilize your computing resources? Using the old technique of divide and conquer might be a good idea, splitting up the search among numerous CPUs, having them each do a small part and then combining the results. Google’s MapReduce does just that. Each Google search query requires the search of Google’s huge web index which is many terabytes in size. Google has thousands of processors lined up for doing such a job. MapReduce provides the infrastructure for breaking up a single search over terabytes to thousands of much smaller (and hence, much faster) tasks. The results are then combined and displayed to the user in record time.

MapReduce takes its name from two concepts in functional programming. Map takes a function and a list of inputs and then applies that function to each of the inputs in turn, producing another list with all the results. Reduce works by taking a list of inputs (or a data structure of some sort) and then combining the inputs in a specified manner, returning the final combined value. It’s easy to see how this paradigm can be applied to searching or sorting. Map would search small chunks of the given data (in parallel) and Reduce would then combine the various results back together into a single output for the user.

But it’s not just searching or sorting that can use the map-reduce approach. Whenever you have to apply an operation over and over again to multiple independent inputs and then combine the results into a single unified result, you can use some implementation of map-reduce. Things get difficult if you have dependencies between the inputs, and it’s these dependencies that make parallel programming difficult.

So now that I’ve told you about the wonders of MapReduce, you want to go play. That’s understandable, but you’re probably not in charge of one of Google’s data centers and so don’t have direct access to Google’s powerful MapReduce infrastructure. So what do you do? Well let me introduce Hadoop: Hadoop is an open-source implementation of MapReduce in Java designed to make it easy to break down large scale data processing into multiple small chunks for parallel processing. It implements a distributed file system to spread out data over multiple nodes (with a certain amount of redundancy) and processing data chunks in place before combining them back. It’s currently been run on clusters with up to 2000 nodes. Hadoop might not be as powerful as Google’s MapReduce, (after all Google has deep control over both the hardware and software and can fine tune for maximum performance) but it will get the job done. So what are you waiting for? Go find some reasonable gigantic data processing problem and MapReduce it into insignificance.

Design for unit-testing

I’ve written before about the role of testing in programming and as I’ve written more code (and unit tests) over the past few weeks, my conviction that unit-testing is useful for more than just determining program correctness has become even stronger. In my previous post I spent about a paragraph exploring the other benefits of testing, I think it’s time that I offered a more detailed view of the alternate advantages of unit-testing.

As I started working on my last project for the semester yesterday, one of the things at the top of my mind was that the professor would be running his own unit-test on our programs. Our other projects had clear descriptions of what methods we were to create and how they should behave. But this project, a simple word counter written in Java, was different in that we were only told how the program as a whole should respond. We could shape the internals as we wished. Though it was very tempting to create a monolithic system (especially since the problem was rather simple), I decided that I should design with automated unit-testing in mind. Rather than making design complicated, this choice actually made things clearer.

It was obvious that the UI would have to be completely separate from the actual work portion of the code. Not only that, but that values returned from the methods doing the work would have to be in a form so that they could easily be compared to expected results. This meant that while the UI would deal with output formatting, there should not be very much formatting required in the first place. This guided me in the choice of the data structures that would eventually be returned.

Another area in which designing with testing in mind is useful is in determining if methods should be private. JUnit 3.8 does not support direct of testing of private methods. There is a certain amount of debate regarding whether or not this is a good thing, but this restriction does force a design methodology that can be beneficial. The only methods that should be private are the ones that need not be tested separately, that is, if the calling method passes testing, it automatically means that the private method is also working correctly. Though I didn’t need any private methods in this particular project, I did before and keeping the limitations of private testing in mind resulted in what I believe to be cleaner, more readable code.

Of course, it is easy to go overboard with unit testing. Your program shouldn’t be endlessly subdivided into lots of tiny functions just so that you tests every tiny chunk of code. Unit testing can help you design cleaner simpler program, but only if your design is pragmatic to start with. Like I’ve said before, no amount of coding tricks and development methodologies will fix a fundamentally broken design.

Jeff Atwood of Coding Horror said some time ago that unit-tests should be first class language constructs. At the time I thought that this was a bit overboard, but I’m coming to realize that this might be a good thing. Any language which has a decent error-handling mechanism would be able to bake in support for unit testing and having a unit-test mechanism directly in the language would probably encourage students to use it (and more importantly teachers to teach it). In my CS course we’ve using Unit-tests from early on, which I feel was very good decision on the part of the professor (even though many of my fellow students don’t quite seem to grasp it’s full importance right now).

Designing for unit testing encapsulates a lot of the ideas that are a part of good software design: UI separation, abstraction, code reuse and readability. Unit-testing is also a perfect example of abstraction: just worry about the big picture and the details will take care of themselves. So the next time you find yourself dealing with a big project, design for unit-testing and chances are you’ll be making better code than if you weren’t. Keep in mind though that no amount of unit-testing will replace actual user-testing…so make sure you get around to that at some point as well (hopefully as soon as you have something a user can actually use).

Next Page »