Writing software for non-programmers

is not very easy. As a programmer, you learn to use various pieces of software, like text-editors, compilers, debuggers, profilers, so on and so forth. All these tools are very useful, some are essential, and many of them have a learning curve, you have to invest time into learning the nuances of the particular tool if you want to use it to the maximum potential. Consequently when you are writing software that you will be used by other programmers (even if it is not a programming tool), making the software easy to use is not always the first thing that comes to mind. Let’s solve the problem and then we’ll worry about making it ‘user-friendly’. User interface gets pushed onto the backburner. The UNIX software user space is filled with various command line programs which are very effective in their own right, each with its own host of little options toggled by specifying combinations of letters and words on the command line. Not a problem for us coders, after all, text is what we live in. Not so easy for non-programmers who find the very idea of typing in commands some sort of deep black magic. Hence the rise of “graphical frontends”: point-and-click interfaces that put interaction with these text utilities in the background.

For a while i was of the idea that the GUI was for wimps. If you wanted to harness the real power inside your computer, you had to get down and dirty with the command line. However I’ve since come to realize that the real problem is not one of interface, but of thought style. I’m currently a program that (I hope) will be used by artists and architects. It is text-based and the user needs to type in configuration files. No point and click GUI (yet). I chose to take on the job of writing the front-end: the part of the program that the user would actually interact with. This meant creating a configuration language that would be easy for non-programmers to write. The language needed to describe certain elements and how those elements could be combined. My first thought was to create some sort of CSS clone for element description and something akin to Prolog for combination rules. Now, if I was writing this for programmers, I wouldn’t have given it a second thought. But I was writing software for programmers, I was writing for artists. Since then I’ve changed my language to more like Basic than any other programming language. I’ve tried to make it as close to plain English as possible, with no special symbols except for parentheses. The reason for this is simple: any configuration language, no matter how simple would mean that the artists would have to learn it’s uses. I would rather have them up and running and actually using the tool than learning my arbitrary syntax. By keeping the language as close to English as possible, I’ve tried to minimize the learning curve.

Error messages are another area where I’ve had to keep in mind that my users won’t be programmers. Compilers are notorious for having hard to read error messages. I couldn’t have my config file parser give equally esoteric error messages. Most of my parser is a simple recursive descent parser that just does simple syntactical analysis, however a part of it involves solving a system of linear equations. Once again it was very tempting to have the error messages speak in terms of the underlying logic and math, which to a programmer at least, isn’t very complicated. However, I made the conscious to have the error messages be plain English as well. The user gets a line numbers and a plain English description of what went wrong. By leaving out allusions to the math and logic, I may have made it harder for the user to fix an error in the configuration file, but that’s a risk I’m willing to take for the time being.

Programmers are instinctive tinkerers. We like playing around with things. Even when things are working right, we have a need to get in there and try to make it work faster, better, or just in a different fun way. We adore customization and love extensibility. Simply getting the job done is not enough, we want to find the most elegant, the fastest, the most efficient, the craziest way to do it. Living in our world of programmable text editors, modular IDEs and extensible browsers it’s easy to forget that often people just want software that gets the work done and stays out of their way. They don’t want unlimited power, they just want the damn thing to work.
Exercising this sort of restraint is not easy and it requires a lot of thought. I don’t want to give the user something that will overwhelm them, but at them same time, I don’t want to give them a crippled half-hearted piece of work either. Unfortunately there’s is no way to tell what a user wants or needs before you actually give them something to play with. Our software is at a tipping point: it has a small set of core features which work well, there is room to add much more, but we don’t know what will be essential and what will be extra cruft that no one ever needs. And so we are heading to meet our “clients” next week and hopefully at the end of the meeting we will have a new goal in mind.

Use your own software

Also known as “eat your own dog food”, this is the concept behind one of the most successful software engineering projects of modern times: the Windows NT kernel. The Windows NT kernel was written by a highly-talented team led by a man who is arguably one the best software engineers of all time: Dave Cutler. Dave Cutler was also the lead developer for another groundbreaking operating system: Digital’s VMS. However there was more to this project than talented developers: the whole team actually used Windows NT everyday as soon as possible. This meant that the developers exposed themselves to problems that would be encountered by the average user and could then fix those problems before actually shipping the finished product.

Let’s face it: software is buggy. We still have no clue about how to reliably write bug-free software. So we’re stuck with the situation of writing buggy software and then wrangling the bugs out of them. A lot of bugs are removed through the process of just writing a working piece of software. Automated testing also gets rid of a fair amount of bugs. However, there are some things that no amount of debugging or automated testing can get rid. Modern software systems are large and complicated and it’s hard to tell exactly how all the different parts will interact until you actually start using it all.

Even if you’re certain that your code is relatively bug-free, it’s still important to use software that you’ve written. There are a lot of things about software like look-and-feel, intuitiveness, ease of use, which can’t be determined automatically. The only way to see if your program has an elegant and smooth interface or is powerful, but clunky is to use it repeatedly. When you start using software that you’ve written on a regular basis, you start to think about how your software can be improved, what are the bottlenecks and hard-to-use features, what features are missing and what are unnecessary. This constant evaluation is a key ingredient of making better software.

Unfortunately, most of the software that is being created is made by developers who really don’t know how that piece of software is going to be used in the real world. After all, Photoshop wasn’t made my a team of artists and most users of Microsoft Word have never written a line of working code. So how do programmers go about creating software that they might never use? Enter the beta tester. Beta testers are given pre-release versions of software to evaluate and their suggestions are then folded back into the next release. The best beta testers are the very people for whom you’re writing the software in the first place. If you’re writing a software package for a specific client, then it is essential to have a continuous dialog open. Test versions should routinely be given out to get feedback and then that feedback should be incorporated into the next edition. If your software is for a mass market, then your user community will be your pool of beta testers, encourage them to give feedback and then take those opinions into account. Eating your own dog food is a good idea, but it’s not a disadvantage if you can give it to a bunch of other dogs and see what they think of it.

Throw away version 1

One of the best books on software engineering is Fred Brooks’ The Mythical Man Month. Even though it’s over thirty years old, most of the principles and ideas laid down are still applicable, some even more so now than thirty years ago. One of the ideas that I feel deserve special mention is that of the pilot system.

Like many great ideas, this one is very simple in essence. The first working system created as part of a software project is one that will be thrown away (intentionally or not), in the same way that chemical engineers will first build a pilot plant to test out their process before investing in a full scale plant. The reason for this is that it’s hard to get a grasp for what the real problems of a software system will be before some of it is actually implemented. Keeping this in mind, it’s best to start projects with the idea that version 1 will be thrown away and schedule the project accordingly.

The benefits of starting with a “throw away version 1″ philosophy are many and can help the development of any software project, large or small. Knowing that the first working version doesn’t need to be perfect lets the development team quickly create a prototype to explore the different aspects of the project. It allows experiments to try out various solutions to a problem before deciding which one to use. If you’re making software for customers (as most people are) having a working model early in the project’s schedule allows you get valuable user feedback and make changes accordingly. It’s often said that software users don’t really know what they want, but it makes everyone’s life a little bit easier if there is something tangible that can be pointed to and specific aspects of it praised or criticized.

Of course, knowing that you’re going to throw away the first version doesn’t mean that this version should be made sloppily or inconsistently. The same technical rigor should be applied to the prototype as to the final delivered product. In fact, making version 1 in a half-hearted fashion defeats the entire purpose of having such a version in the first place. Problems in the design of the product will only become apparent if version 1 reflects fairly accurately what the final is supposed to be. Engineering models are made to the same specifications and requirements as what the final should be (albeit on a smaller scale) and there is no reason why software prototypes should be any different.

I’ve had the opportunity to see the benefit of this approach first hand. We recently completed version 1 of our summer research project. Though we had known that the final version would probably be very different, we worked under the assumption that version 1 would be the basis for the later version and that if we screwed up now, we would have to pay for it later. The result is that version 1 really was a proper working system, which works (from the user perspective) in a manner very close to how we want the final to work. Of course, we are still throwing away version 1. Even though it works fine, we’ve seen that there are a number of fundamental flaws to our approach. The ad-hoc parser that I wrote for our configuration language works, but is ugly and inflexible. I’ve already replaced it with a more robust and flexible recursive descent parser. The output mechanism is currently hard-coded into the rest of the system, but it should be pluggable; we now have an idea of what sort of an architecture we want. In fact, the very core of our system will probably have to be replaced or at least changed substantially, because we just realized that what we had written wasn’t really what we had set out to do. All this might sound like a disaster, and it would have been if version 1 wasn’t designed with the idea of eventually throwing it away. We used it as a test bed and learned valuable lessons from it which we’ll now apply to the creation of version 2.

Unless your project is something very trivial, it’s best to start with the idea of throwing away version 1 and learning from the lessons that it will inevitably each. There is another software engineering principle that runs complimentary to “throw away version 1″ that can be best described as “eat your own dog food”, but that’s something for another post, maybe tomorrow. Till then, have fun with version 1.

Building software or growing it?

For the past week I’ve been working on my first multi-person, long term software project. As part of my summer research project we’re building an urban planning system which automatically builds sample cities from a set of design conditions. I’ve been working with two other people and my professor, and though it isn’t a large scale project like writing an operating system, it’s still quite challenging. One of the major challenges that I’ve come to face is that it’s a very hard to stick to any designs that may have been made before actual writing of code. Furthermore, in a research project such as the one I’m working on, the problems aren’t well known and we’re discovering them as we go along. This results in the problem that I’m continually rewriting parts of my code and adding parts as I go along.
In class we were always taught to plan out our programs in a engineering fashion, and then implement. I’ve tried to follow this plan-then-implement methodology, but I’ve to improvise quite a lot. So I’ve been wondering, what’s the best way to make large pieces of software: grow them or build them? Building software is what I’ve been taught in class: where we make a  plan and then follow it. Now anyone who has written a sizeable amount of code will now that it’s impossible to plan for all eventualities. That’s what debugging is all about: dealing with things that you didn’t plan for. However debugging deals with errors, not with the basic structure of the architecture. I’ve previously advocated carefully planning out algorithms before implementing, and i still to that. But when it comes to actually building large software systems, it’s important not to try to plan everything out. I’ve come to realize that it’s best to combine a moderate amount of planning with a technique of organically growing your software.
What do I mean by growing? It’s simple, start with something small, that you know works (and that you’ve tested thoroughly) and keep adding on to that. You keep adding functionality one bit at a time, getting closer to what you finally need. This method has a two fold benefit, firstly it makes sure that you have a working program before you starting adding more functionality. That lets you rollback to a working program whenever things get out of control. Secondly, it leaves you free to add functionality and deal with complications without having to worry about fitting things into a preconceived plan.
A well conceived plan is good and essential to a successful software project, however it’s good to leave room for the eventualities that will inevitably occur, Growing software and building things on top of an already working base is a way that allows creation of software in a structured manner that leaves room for dealing eventualities.

How many programming languages should I learn?

OSNews has started a series called A-Z of programming languages where they’ve been posting interviews with the creators of well-known programming languages. Till now they’ve done AWK, Ada and BASH. Of those three the only one I’ve had any experience with is BASH, and not too much of that. But considering that there are literally hundreds of programming languages (and many more dialects or implementations) which ones should one learn to be a good programmer?

I know that many programmers out there simply learn just one or two languages and then use them throughout their careers (or at least until it becomes impossible to find a job). Certainly that works, to some point at least and so the question is, do we really need to learn languages that are not the “industry standard” (i.e. whatever has the most jobs on offer). If all you’re interested in is a job, then no, you don’t. One or two languages will probably be enough. However, if you want to keep learning and keep developing as a programmer, then the answer is most certainly yes. Some programming languages are quite similar in terms of syntax and power, but some are very different and teach you think in different ways. It’s these different languages that are going to make you better as a programmer.

So we come back to our initial question: How many languages to learn and perhaps more importantly which ones? I think that there are two types of languages that are worth learning: Those that make you think differently and those that have been used to write a large amount of high quality code. Languages like Smalltalk and LISP and to some extent Java are strongly paradigm-oriented, they emphasize a specific style of programming, in the case of Smalltalk it’s object-oriented and in the case of LISP and it’s derivatives, it’s pure functional. Such languages will teach you important lessons which you can apply even when you’re using some other languages.

One of the languages that has been widely embraced by the hacker community is C. There is a incredible amount of really good code written in C, the most famous of which is probably the Linux kernel. Unless you’re a systems programmer, you probably won’t have to use C or C++ much, but you can benefit a lot from reading well written code. C and C++ can be used to write powerful code, but sometimes the power doesn’t quite justify taking the trouble of all the low-level work that you have to do. In that case, it’s good to have a general purpose high-level language lying around. I would strongly recommend Python, but you might find another language easier for everyday use. But once you have made a choice, learn it well and use it to it’s maximum.

It might be worthwhile learning a language that has powerful text processing abilities, like AWK or Perl. It might make your work easier if you know one or both. And there is an awful lot of code written in Perl for the purpose of gluing larger programs together, so it might be worthwhile to learn it. However, Perl has been falling from grace for a good few years and many people are now using Python and Ruby to the same things they used Perl for. I don’t have a concrete opinion of Perl at the moment, but I think it’s something you can put of learning until you have an actual need for it.

You should also learn Java. I personally consider Java to be a decent language, but not a good one and I probably wouldn’t use it if I had a choice. However, it is a very popular one and if you get a job as a programmer, chances are you’ll encounter a substantial amount of Java code which you have to deal with. And you won’t be much of a programmer if you can’t deal with other people’s code. So learn Java.

Knowing basic HTML or CSS is also a good idea. You might not be a full fledged web artist, but you should be able to throw together a decent web page without much trouble. Considering the growing importance of the web, learning a web programming language is becoming important. It’s not quite a necessity yet, but I think in less than 5 years it will be. I can’t recommend one now, because I have no experience, but I think that Ruby might be a good idea, because it’s a decent general purpose language as well.

I should say, that I don’t know all of the above, but I have had some experience with each of them. I think each of them have contributed to making me a better programmer, and that the more I delve into them and use them for harder problems, I will continue to improve. In conclusion, I would like to say that if you are committed enough to learn multiple languages well, you should also invest some time in learning a powerful text editor such as Vi or Emacs. Though you can certainly write great code with nothing more than Notepad, using a more powerful tool can make your job quite a bit easier (and considerably faster). Once you turn fine-tuning these editors to suit your style and habits, you won’t want to use anything else. If you’re seriously out to become the best programmer you can be, you’re going to want the best possible tools at your disposal.

I’ll be happy to here your comments on what programming languages you might recommend to anyone looking to improve their programming.

What would you teach in CS 101?

Teaching is not easy. Especially when it is a subject like computer science. It’s very easy to get too theoretical and teach all the concepts, theories and techniques while not saying anything about how to apply all of that to writing practical working software. At the same time it is equally easy to teach only simple programming and in the process turn the theoretical aspects into drudgery which students only learn because it’s part of the degree requirement. Teaching computer science is all the more complicated because for all it’s innate fun and the enjoyment one can get from it, computer science is hard.

Joel Spolsky in one of his most famous blog posts blasted modern computer science curricula at almost all colleges for dumbing down computer science. In particular he criticizes the use of Java as a first programming language. Joel believes that computer science courses are no longer hard enough. in the old days (read 10-20 years ago), computer science were designed to be hard, the intro courses were designed to weed out the students who weren’t dedicated and smart enough. He claims that modern courses do not teach students the fundamentals of pointers and recursion and ill-equips them to work outside their comfort zone.

When I first read this post almost two years ago, I was…. enthralled. Here was someone actually saying that computer science should be made hard. It was probably the heretical nature of his article that enthralled me. I had been learning Java in school for almost a year and had developed a strong dislike for it. I should say at this point that my computer science course was not taught particularly well. My teacher had been doing C++ for years and it was obvious that he was not particularly comfortable with Java. Though the concept of object-oriented programming was introduced and we learned the advantages of it, I don’t think I really understood it. And I felt the whole business of writing classes with main methods and such for the simple programs that we were doing very tedious. I did well in that course but didn’t like it much. The dislike of Java stayed with me for quite a while, another reason I liked Spolsky’s article so much. A few months later as I read the first few pages of Structure and Implementation of Computer Programs, I understood why Spolsky spoke so highly of it and the course it was written for. By that point I had come to the realization that almost all intro computer science courses in the world were complete and utter rubbish with the exception of MIT’s 6.001.

But all that was more than a year. Since then I’ve gone through my first college CS course (though I skipped the intro course), I’ve programmed more in both Java and Scheme and other languages and I’ve read far and wide about computer science and programming. The result is that while I still argue with Spolsky in principle, I’ve come to realize that the issue is far more complicated. It’s certainly true that there is far, far more to computer science than programming. It’s not easy to realize that when you’re dealing with the complex framework and syntax that Java and most other languages give you and most intro computer science courses carefully hide away the powerful theoretical architecture that forms the foundations of computer science. Why? Because to gain the power offered by this theoretical architecture you have to pay the price of intense dedication to learning things that are far more unfamiliar (and hence perceived harder) than most other things that you are ever likely to learn.

Computer Science is hard and most people don’t like doing hard things. Making computer science easy helps to keep people in the department who would otherwise become Econ/history majors (yes I personally know more than a few people who have done so). And you need people in the department if you’re going to have a department. Of course, even if the intro courses are easy, the more advanced ones are not quite so. It would be really interesting to see how easy you could make operating systems or programming languages. If you want to take an operating systems course you’ll have to learn C or C++ and wrap your head around pointers. And no programming language course is complete without heavy recursion (BNF Grammars anyone?) and some amount of functional programming.

But coming back to CS 101, how do you make a course that is easy and interesting enough to make it attractive to freshmen while not giving them a false idea of the difficulty ahead? What tools do you use and how much theory do you teach? Any one of those questions would be hard to answer on it’s own without the added complication that in many places CS 101 is also required of engineering and math majors who aren’t going to take higher level computer science courses. I would personally like a hard intro courses that doesn’t mince words and gets straight to the point of how programming actually works. I think MITs 6.001 is a wonderful idea, which is why I’m teaching myself Scheme and reading through SICP. The things that I’ve learned there help even when I’m programming in Java and I’ve become a far better programmer for it. At the same time I realize that if my college taught a similar course as an intro, our department would only be left with about 5% of students (maybe less). My fellow students are lucky that our professors go out of their way to teach real-world problems and talk about things other than simple object-oriented programming.

One solution to the problem would be make a separate “programming” course for those who are not going to be computer science students. This would allow CS 101 to be a harder theory-heavy course aimed at people who are fairly certain that they want to study computer science and probably have some programming experience already. Students who are interested in computers but are not sure if they want to major in it could start with the programming course and then progress on to the harder course. My college is starting to go down this path with a new course called computational methods, though students from other science departments are still required to take the normal CS courses. I hope that this sort of a division soon becomes standard practice.

Leaving the technical to the sides for a moment, I believe that a good introductory course should also include some aspects of the history of computers as well as references to the outstanding problems in the field (and there are a lot of those to go around). It’s important for students to realize that computers weren’t always as strong and powerful as they are now and that there are still great challenges ahead. I’ll be out of college in three years (and looking to tackle some of those challenges) and so I don’t think I’ll benefit from a restructured curriculum. However, I’m still interested to see what directions education in our field takes.

Musical Musings

I just received an email from the founder of Pandora asking for support in bringing a halt to the RIAA’s attempts to gather more royalties from online music broadcasters at rates which would effectively bring online radio to an end. I’m personally a big fan of online radio and I would be really sad to see it come to an end for no reason other than pure and simple greed. The entertainment industry, especially music is at a very important point. The monopoly held by record and broadcasting companies is gradually being brought to an end by the growing prevalence of technology and the sophistication of media recording and editing tools available to the common man. Of course the industry isn’t about to let go of its major streams of income without fighting tooth and nail for it. The result is the proliferation of techniques and technologies which have no other purpose than restricting how consumers can use the media that they have legally bought.

Apple’s extremely successful online iTunes store uses digital rights management and a patent-licensed, non-free file format to prevent copying music. However the result isn’t really the stopping of privacy, but growing inconvenience for the user. You can’t officially use software other than iTunes to transfer music to your iPod and music bought on the iTunes store can’t be played without iTunes or some sort of unauthorized, unsupported plugin. I feel that DRM is quite simply an injustice to honest paying customers. While I do not support piracy and feel that musicians should be adequately compensated for their work, I don’t think any authority has the right to tell me what I should do with music that I have paid for. The RIAAs claims that storing my music on an online backup system like MP3Tunes or even on multiple CDs is illegal. Excuse me if I disagree.

The strangest part of this whole affair is that it is technologically impossible for any authority to regulate copying the way that the industry wants to. If you can create software to lock down particular media files, it is also possible to create software to open those locks. Of course the easiest thing to do, as a consumer is to simply not buy music or other media that is crippled by DRM or other restrictions. Music CDs are one way to go. However, if you are the type who prefers to buy music in a purely electronic online, you don’t have to turn to Apple’s DRM’d iTunes store any more. The recently launched Amazon MP3 store has a large and growing collection of DRM-free 256Kbps MP3 tracks for download as soon as you have paid. These are plain old MP3 files that can be copied and transferred without limit and loaded onto any MP3 player. I’ve been considering buying music online, and though I would still pay a little extra for a CD, I think Amazon’s store is a much better option than the iTunes store and it’s a great way for conscious consumers to vote with their wallets.

My own music collection is purely MP3, ripped from the CDs. I do use the iTunes/iPod combo, because it works for me. However, I do maintain separate backups of my music (on a separate computer and on DVD). The recent attempts by Apple to prevent the use of the iPod with iTunes MP3 players has disturbed me somewhat, but for the time being, I am content to tolerate it. At the same time, I don’t use the non-free AAC format (which is default for iTunes). Since the MP3 is not actually free or open, I have considered changing my music to something that is, such as Ogg Vorbis. However, I think that at this point that would be rather inconvenient for me. Most importantly, I use my iPod a lot and the iPod doesn’t support the Ogg Vorbis format. The Rockbox firmware for iPods and other players allows playing Ogg Vorbis files, but it only supports older iPods. When it becomes available for newer iPods, I will seriously consider a switch. In fact, I have been looking around for an older iPod that I could get for cheap to try Rockbox on it.

Though I’m content to use the iPod/iTunes combo for the time being, if Apple were to try to lock me in to its proprietary format, I would not hesitate to switch to a less restricted player (and Ogg Vorbis while I’m at it). I suspect that many other people would do the same, especially tech-savvy early-adopters. And it’s probably not a good idea to get the early adopters unhappy.

Edit: I had incorrectly referred to AAC as Apple’s proprietary format. Thanks to the first commenter below for pointing this out.

Interactive programming

Programming is an inherently interactive task. It involves a continuous cycle of action, feedback and more action. Once the first draft of a program is written, it is compiled and run and any errors and bugs have to tracked. The source code is then changed and the whole process starts again. However this process of interaction is not quite as seamless as it might be made out to be. Anyone who has ever sat around waiting for code to compile will know that large swathes of time can be spent just sitting around. And the larger your project is, the more time you will spend just waiting. A lot is made out of how much software companies lose to employees playing games, chatting and reading blogs. The loss to long compile time may well be more, or at least it was.

With the rise of purely interpreted languages, the long waits required for compilation have shortened considerably. Modern compiled languages like Java and C# also have fast compilation tasks since they compile to an intermediate bytecode rather than straight down to binary. In both these cases there is still a fast compile time (or no compile time at all) is gained at the cost of a slower execution time. Is this a good thing? This raises the question of programmer time vs. machine time all over again. These modern languages allow for much better use of a programmers time at the cost using more machine time. In most cases, this tradeoff is not only acceptable, it would be foolhardy to go the other way. There are multiple reasons why Java has won away many programmers from C and C++ and this is one of them. The increase in interaction gained from using something like Java and C# means that programmers can spend more time actually working on their code, seeing it respond to use, catching bugs and fixing them. Debugging is one of the most tedious and hated aspects of programming. Programming should be fun, but debugging can turn it into a horrid game of finding the needle in the haystack. The feeling of hopelessness that you get when all you can do is watch your compile messages stream by can make the experience all the more painful. Once again, increased interaction puts the fun back in programming. Imagine playing a first person shooter where you could only take one shot a minute and then compare that to what a modern FPS is really like. 

So much for the professional programmer and the software company. What about the fresh young computer science student like me? There are two aspects to reducing compile times and increasing interactivity. Firstly, it does make things more fun. It’s faster to poke around in your code and see what happens. You can try out more than one approach to a problem without feeling like you’re wasting time. I feel that encourages you to look beyond the most obvious solution to find ones that are more efficient but require more effort. Of course it helps if the instructor rewards innovativeness and elegance and not just simple correctness. As one of my senior friends likes to remind me Computer Science is an experimental science. It helps when your experiments take less time for the simple reason that you can then do more of them to find out what works best.

However the flipside of making experimentation easy is that it can encourage sloppiness and discourage careful thought and planning before starting to code. I’ve experienced this first hand while doing a semester project. I was coding in Java and I believe the speed of recompiling encouraged me to just make random changes hoping they would work. Would I have done the same if I was coding in C or C++? Probably yes, because the core problem was deeper than a question of programming language, but I am certain the randomness of my changes would have gone down. Less importantly, but more obviously, I’ve become considerably sloppy when it comes to syntax. I still make silly omissions even though I have been coding in Java extensively for over 5 months now. I would probably be more careful and would reread my code if I knew that I would be incurring a significant time penalty for every compile-time error. Unfortunately there are no easy to mitigate these adverse effects. The only solution is discipline and commitment on the part of the student. Bad habits are hard to break, so it would be better to simply form good habits in the first place. 

Just as there have been advances to programming languages to encourage more interactive programming, their have been improvements to the programming environments to support interaction. My first introduction to interactive programming was probably using good old Logo turtle graphics, which is decades old, but the concept has been recreated numerous times. The step to C++ meant a jump down in interaction, but I’ve recovered much of it with Python. Python and Ruby have interactive interpreters which let you enter statements and see their results immediately. It’s a very good way to test out a small chunk of code before actually implementing it, because you don’t have to write the wrapper functions or classes before seeing what may happen. It’s hard to improve on the concept of an interactive shell. Once you can execute program statements directly, if you have the proper libraries, you can interactively create things ranging from simple number crunchers to GUIs. Of course, there’s a time when you have to stop being purely interactive and commit code to program files, but even then, you can test the code you’ve saved interactively as well. I often wish that Java had something of the sort, especially when my programming assignment involves writing only a method or two without worrying about the rest of the class. The BlueJ IDE allows this to some extent by allowing interactive object creation and testing, but I’m not sure if the other larger IDEs have similar features. 

So where is interactive programming going in the future? It seems that interpreted languages with almost no compile time are gaining popularity and the use of interactive IDEs should increase as a result. There are also programming education projects that I find very interesting, especially MIT’s Scratch. Things like this make programming more appealing to younger children, something that is essential if the computer science field is to continue to thrive on an abundance of powerful, imaginative minds, like it has since its inception. As our machines become more powerful (both through increase in speed and by parallelism) compile times will continue to fall meaning that even programming in close-to-the-metal languages will become less tedious and more interactive. We’ve come a long way from the days of batch processing punch-card programs and waiting hours for the results. I don’t think we’re going to stop anytime soon.

MapReduce to the rescue

Searching and sorting are common problems given to computer science students. They are also very interesting problems, which have a number of different approaches some of which are better than others (depending on circumstances). Most things can be searched: integers, strings, complex data objects, pretty much anything that can be compared can be searched and sorted. Searching and sorting string data is especially important since it has wide applications in areas such as natural language processing. So here’s a question: how do you search something that is very large (say thousands of gigabytes) and how do you do it so fast that the person doing the search doesn’t even have time to think about the next query before the results are found?

It would be utterly ludicrous to do this with just a single computer. As most people who have used desktop search know, the process can be frustratingly slow. But even if you add dozens, or hundreds of computers, searching can still be a delicate problem. The question then becomes how do you properly utilize your computing resources? Using the old technique of divide and conquer might be a good idea, splitting up the search among numerous CPUs, having them each do a small part and then combining the results. Google’s MapReduce does just that. Each Google search query requires the search of Google’s huge web index which is many terabytes in size. Google has thousands of processors lined up for doing such a job. MapReduce provides the infrastructure for breaking up a single search over terabytes to thousands of much smaller (and hence, much faster) tasks. The results are then combined and displayed to the user in record time.

MapReduce takes its name from two concepts in functional programming. Map takes a function and a list of inputs and then applies that function to each of the inputs in turn, producing another list with all the results. Reduce works by taking a list of inputs (or a data structure of some sort) and then combining the inputs in a specified manner, returning the final combined value. It’s easy to see how this paradigm can be applied to searching or sorting. Map would search small chunks of the given data (in parallel) and Reduce would then combine the various results back together into a single output for the user.

But it’s not just searching or sorting that can use the map-reduce approach. Whenever you have to apply an operation over and over again to multiple independent inputs and then combine the results into a single unified result, you can use some implementation of map-reduce. Things get difficult if you have dependencies between the inputs, and it’s these dependencies that make parallel programming difficult.

So now that I’ve told you about the wonders of MapReduce, you want to go play. That’s understandable, but you’re probably not in charge of one of Google’s data centers and so don’t have direct access to Google’s powerful MapReduce infrastructure. So what do you do? Well let me introduce Hadoop: Hadoop is an open-source implementation of MapReduce in Java designed to make it easy to break down large scale data processing into multiple small chunks for parallel processing. It implements a distributed file system to spread out data over multiple nodes (with a certain amount of redundancy) and processing data chunks in place before combining them back. It’s currently been run on clusters with up to 2000 nodes. Hadoop might not be as powerful as Google’s MapReduce, (after all Google has deep control over both the hardware and software and can fine tune for maximum performance) but it will get the job done. So what are you waiting for? Go find some reasonable gigantic data processing problem and MapReduce it into insignificance.

To object orient or not to object orient

That is the question. With my computer science course coming to end soon, I’ve been thinking about programming languages a lot. We’ve been using Java exclusively for this course and personally I don’t really like Java that much. My favorite language for application programming is Python, while for learning about computer science in general Scheme is my top choice. One of the things that I’ve been thinking about over and over during the semester is Java’s object orientation and type system, both of which I have mixed feelings about.

Let’s start with object orientation: As every computer science student knows, an object is a combination of data and methods to manipulate that data. That’s a very good way to model the interactions in a computer program and when used and implemented properly it is a very powerful tool. However what most students don’t realize is that object orientation is often one layer of complexity too many. There are often situations where you just need a simple function to perform some sort of manipulation on numerous data inputs. There is little data actually associated with the function itself, most of it comes from outside, and so forcing into an object-oriented methodology is clunky and inefficient. But Java doesn’t let you have free standing functions: functions aren’t first class citizens. This leads to some interesting workarounds. You have to create a wrapper class obviously, but after that things get interesting because Java has static methods. Unlike normal methods (for which every object of a class gets a copy), there’s only one copy of a static method which all the objects share. This makes sense conceptually, somewhat. If you have a sort method, you don’t need multiple copies of the sort method, you just want one method through which you can pass whatever it is that you want sorted. However it’s still tied to a class somewhere, which means that something is still “owning” that function (which doesn’t quite make sense). This is perhaps my biggest gripe about so-called “pure” object-orientation: there are some things that don’t conform well to the object-oriented ideology (and the other end is pure functional programming, but that’s a topic for another post).

So Java doesn’t have allow functions as first class citizens. I would like Java more if its object orientation was completely uniform, but it’s not. Not everything is Java is object-based. Java’s system for data types is a mix of objects, primitive types and arrays, each with its own different style. You can’t subclass either primitives or arrays. Primitives aren’t objects (I guess thats why they’re called primitives) and arrays are objects without classes. It’s something of a mess. This can lead to problems later on when you try to work with the various data structures that make up the Java Collections Framework. You can’t use the primitives as part of a collection, you have to use its corresponding wrapper class. That isn’t really a problem per se, but it’s one more thing to remember. By and large I do like the Java Collections Framework and the Java generics, though on occasion it can make code hard to read due to the smattering of angle brackets it entails.

However one aspect of the Framework that does irritate me from time to time is Iterators. To access the elements of a collection in order, each collection class must have an embedded Iterator class. To iterate over the elements of the collection, you have to create an instance of the iterator class and then use that. I personally would have liked to see iteration methods built into the collection classes themselves, but that is not something that is going to happen. Truth be told, iterators do get the job done. At the basic level they let you move forward over the collection and delete elements on the way. I personally try to avoid the use of separate iterators, I prefer using the “smart” for loop. This version of the for loop is similar to what Python has, it creates an object of whatever type the collection objects are and then uses that to access each object in turn. Unfortunately, the smart for loop fails to give you full access to the underlying iterator. I ran into this during my last exam when I tried to use a smart for loop to deletions to a TreeSet while moving through it. I couldn’t, and while I understand the technically difficulty, I know enough to know that it would have been possible and would have made the smart for much more useful. It wouldn’t have worked for plain arrays though because arrays don’t have underlying iterators.

So what has Java taught me about object orientation? Firstly, one size never fits all. Object orientation isn’t the perfect model for software development and it’s important to give programmers a choice, in this case: first class functions. Secondly, consistency is important. One of the reasons that C++ is gradually falling from grace as a language for large system development is its very inconsistent syntax, especially when you start doing more advanced programming. While Java is much more comfortable than C++, there are enough inconsistencies to make it occasionally hard to use. However all things said, both Java and object orientation are powerful tools for software development and its important to learn them both well, but also to realize that in certain circumstances there are more suitable tools.

« Previous PageNext Page »