Compilers and nuclear reactors

This summer I’m doing an internship at a small, research-y software shop called GrammaTech. They do a lot of research projects and build a static analysis tool called CodeSonar that is really quite nifty. I’ve been poking around the CodeSonar codebase as part of my work and I’m pretty impressed by the kinds of things they do. Most of it is in C and I can safely say I’ve never seen C written this way before. They leverage C’s close-to-the-metal nature and the preprocessor to create some pretty powerful abstractions. It makes me painfully aware of how much I still have to learn. While marveling at what a deep knowledge of C can do, I came across “Moron Why C Is Not Assembly” by James Iry – an explanation of why C is not just syntactic sugar for assembly language.

As I learn more about programming in general and programming languages in particular, I’m starting to think that there is something of a Law of Conservation of Power, vaguely akin to the conservation of matter and energy (but not quite as absolute). For example, Iry talks about how C enforces the stack abstraction and hides any parallelization in the hardware (or in the generated code). By moving from the assembly world to the C world you’re trading one form of power for another – you obey the constraints of the stack, and function calls get much simpler. But you lose the ability to fine-tune using dedicated hardware instructions.

This stays true as you continue exploring more languages – give up the looseness of C’s weak type system for something stricter and stronger (ML with its algebraic datatypes, for example) and you can have the machine enforce invariants and guarantee certain properties of your code. You can perform easier and more elegant matches and actions on the type of your data. But you give up the flexibility that comes with raw data and fundamentally unstructured bits. If you choose strict immutability and pure functions (like Haskell or Clojure) you get to interact with your datatypes in a more mathematically precise form, and you get to reap the benefits of concurrency without worrying about data corruption (to some extent). But you lose the ability to quickly stash some metadata into the corner of some mutable data structure and pull it out at will.

If we start viewing different languages as tradeoffs in the power available to the programmer then a compiler becomes a very special tool – a power transformer, akin to a nuclear reactor (or a Hawking’s Knot if you don’t mind some futurism). A compiler, at its core, takes a stream of symbols (according to predefined syntactical rules) and transforms them into another stream of symbols (according to predefined semantic rules). Along the way, the compiler is responsible for enforcing the power transforms – ensuring that your C code doesn’t get to the underlying stack, that your Haskell code obeys the type constraints. If our languages are tools with which we build our universes, our compilers enforce the laws of physics (whatever laws they may be). The reason we’re computer scientists and not physicists is that we can create and change these laws at whim, instead of only studying and exploring them.

Just as we don’t want to be putting nuclear reactors in SUVs, there isn’t a “best” power tradeoff. Do you need to be really close to the metal on a resource-constrained embedded system? Use C. Do you need a guarantee that a successful compilation will rule out a certain class of runtime errors? Use ML. If you need a fuel-efficient vehicle to take you to work every day, get a Prius. If you need a self-sufficient, water-borne weapons platform that only needs to be refueled every few decades and can rain down vengeance on your enemies should the time come, then invest in an aircraft carrier or a nuclear submarine. Don’t bring a knife to a gunfight.

There are two big elephants in the room: Turing-completeness and Lisp. All Turing-complete languages are strictly equivalent in their computational power, but that misses the point of this discussion. Most programmers are not writing code for machines at all: they are writing programs for programmers (including themselves) to read, understand and use (and only incidentally for machines to execute; thank you, SICP). When you change the rules of the game from strict computational power to expressiveness and understandability to another human, this Conservation of Power thing becomes much more important. Choosing the correct set of tradeoffs and balances (and hence the correct language and related toolsets) becomes a decision with far-reaching impact on your team and project. Make the wrong choice and the Turing tarpit will swallow you alive and surrender your fossilized remains to a future advanced insectoid race.

Lisp is another matter entirely. The so-called “programmable programming language” has been used to build everything from operating systems to type systems which are Turing-complete in themselves. As Manuel Simoni puts it, in the Lisp world there is no such thing as too much power. Lisp laughs flippantly at the Law of Conservation of Power by placing in the programmer’s hands the enforcer of the Law – the compiler itself. By virtue of S-expressions and macros Lisp allows and encourages you to play God. The natural way to program in Lisp is to “write code that writes code” – creating your own mini-languages tailored to the task at hand. With great power comes great responsibility, of course, so the Lisp programmer must be particularly careful.
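For a taste of what “code that writes code” means, here’s a deliberately crude sketch in Python rather than Lisp: a function that assembles the source of another function as a string and exec’s it. The make_getter helper is invented for illustration; a Lisp macro does the same job far more safely, operating on structured S-expressions with the full compiler behind it rather than on raw text.

```python
# A crude Python analogue of a Lisp macro: generate the source text of a
# specialized function at runtime, then compile and run it with exec().
def make_getter(field_name):
    # field_name is a hypothetical example; any valid identifier works.
    source = (
        f"def get_{field_name}(record):\n"
        f"    return record['{field_name}']\n"
    )
    namespace = {}
    exec(source, namespace)  # define the generated function in `namespace`
    return namespace[f"get_{field_name}"]

get_age = make_getter("age")
print(get_age({"name": "Ada", "age": 36}))  # prints 36
```

In Python this is a party trick you should almost never pull; in Lisp it’s the idiomatic way to grow the language toward your problem.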

I haven’t explored Lisp as much as I’d like to and I’m only just starting to look into ML and Haskell. But as a career programmer (or some approximation thereof) I think it’s a good idea to routinely move between various points on the power spectrum. That’s not to imply that it’s a perfectly linear scale, but that’s a matter for another post. As my experience at GrammaTech is showing there are delights, wonders and challenges no matter where you decide to plant your feet.

Languages abound

After almost a month and a half I’m back in a position to write The ByteBaker on a regular basis again. Instead of a lengthy explanation and reintroduction, I’m going to dive right in.

At the end of August I’m going to be starting my PhD program in Computer Science at Cornell University. Over the last few years of college I’ve developed an interest in programming languages and so I’m spending the next few years pursuing that interest (and hopefully writing some good software in the process). Programming languages are an intensely mathematical and logical area of study. In fact, I’ll admit that I am a bit intimidated by the amount of knowledge about the logical foundations of PL I’ll have to gather before being able to make a meaningful contribution. But on a personal level, it’s not really the mathematical rigor or the logical elegance of these systems that I find interesting. For me, programming languages, just like real languages, are a medium of expression.

In the end, what we’re really trying to do is express ourselves. We start by expressing the problems we want to solve. If we do it well (and our languages are expressive enough), the expression of the problem leads to the solution. If we do it not-so-well, or if the problem is particularly complicated, we have to express the solution explicitly. In addition to problems and solutions, we can express ideas, data, relationships between data and complex interacting systems of relationships and data. Again, the greater the expressive power of our languages, the easier our job becomes. We want to express the way information should flow through our system. We want to constrain what relationships and flows are possible. We want to specify the dependencies (and independencies) between parts of our systems, and we’d like to build our systems out of well-defined components that behave according to fixed and automatically enforced rules.

Just as there are an infinity of things we’d like to say, there are a multitude of languages to speak in. And just as we want to choose the right tool for the job, we also want the language that gives us the right level of expressiveness. That’s not to say that there is a linear scale of expressiveness. At one level all Turing-complete languages let you say the same things — if you’re willing to bend over backwards to varying extents. Some languages allow us to say more things, and with less effort, than others. But some languages just let us say different things. And even though you could twist and turn almost any language to say almost anything, I’ve come to feel that all languages have a character — a set of assumptions, principles and decisions at the core that affects how the language and systems built with it work.

Languages don’t stand on their own. And despite my love of languages for themselves, they’re only as important as the systems we build with them. Human languages are important and beautiful in and of themselves, but they’re far more important because of the stories, histories and wisdom they embody and allow to be expressed and recorded. Our computer languages are valuable because they let us express important thoughts, solve important problems and build powerful systems to make our lives better. Again, we have done great things with the most primitive of tools, but if you value your time and energy it’s a good idea to pick the right tool for the job.

Personally, I’ve always been something of a language dabbler, probably a result of being in college and needing to move to a different language for another course every few months. In my time, the languages I’ve done significant amounts of work in are C, C++, Java, Python, Ruby, various assemblers and Verilog (not a programming language per se, but a medium for expression nonetheless). I wouldn’t consider myself an expert in any of them, but I can definitely hold my own in C and Python, maybe Ruby too if I had to. Then there are the languages I want to learn — Lisp in general (Scheme in particular) and more recently Haskell and Scala (a newly discovered appreciation of functional programming and strong type systems). They’re all different mediums of expression, each letting you say different things, and there’s certainly a lot I want to say.

As a PhD student in programming languages, my job is not to become an expert in different languages (though I guess that could lead to an awesome consulting gig). My job is eventually to make some contribution to the field and push the boundaries (and that involves convincing the people standing on the boundary that it actually has been pushed). Luckily for me there are definitely lots of ways in which the state of the art for programming languages can be pushed. However, to do that I first need to know where the state of the art currently lies. Over the next few months (years?) I want to get deeper into studying languages and the ideas behind them. To start off, I want to explore functional programming, type systems and macros. And I’m sure those roads will lead to more roads to explore. Yes, you are all invited along for the ride.

Mono, Clojure and the price of Free

If you’re at all interested in Open Source or Linux you’ve probably heard about the debate surrounding the inclusion of Mono and Mono-based apps in the default Ubuntu distribution. Here is the original post that seemed to trigger it all. Here is the reply that was posted by Jo Shields, a member of the Debian Mono group. Richard Stallman also weighed in on the matter, in favor of removing Mono. Now I think having ideals and standing up for them is great and that people should always stand up for what they believe in. However, it’s one thing to stand up for your ideals and do what you have to; it’s quite another to use your ideals as an excuse for making people use something that is of lower quality and that they don’t want.

I think that people who advocate open source often forget that a lot of people (including a lot of people working with computer technology) care less about the licensing of their software and more about whether or not it works well. The reason open source succeeded isn’t so much a religious zeal to use only Free products but rather the fact that it let people open up the innards of their software and make changes so that things worked better. The reason I use Linux over Windows is that I can start up in 30 seconds as opposed to 5 minutes and have things organized just the way I like. And I like OS X better for managing my media and doing graphics-related stuff.

The second bone I have to pick regards all the talk of “alternatives”. The original post says “There are alternatives to every Mono application that for the most part are better” and quite conveniently fails to mention any of them. Stallman says that the ‘problem’ is with applications like Tomboy which depend on Mono, and also fails to mention any alternatives that would not depend on Mono. It’s this sort of condescending attitude towards developers who are doing real work that really irritates me. Stallman is a very inspiring figure who has done a lot of work for open source, but let’s not forget that a lot of ‘pure’ free software GNU projects have been stalling for years while alternatives leap ahead. Free software owes its position today in no small part to the Linux kernel, while GNU’s own Hurd kernel, which started development before Linux, is still unusable for most intents and purposes. The Guile scripting language which was supposed to become standard across all GNU tools is also mostly a pipe dream. GNU’s Flash replacement is mostly inadequate. GNU Emacs is a wonderful piece of technology, but it’s losing out to newer IDEs. I personally believe these IDEs are inferior to Emacs in many ways, but they are much more in tune with modern developer needs while Emacs in many ways is stuck in the past. “Show me the code” was a popular slogan when Microsoft threatened to use its patent portfolio against Linux; I think it’s time Open Source enthusiasts followed their own slogans.

Finally, this whole business of “Microsoft is evil” is getting tiresome. Yes, Microsoft did bad things. Yes, they’re out to get Linux and free software. Maybe Novell is a sellout. So? I don’t like them either. Don’t use Windows and close down your Hotmail account. If you read Jo Shields’ reply, you’ll see that Microsoft had no hand in the development of Mono and the agreements with Novell do not cover it. Tomboy and F-Spot are free applications running on a free runtime system. Mono is a reimplementation of a Microsoft standard, containing no Microsoft code. If you still think that Mono is ‘tainted’ because of the association, then that’s your choice and you can take Mono off your system and vow never to use Mono-reliant apps. As a user of free software that is your choice. But it is neither your right nor your duty to force other people to do the same. Choice is at the core of free software, and if you are trying to tell people that they can’t use or distribute some piece of software, then you are no better than Microsoft or other commercial software makers pulling users into vendor lock-in.

At this point, I’m going to change track and talk about a post that appeared as a blip on Reddit a few days ago. It’s called Thumbs Down on Clojure and it basically denounces Clojure as a false Lisp because it interfaces tightly with Java and so has to abandon a lot of the ‘pure’ principles behind real Lisp. Now I have nothing at all against the author’s actual post and I actually like it that he is speaking his mind. The author drew a lot of flak from Reddit users because he placed purity above practicality and more importantly because he had nothing but vaporware to offer as an alternative.

To tell the truth, I have very mixed opinions about what this guy is doing (and talking about). Having seen the power of Lisp, I understand the attraction of purity. Having struggled to find a good cross-platform UI toolkit that doesn’t take a CS major to install, I believe that when it comes to software, practicality trumps purity every single time (though pure and practical can coexist). And a bird in the hand is worth two in the bush. Though I use Linux and OS X, I have issues with modern operating systems in general. I would love to have an operating system with a fast, lean core but with a uniform, powerful, managed runtime on top in which all user applications are built and which achieves close-to-the-metal performance. I have some ideas on how it could be done, but I don’t have the knowledge or the resources to implement something like that. And till I’m in a position to make real contributions to the world around me, I don’t think it’s fair for me to deride the work that others are doing to make the world a little better for all of us.

The core thread behind both the above issues is that free software really isn’t free at all. There is a significant time and energy investment behind every single line of open source code that you’re using, which includes the software that runs this blog and probably the browser you’re using to read it. Ideals are well and good, but there are real people behind all the free software we use today. Ultimately, the software and ideals don’t really matter; it’s the people that do. The people who make the software, the people who believe in and stand up for the ideals, and the people who can benefit from both the software and ideals without consciously taking part in either process. It makes me laugh when people like Stallman talk about software being ‘ethical or not’ when the only things that are really ethical are people and their actions. It almost makes me cry when people laugh at and deride the work of other people but are incapable of doing anything about it. This isn’t the first time that I’ve heard a Lisp admirer say that they’re appalled at the state of modern Lisps. It’s an understandable sentiment, but what’s equally disturbing to me is that all these smart people (and some of them are really smart) can’t seem to pull together and create something that can blow the socks off all the supposedly substandard solutions that people are using to get real work done right now.

So where does this leave all of us? The same place as where we started. Neither the Mono controversy nor the Lispers’ pining for the good old days is really anything new. Writing good software is hard and always has been, even with the best of tools. And it doesn’t help anyone to criticize productive projects if you don’t have an equally well-working alternative. Opposing ideas on a purely ideological basis is often not the best idea (though there are notable exceptions which must not be forgotten). I think Linus Torvalds probably sums things up best in two quotes. The first one, as I already mentioned, is “Talk is cheap. Show me the code.” The other one, which is equally valid given the circumstances and the rest of this article, is “Anybody who tells me I can’t use a program because it’s not open source, go suck on rms. I’m not interested. 99% of that I run tends to be open source, but that’s my choice, dammit.”

So what are you all waiting for? Start showing some code.

Hope, scarred and bleeding

“The Heavens burned, the Stars cried out. And under the ashes of infinity, Hope, scarred and bleeding, Breathed its last.”

Ulatempa Poetess
Elegy for the Commonwealth

The International Lisp Conference is on right now and one of the interesting things that happened is that MIT’s Gerald Sussman talked about why MIT’s famous 6.001 course uses Python instead of Scheme. I hadn’t known that MIT had switched to Python from Scheme, and I must say that I’m very sad (as the dramatic opening quote makes obvious).

Truth be told, MIT has always been something akin to Holy Ground to me. It is deeply involved with computer science history and lots of great discoveries have come from there and from the people who worked there. 6.001 has to some extent become famous as MIT’s hallmark course. It’s the course that spawned the wonderful Structure and Interpretation of Computer Programs, a book that I think all serious programmers should read at some point in their careers, preferably sooner rather than later. Since the ’80s, 6.001 has used the Scheme programming language, a lightweight version of Lisp which is itself one of computer science’s shining accomplishments. Now it uses Python.

So what’s the big fuss about? Well, there are a number of different reasons. 6.001 has always been meant to be a hard course, designed to fundamentally change the way students think. It’s not a simple programming course. If you read the introduction to SICP (and then the rest of it), you’ll see that 6.001 is more of a course in engineering philosophy and techniques which only incidentally uses computers and software. I’ve been reading SICP in bits and pieces for about a year and a half and the things I’ve learned have made me a much better programmer, engineer and thinker. Is it possible to have the same sort of teaching and learning experience with Python instead of Scheme? Perhaps. I’m not sure. But that’s just the tip of the iceberg.

Reading the reasons Sussman gave for replacing Scheme with Python is very distressing. Basically it comes down to the admission that the software industry is, for the most part, one giant mess. Software isn’t well documented and the interfaces are inconsistent. As a result you spend a lot of time tinkering with 3rd-party libraries to figure out how they actually work as opposed to how they are supposed to work. All this flies in the face of what 6.001 was supposed to teach: great engineering works are built out of simpler, smaller components that are well understood and encapsulated. I’ll be the first to admit that what Sussman says is completely true; I’ve experienced it myself more than once. This hasn’t been helped by the proliferation of CS courses that teach students to be dependent on frameworks and third-party libraries without teaching them how those frameworks work or what to do when they break. It’s part of the game and something you have to learn to deal with.

But it seems to me that Sussman’s statement is like throwing in the towel. If you can’t beat them, join them, that sort of thing. I’m sure there’s more to the argument than just that, but it would be hard to avoid the fact that 6.001 has been substantially toned down. There is no way students are going to learn what good software is unless they have some experience of making it themselves. There’s a reason why math students still work on problems even though Mathematica is a great tool. Python is a great beginner’s language, but 6.001 wasn’t exactly made for beginners. Referring back to the preface to the first edition of SICP, the authors say that many students have had experience of computers. Let’s take some time to remember that when this was written decades ago, ‘experience of computers’ didn’t mean playing games, using Office or browsing the web. It probably implied a general understanding of how computers work and some amount of programming. Times change, and one could say that 6.001 is just adapting to the times. That could certainly be true, and if it is, it’s a sign that the times haven’t changed for the better.

If students are starting programming at a later age, and hence aren’t accustomed to the very demanding logical thought required by 6.001, then it’s a sign that the software industry is in grave danger. Let’s face it: 4 years are not enough to learn how to be a good programmer or software engineer. 4 years working full time at it wouldn’t be enough. Now throw in the typical trappings of college life: other courses, sports, parties, relationships. The time most CS students spend actually practicing their skills goes down dramatically. And it’s not like computer science is easy, we all know that. After graduation, a job means that you have even less time and inclination to improve yourself. CS education needs to start early. The best programmers I know (both personally and otherwise) are the ones who started young, in their mid-teens mostly. The industry depends on mediocre programmers who are then most likely doomed to stay mediocre because they missed the time of their lives when they could have learned the most.

You see, the problem and the feelings of hopelessness go beyond 6.001. Being a student myself, I see a vast disparity in the skills of my fellow CS majors. What’s more terrifying is how low the low point is. A freshman I know is currently building a web framework for automatically converting Java object models to visual representations which can be manipulated. Think of it as interactive UML on steroids. It wouldn’t be an exaggeration to say he’s probably the only one in my school who can do that, me included (though our CS department is probably under 75 students). In one of my 300-level classes, a student claimed he didn’t know what a binary representation of an integer is. How can you be a sophomore CS major without knowing what a binary integer is? And my CS department is one of the more rigorous ones I know of. We use a wide variety of languages and tools, have a separate curriculum for non-majors, and the requirements for graduating with honors are pretty strict. The professors routinely give students experience outside of class by involving them in research projects.

In some respects, you could say I don’t need to care. I get good grades and I’ve seen and know more than most of my fellow students. But it’s not me I’m worried about. Not yet at least. I’d like to make a lasting contribution to computer science at some point in my career, but equally importantly I’d like to work with people who love computer technology as much as I do and are also skilled and dedicated enough to work on something that could be groundbreaking. I’m really afraid that there might be fewer people like that in my age group. Perhaps that’s just because I live in a small school, but if there aren’t going to be any more 18-year-olds spinning Scheme abstractions with ease, then my fears seem terribly close to justified.

I’m going to stop now because I feel like I’m screaming at the top of my lungs in an empty room. And I have to call home. I’m willing to believe that the picture isn’t as bleak as it seems because I see people like my freshman friend. But it takes a good amount of will to keep that belief up. If you’re in college (or know someone in college) and would like to get together and maybe reverse this trend of approaching mediocrity, please get in touch with me. I could really use a hope infusion right now.

PS. On an equally terrifying note, there’s only one girl in my software engineering class of 27, but that’s not something I’m going to try to explain any time soon.

Languages, markup and quote unquote

I was at a lecture the other day and I heard the speaker say ‘quote … something something … end quote’. I don’t like it when people say ‘quote … end quote’ to show that they are quoting another source. I think it’s a very ugly way of talking. It seems that if you need to actually say ‘quote … unquote’ to let your listener know when a quotation starts or stops, then you might as well start saying your commas and full stops out loud. Human-to-human communication is very interesting, especially when it is live. When two or more people speak, there’s a lot more going on than a simple interchange of meaningful words. The speaker’s tone of voice, inflection, and the presence and duration of pauses all combine to give a lot more information than the words on their own could. Good speakers use all these properties of verbal communication to make their words more effective and meaningful. I think that a proper use of vocal tone, rhythm and pausing can make it clear when a quote starts and when it ends. Unfortunately, it seems that most people don’t really think about this. There seems to be a thought that the presence of quotation marks in text means that there needs to be an equivalent expression when speaking. But this idea stems in part from an important fact being forgotten: spoken and written language are not the same thing.

Take a look at any piece of text. This post that you are now reading will do perfectly. When a language is written down, a lot more information goes down than is actually spoken. Punctuation marks are the prime example. They’re special symbols that represent the elements of natural speech that don’t correspond to actual words. Periods and commas represent pauses. Apostrophes represent the omissions that are made as two words are merged to make speech easier or faster. They’re efforts to better approximate the parts of speech that aren’t words. But of course they don’t go the whole way. Rhythm, speed and tone are hard to put in written form, at least for a language like English. Humans have been communicating with other humans for thousands of years now and we’re still not perfect at it.

This imperfection isn’t just limited to natural speaking or writing though. Computerized communication and information have their own set of problems. As an example, take the history of markup languages. Wikipedia defines a markup language as “a set of codes that give instructions regarding the structure of a text or how it is to be displayed”. Plain text has always been a sort of ‘natural’ medium for computers to store information. However, simply recording the contents of the text is often not good enough. You want some extra information to tell different parts of the text apart. You can use this information to control how the text is displayed, or to mark that it has some special meaning to the computer. The TeX typesetting system uses a markup language to give very precise instructions to a program (called tex) on how different parts of a document should be arranged when printed. The printing can be to paper, as originally intended, or to some electronic format like PDF.

While TeX is focused on presentation, SGML and its popular derivatives HTML and XML target a different problem: what does the text actually mean? How it looks is handled by stylesheets (CSS for HTML, XSLT for XML), which match the semantic components of the text to a corresponding presentation style. This means that the meaning of the text can be kept separate from presentation and that multiple styles can be applied to the same content. And lest anyone tell you otherwise, presentation is important. Really important. The first iterations of HTML lacked any clearly thought-out presentation rules, leading to a proliferation of custom HTML pseudo-tags and of table-based layouts. CSS helped the situation, at the cost of layering on yet another language with its own set of standardization requirements. XML is becoming very popular as a data interchange format; however, it is not without its share of flaws. Firstly, it’s very verbose. Writing any form of XML by hand is not an enjoyable task. It’s also hard to throw together a quick regular expression to make sense of a large XML structure, though this is mitigated in part by the presence of some really good XML manipulation tools.
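To make the regex-versus-parser tradeoff concrete, here’s a toy sketch in Python (the snippet and element names are invented for illustration): a quick regular expression and a proper XML parser pulling out the same value. Both work here, but only the parser keeps working as the document grows nesting, attributes and namespaces.

```python
import re
import xml.etree.ElementTree as ET

# A flat, trivial document where a regex is perfectly adequate.
snippet = "<book><title>SICP</title><year>1985</year></book>"

# Regex approach: fine for a one-off, brittle as the structure grows.
title_via_regex = re.search(r"<title>(.*?)</title>", snippet).group(1)

# Parser approach: more machinery up front, but it understands the tree.
root = ET.fromstring(snippet)
title_via_parser = root.find("title").text

print(title_via_regex, title_via_parser)  # prints: SICP SICP
```

The point is exactly the one above: the regex is the quick throwaway tool, the parser is the heavyweight that earns its keep once the data stops being trivial.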

So much for the acronyms. Where were we? Information interchange. We still haven’t figured out the best way to do it, or even agreed on a good way to do it. Inert static data is one thing, but changing active data is quite another. Computer programs are in essence information. In the Lisp programming language there is no need for any differentiation between code and data. Code can be represented very easily as tree-like data structures and manipulated just as easily. This gives Lisp a level of expressive power that is orders of magnitude above any other programming language. Going the other way, you can just as easily transform your data into code. If you can find a way to store your information as Lisp-style S-expressions, then you can turn them into self-manipulating programs later. Of course, no one said it would be easy. Steve Yegge has a slightly dated but very appropriate article on just this topic.
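As a rough illustration of that code-as-data idea, here’s a minimal Python sketch: arithmetic “programs” stored as nested lists (S-expressions, more or less) and a tiny evaluator that walks them. In Lisp this duality covers the entire language, not a toy arithmetic subset, which is exactly where its expressive power comes from.

```python
import operator

# Map operator symbols to functions; this tiny table is the whole "language".
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def s_eval(expr):
    # Atoms evaluate to themselves; lists are (operator, arg, arg) forms.
    if isinstance(expr, (int, float)):
        return expr
    op, left, right = expr
    return OPS[op](s_eval(left), s_eval(right))

# The data ["+", 1, ["*", 2, 3]] is the program (+ 1 (* 2 3)).
print(s_eval(["+", 1, ["*", 2, 3]]))  # prints 7
```

Because the program is just a nested list, any code that builds or rewrites lists is, by the same token, building or rewriting programs.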

So how does all this affect you and me? Honestly, I don’t really know. I know that many of the tools we have (both in terms of natural and computer language) are very good, but there certainly isn’t one size that fits all. You really shouldn’t be saying “quote … endquote” when you can change the way your voice sounds to signify the same thing. You shouldn’t be using XML when a simple regexp-parseable format would do. But you shouldn’t avoid XML if the alternative is writing your own deterministic finite automaton that will soon rival an enterprise-strength XML parser in complexity. The problem, as they say, is choice. Make yours carefully.