A partial solution to the format problem

One of my pet peeves is the format problem: there are too many mutually incompatible formats out there for you to choose from if you have some data you want to store and present to a large group of people. Since I don’t use sound or video (or even images) that much, the subset of this problem that I’m concerned with relates to text data.

The problem goes something like this: suppose I have a large-ish chunk of text that I want to send out to a lot of people, how should I do it? First off, we’re going to assume that we will be composing the text electronically on a computer, but we will have to hand it out both electronically and in print. Second, our goal is to have as few distinct copies of the text as possible so that we don’t have to go about editing a bunch of different files if we make a change. For the electronic copies, the end viewers should not have to install any special software to see what we send them.

The most common solution that the average computer user would use is to make a Word document. Like it or not, Word is a mostly universal format for exchanging documents. The chances that someone does not have Word installed are really quite slim. However, just because it’s popular doesn’t mean it’s good. Considering that I’m a Linux and Open Source enthusiast (and so are a lot of the people I communicate with), I can’t depend on them having Word and I feel slightly guilty about using something proprietary. Also, even if Word were free, there are excellent arguments to be made against using word processors in general and I agree with them.

Personally, I haven’t used Word for serious document creation for about 3 years now. However, the alternatives aren’t very easy to come up with. It’s taken me a few years, but I’ve finally come to a system that I can use full time. The main realization for me was that I create two main types of text documents: ones that will be printed and given out to professors and other students, and electronic documents that I put on the web and will not print myself. For a long time, I wanted a way to be able to do both in a single shot: I wanted a format where I could create good looking webpages as well as pretty print output. Unfortunately, I haven’t yet found anything that is quite so easy to use.

Instead I’ve settled for two partial solutions. For printing, the good news is that Latex is really the state of the art when it comes to preparing good looking print documents. I’ve never made documents in Word that looked as carefully put together as a Latex document. Sure there’s a learning curve, but it’s one that I’ll happily live with in exchange for great looking documents.

Technically Latex can be converted to HTML. But I’ve never done this because I’ve never really run a website in pure HTML. In the old days, I used Dreamweaver and as I started blogging I used Blogger and now WordPress. It’s only recently that I started to keep a small static site in HTML. In the process of making this site, I’ve realized that the set of writings I print and distribute and the ones I put online are mostly independent of each other. On my website for example I have a page about my computers. That’s something I can’t see myself printing to give to someone else. At the same time, I’ve done a lot of writing this semester that I haven’t put online because it is in a state of flux and I’m not ready to share it with the world at large yet.

It doesn’t make sense for me to write a lot of Latex source for something that I could write using a plain-text markup and then convert to HTML. Also, the advanced typesetting used by Latex is lost in webpages. Lately I’ve started using Markdown to write my webpages in plain text and then generate HTML which I can insert into templates. This is a very simple solution where the source is easy to read and create but the output looks good and is simple to generate.
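
As a rough illustration of that workflow, here is a minimal Python sketch. It is not a real Markdown implementation: the `tiny_markdown` function only understands `#` headings and plain paragraphs, and it, `render_page` and the `PAGE` template are all names made up for this example. A real setup would hand the text to a proper Markdown converter instead.

```python
import html
from string import Template

# Hypothetical page template; $title and $body are the placeholders.
PAGE = Template(
    "<html><head><title>$title</title></head>\n<body>\n$body\n</body></html>"
)


def tiny_markdown(text):
    """Convert a tiny subset of Markdown (# headings, paragraphs) to HTML."""
    parts = []
    # Blank lines separate blocks.
    for block in text.strip().split("\n\n"):
        block = " ".join(line.strip() for line in block.splitlines()).strip()
        if not block:
            continue
        if block.startswith("#"):
            # The number of leading '#' characters gives the heading level.
            level = min(len(block) - len(block.lstrip("#")), 6)
            content = block.lstrip("#").strip()
            parts.append(f"<h{level}>{html.escape(content)}</h{level}>")
        else:
            parts.append(f"<p>{html.escape(block)}</p>")
    return "\n".join(parts)


def render_page(title, markdown_text):
    """Drop the generated HTML into the page template."""
    return PAGE.substitute(title=html.escape(title),
                           body=tiny_markdown(markdown_text))
```

Calling `render_page("My computers", source_text)` returns a complete page as a string, so the plain-text source stays the only file that ever gets edited by hand.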

What I’ve settled on is to use both Latex and Markdown/HTML, but independently of each other. Things that need to get printed, such as papers and later my theses, will be typeset in Latex where I’m guaranteed to have good looking output. Anything else that I put online (such as my site) will be Markdown automatically converted to HTML.

Of course, this isn’t a complete solution at all. On the one hand, some of the things I write in Latex, I might want to put online someday. Then there are things like this blog that are neither Markdown nor Latex, and that is something I haven’t talked about yet. But it’s a good start and it’s something that I find fairly comfortable using right now.

Too many formats

For most of last week, I’ve been evaluating various options for starting a public, personal wiki. I’ve looked at a number of solutions, both large-scale by wiki providers and homegrown using open source tools. I still haven’t made my decision and to tell the truth I am getting a little frustrated at this point. However, if there’s anything that I learned, it’s that there are way too many formats out there.

I’m not talking about large-scale industry standard formats like HD-DVD vs Blu-Ray or something on the level of paper vs digital. I’m talking about much simpler things like the variety of publishing formats and ad-hoc text formats floating around in cyberspace. The lingua franca of the internet is still HTML, at least if you want a simple website as opposed to something running on a content management system. However, if you’ve done any sort of web development, then you’ll know that building websites from scratch using HTML is not fun. It’s a close relative of XML after all, and no one should have to write XML by hand.

Even without straight HTML, there’s still a ton of formats to choose from. If you want others to read something you’ve written, what do you choose? You could use a word processor format like .doc or the newer OOXML (.docx) or if you’re more of an open source fan, OpenDocument might be your thing. But it’s a bit harder to spread an office file like that. You can’t just drop it onto a webpage unless you convert it to HTML first. You could email it to people, but that doesn’t scale and most people might not care enough to actually open it. The same arguments go for PDFs.

If you really want a lot of people to read what you’ve written, you want to provide it in a form that’s accessible via a plain browser, hopefully without plugins. You could use something like Scribd’s iPaper, but I think that’s more useful if you have a complicated PDF that you want to show without people having to actually download and open the PDF. It’s a bit overkill for normal text. But we already decided that writing HTML is bad and so like the hackers we are, we’re going to find a way around it.

You could go the CMS-route and use something like WordPress (for blogs) or MediaWiki (for wikis) or Drupal (for something more versatile). These have the advantage of having a pretty WYSIWYG interface and a whole list of administrative features. But they require you to have your own server or web space or find a free host like WordPress.com. But if you’re someone like me, you want a simpler solution that’s more under your direct control and that you can add on to later as your needs increase. The good news and bad news is that there are a bunch of simpler, human-readable (and writeable) markup formats that can be translated to HTML with fairly good results.

The good news is that these formats are simple and the HTML conversion tools are mostly open source and of good quality. The bad news is that there is a multitude of such markups, all of which are mutually incompatible. The most popular seem to be Markdown, Textile and reStructuredText. All of them have their strengths and weaknesses and I don’t really like any of them. If you can pick one and use it right, you can have a good experience. But you can just as easily push against the boundaries of what they offer and end up being frustrated by their limitations.

The above are all markups that are meant to be translated to HTML at some point, but can be read by people directly. Though they do allow some form of structuring (in the form of HTML headings) and allow inline HTML too, they’re not really all that good for highly structured text, like when you’re trying to make outlines for scholarly papers. There are other text-based tools to do that and my personal favorite is Org-mode for Emacs. It’s a package for Emacs that turns it into a powerful outlining, note-taking and organization tool. It lets you create an outline as a series of headings and subheadings (nested many levels deep) along with plain text and normal lists. The different levels can then be hidden arbitrarily, letting you take a bird’s eye view or just focus on one part. Many people use it for GTD or some other productivity system. I prefer Google Docs and Tasks for that, but Org-mode is a note-taking tool unparalleled in its simplicity and ease of use.

Org-mode uses a simple, custom text format to actually store any notes you make. It’s also human readable, so you can easily copy/paste it into an email and share with others. But without actually using Org-mode, it’s hard to exploit the format to its full potential. It becomes even less ideal when it’s exported to some other format like HTML. The Org-mode concept of headers doesn’t neatly map onto HTML headings. Org-mode encourages header nesting, which looks terrible in HTML unless you carefully lay out a CSS stylesheet for it. You can use the headers more judiciously, preferring to use plain lists, but that defeats the purpose of Org-mode to some extent. It seems to me that you can’t have the best of both worlds.
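
To make the mismatch concrete, here is a small Python sketch of the most naive mapping possible: the number of leading stars in an Org-mode headline picks the HTML heading level. The function name `org_outline_to_html` is made up for this example and this is not how Org-mode’s own exporter works. It only shows that HTML runs out of heading levels at `<h6>`, so anything nested deeper gets flattened.

```python
import html
import re

# An Org-mode headline is one or more stars, a space, then the title.
HEADLINE = re.compile(r"^(\*+) +(.*)$")


def org_outline_to_html(text):
    """Naively map Org-mode headlines to HTML headings, one per line."""
    out = []
    for line in text.splitlines():
        match = HEADLINE.match(line)
        if match:
            depth = len(match.group(1))
            level = min(depth, 6)  # HTML only has <h1> through <h6>
            title = html.escape(match.group(2))
            out.append(f"<h{level}>{title}</h{level}>")
        elif line.strip():
            # Anything else that is not blank becomes a paragraph.
            out.append(f"<p>{html.escape(line.strip())}</p>")
    return "\n".join(out)
```

An outline that goes seven levels deep comes out with its last two levels both rendered as `<h6>`, which is exactly the kind of structure that gets lost in the export.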

At this point any self-respecting hacker will point out that I could just stop whining and start writing converters from one format to another. After all, they all convert to HTML, so it can’t be that hard to convert between them, right? It’s probably not, and the Pandoc system does it to some extent, but the problem isn’t with the formats themselves as much as with the tools that work on them. HTML is good for publishing normal documents, but if you have something with many levels of nesting such as an Org-mode document, you really need something that doesn’t show all the information at once. Unfortunately there isn’t an easy way to do this without resorting to JavaScript or something similar. HTML in its raw form is still a very static format, presentation-wise, and it doesn’t always scale well to complex sets of information.

I’m going to take a break for a moment and think about what an ideal system would be like. It would be based on a simple text format that was explicit about what things meant, so you didn’t need a reference. One of my complaints about Markdown is that anything indented by 4 spaces gets treated as a block of preformatted code placed within <pre><code> tags. There’s no way you would guess this by looking at the plain text and it also makes list nesting and using indentation to present your text very awkward. But I digress. While you could write this text in a plain editor, the preferred way would be in an editor that supported folding and searching and all the other editing niceties. The editor would actually double as the presentation so that there wouldn’t be any export or rendering step. And it would probably be on the web. The editor would be in a browser and the actual data would be on a remote server. However, unlike many web services today, export and import to a local form would be a core strength. This way you could pull the bare text out of the editor and send it to others who don’t use the same tools as you without having to go through some silly registration/sign-up step.
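
The Markdown indentation complaint is easier to see with a toy version of the rule. The Python sketch below is a simplification I wrote for illustration (the name `render_blocks` is made up, and real Markdown also accepts tabs and has extra rules for lists): any block whose lines all start with four spaces is wrapped in `<pre><code>`, whether or not the writer meant it as code.

```python
import html


def render_blocks(text):
    """Render blank-line-separated blocks, treating 4-space indents as code."""
    parts = []
    for block in text.strip("\n").split("\n\n"):
        if not block.strip():
            continue
        lines = block.split("\n")
        if all(line.startswith("    ") for line in lines):
            # Every line is indented: strip the indent and emit a code block.
            code = "\n".join(line[4:] for line in lines)
            parts.append(f"<pre><code>{html.escape(code)}\n</code></pre>")
        else:
            paragraph = " ".join(line.strip() for line in lines)
            parts.append(f"<p>{html.escape(paragraph)}</p>")
    return "\n".join(parts)
```

Someone who indents a shopping list purely for looks ends up with a monospaced code block in the output, which is not something the plain text gives any hint of.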

I started this post thinking it would be about the markup, but now at the end it’s turned out to be about the tools as well. HTML is a great medium because it’s so easy to render and produce. However, our data needs are going beyond what a simple document-oriented format can easily support. I don’t think Org-mode or the like will ever become the de facto standard. But the fact that a lot of very smart people choose it over more ‘modern’ Web 2.0 stuff is a testament to the power of simple, easily editable formats. I wish we could standardize around a dual markup system that had a simple human readable form for quick writing and a more snazzy display form. I doubt that’s going to happen any time soon, so till then I’m looking for the perfect set of text tools to store my data and ultimately show it to the world. All this ties in directly to my efforts to have a wiki that’s easy to create, easy to back up and good to look at. More on that later.

The Documentation Problem

Over the past year and a half I’ve come to realize that writing documentation for your programs is important. Not only is documentation helpful for your users, it forces you to think about and explain the workings of your code. Unfortunately, I’ve found the tools used to create documentation (especially user-oriented documentation) to be somewhat lacking. While we have powerful programmable IDEs and equally powerful version control and distribution systems, the corresponding tools for writing documentation aren’t quite at the same level.

Let me start by acknowledging that there are different types of documentation for different purposes. In-code documentation in the form of comments is geared toward people who will be using your code directly, either editing it or using the API that it exposes. In this area there are actually good tools available. Systems like Doxygen, Epydoc or Javadoc can take formatted comments in code and turn them into API references in the form of HTML or other formats. With the API info right in the code, it’s easier to make sure that changes in one are reflected in the other.
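
The core idea behind those tools fits in a few lines. The Python sketch below uses only the standard `inspect` module to turn docstrings into a plain-text reference. The `area` and `perimeter` functions and the `api_reference` helper are invented for the example, and real generators like the ones named above do far more (cross-references, HTML output, custom comment syntax).

```python
import inspect


def area(width, height):
    """Return the area of a rectangle."""
    return width * height


def perimeter(width, height):
    """Return the perimeter of a rectangle."""
    return 2 * (width + height)


def api_reference(functions):
    """Build a plain-text API reference from the functions' docstrings."""
    entries = []
    for func in functions:
        signature = inspect.signature(func)  # e.g. (width, height)
        doc = inspect.getdoc(func) or "(undocumented)"
        entries.append(f"{func.__name__}{signature}\n    {doc}")
    return "\n\n".join(entries)
```

Because the reference is regenerated from the code every time, renaming a parameter or rewording a docstring shows up in the output without a second file to keep in sync.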

User-oriented documentation has slightly different needs. As a programmer you want a system that is easy to learn and is fast to use. You also want to be able to publish it in different formats. At the very least you want to be able to create HTML pages from a template. But you also want the actual source to be human-readable (that’s actually a side-effect of being easy to create) because that’s probably what you, as the creator, will be reading and editing the most.

Then there are documents that are created as part of the design and coding process. This is generally wiki territory. A lot of this is stuff that will be rewritten over and over as time progresses. At the same time, it’s possible that much of this will eventually find its way into the user docs. In this case, ease of use is paramount. You want to get your thoughts and ideas down as quickly as possible so that you can move on to the next thought. Version control is also good to have so that you can see the evolution of the project over time. You might also want some sort of export feature so that you can get a dump of the wiki when necessary.

Personally, I would like to see the user doc and development wikis running as two parts of the same documentation system. Unfortunately, I haven’t found tools that are quite suitable. I would like all the documentation to be part of the same repository where all my code is stored. However, this same documentation needs to be easily exported to decent looking web pages and PDFs and placed online with minimal effort on my part. The editing tools also need to be simple and quick with a minimal learning curve.

There are several free online wiki providers out there such as PBworks and WikiDot which allow the easy creation of good looking wikis. But I’m hesitant to use any of them since there isn’t an easy way to tie them into Git. Another solution is to make use of GitHub’s Pages feature. GitHub lets you host your Git repositories online so that others can get them easily and start hacking on them. The Pages feature allows you to create simple text files with either the Textile or Markdown formatting systems and have them automatically turned into good looking HTML pages. This is a good idea on the whole and the system seems fairly straightforward to use, with some initial setup. The engine behind Pages, called Jekyll, is also free to download and use on your own website (and doesn’t require a Git repository).

In addition to these ‘enterprise-quality’ solutions, there are also a number of smaller, more home-grown solutions (though it could be argued that Jekyll is very homegrown). There’s git-wiki, a simple wiki system written in Ruby using Git as the backend. Ikiwiki is a Git or Mercurial based wiki compiler, in that it takes in pages written in a wiki syntax and creates HTML pages. These are viable solutions if you like to have complete control of how your documentation is generated and stored.

Though each of these is great in and of itself, I still can’t help feeling that there is something missing. In particular, there is no consensus on how documentation should be created and presented. Some projects have static websites, others have wikis, a few have downloadable PDFs. Equally importantly, there isn’t even a moderately common system for creating this documentation. There are all the ways I’ve noted above, which seem to be the most popular. There are also more formal standards like DocBook. Finally, let’s not forget man and info pages. You can also create your own documentation purely by hand using HTML or LaTeX. Contrast this to the way software distribution works (at least in open source): there are binary packages and source tarballs and in many cases some sort of bleeding-edge repository access. There are some exceptions and variations in detail, but by and large things are similar across the board.

Personally, I still can’t make up my mind as to how to manage my own documentation. I like the professional output that LaTeX provides and DocBook seems like a well-thought-out standard, but I’d rather not deal with the formatting requirements, especially in documents that can change easily. I really like wikis for ease of use and anywhere editability, but I must be able to save things to my personal repository and I don’t want to host my own wiki server. I’ve previously just written documentation in plain text files and though this is good for just getting the information down, it’s not really something that can be shown to the outside world. For the time being, I’ll be sticking to my plain text files, but I’m seriously considering using GitHub Pages. For me this offers the dual benefit of easy creation in the form of text files as well as having decent online output for whatever I decide to share. I lose the ability to edit from anywhere via the internet, but that’s a price I’m willing to pay. I can still use Google Docs as an emergency temporary staging area. I’m interested in learning how other developers organize their documentation and would gladly hear any advice. There’s a strong chance that my system will change in some way in the future, but that’s true of any system I might adopt.

Languages, markup and quote unquote

I was at a lecture the other day and I heard the speaker say ‘quote … something something … end quote’. I don’t like it when people say ‘quote … end quote’ to show that they are quoting another source. I think it’s a very ugly way of talking. It seems that if you need to actually say ‘quote … unquote’ to let your listener know when a quotation starts or stops, then you might as well start saying your commas and full stops out loud.

Human to human communication is very interesting, especially when it is live. When two or more people speak there’s a lot more going on than a simple interchange of meaningful words. The speaker’s tone of voice, inflection, the presence and duration of pauses, all combine to give a lot more information than the words on their own could. Good speakers use all these properties of verbal communication to make their words more effective and meaningful. I think that a proper use of vocal tone, rhythm and pausing can make it clear when a quote starts and when it ends. Unfortunately, it seems that most people don’t really think about this. There seems to be a thought that the presence of quotation marks in text means that there needs to be an equivalent expression when speaking. But this idea stems in part from an important fact being forgotten: spoken and written language are not the same thing.

Take a look at any piece of text. This post that you are now reading will do perfectly. When a language is written down, a lot more information goes down than is actually spoken. Punctuation marks are the prime example. They’re special symbols that represent the elements of natural speech that don’t correspond to actual words. Periods and commas represent pauses. Apostrophes represent the omissions that are made as two words are merged to make speech easier or faster. They’re efforts to better approximate the non-verbal parts of speech. But of course they don’t go the whole way. Rhythm, speed and tone are hard to put in written form, at least for a language like English. Humans have been communicating with other humans for thousands of years now and we’re still not perfect at it.

This imperfection isn’t just limited to natural speaking or writing though. Computerized communication and information have their own set of problems. As an example, take the history of markup languages. Wikipedia defines a markup language as “a set of codes that give instructions regarding the structure of a text or how it is to be displayed”. Plain text has always been a sort of ‘natural’ medium for computers to store information. However, simply recording the contents of the text is often not good enough. You want some information to tell different parts of the text apart. You can use this information to control how the text is displayed, or whether it has some special meaning to the computer. The TeX typesetting system uses a markup language to give very precise instructions to a program (called tex) on how different parts of a document should be arranged when printed. The printing can be to paper as originally intended or to some electronic format like PDF.

While TeX is focused on presentation, SGML and its popular derivatives HTML and XML target a different problem: what does the text actually mean? How it looks is handled by stylesheets (CSS for HTML, XSLT for XML), which match the semantic components of the text to a corresponding presentation style. This means that the meaning of the text can be kept separate from presentation and that multiple styles can be applied to the same content. And lest anyone tell you otherwise, presentation is important. Really important. The first iterations of HTML lacked any clearly thought-out presentation rules, leading to a proliferation of custom HTML pseudo-tags and table-based layouts. CSS helped the situation at the cost of layering on yet another language with its own set of standardization requirements. XML is becoming very popular as a data interchange format; however, it is not without its share of flaws. For one, it’s very verbose. Writing any form of XML by hand is not an enjoyable task. It’s also hard to throw together a quick regular expression to make sense of a large XML structure, but this is mitigated in part by the presence of some really good XML manipulation tools.
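
Here is a small Python sketch of that separation, using the standard library’s `xml.etree.ElementTree` as a stand-in for a stylesheet. It is not CSS or XSLT, and the `<note>` document with its `title` and `body` elements is made up for the example. The point is only that one piece of content, marked up by meaning, can be given two different presentations without touching the content itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tags say what the text is, not how it looks.
DOC = "<note><title>Formats</title><body>There are too many.</body></note>"


def as_plain_text(xml_source):
    """One presentation: an upper-case title line followed by the body."""
    root = ET.fromstring(xml_source)
    return f"{root.findtext('title').upper()}\n{root.findtext('body')}"


def as_html(xml_source):
    """Another presentation: the same content as an HTML fragment."""
    root = ET.fromstring(xml_source)
    page = ET.Element("div")
    ET.SubElement(page, "h1").text = root.findtext("title")
    ET.SubElement(page, "p").text = root.findtext("body")
    return ET.tostring(page, encoding="unicode")
```

It also shows why a proper parser beats a quick regular expression here: `findtext` asks for an element by name and does not care about whitespace, attribute order or nesting elsewhere in the document.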

So much for the acronyms. Where were we? Information interchange. We still haven’t figured out the best way to do it, or even agreed on a good way to do it. Inert static data is one thing, but changing active data is quite another. Computer programs are in essence information. In the Lisp programming language there is no need for any differentiation between code and data. Code can be represented very easily as tree-like data structures and manipulated just as easily. This gives Lisp a level of expressive power that is orders of magnitude above any other programming language. Going the other way, you can just as easily transform your data into code. If you can find a way to store your information as Lisp-style S-expressions, then you can turn them into self-manipulating programs later. Of course, no one said it would be easy. Steve Yegge has a slightly dated but very appropriate article on just this topic.
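
For readers who have not seen this in action, here is a rough imitation in Python rather than Lisp. Nested lists stand in for S-expressions, and `evaluate` and `swap_operator` are names I made up for the sketch. It only handles a few arithmetic operators, but it shows the two halves of the idea: a program stored as ordinary data, and another function that rewrites that program before it runs.

```python
import operator

# The operators this toy evaluator understands.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(expr):
    """Evaluate a nested-list expression such as ["+", 1, ["*", 2, 3]]."""
    if isinstance(expr, (int, float)):
        return expr
    op, *args = expr
    values = [evaluate(arg) for arg in args]
    result = values[0]
    for value in values[1:]:
        result = OPS[op](result, value)
    return result


def swap_operator(expr, old, new):
    """Return a copy of the program with operator `old` replaced by `new`."""
    if not isinstance(expr, list):
        return expr
    head, *rest = expr
    head = new if head == old else head
    return [head] + [swap_operator(item, old, new) for item in rest]
```

Because the program is just a list, `swap_operator` needs no parser and no string manipulation. In Lisp the same trick works on real code, which is where macros get their power.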

So how does all this affect you and me? Honestly, I don’t really know. I know that many of the tools we have (both in terms of natural and computer language) are very good, but there certainly isn’t one size that fits all. You really shouldn’t be saying “quote … endquote” when you can change the way your voice sounds to signify the same thing. You shouldn’t be using XML when a simple regexp-parseable solution will do. But you shouldn’t avoid XML if the alternative is writing your own deterministic finite automaton that is soon going to rival an enterprise-strength XML parser in complexity. The problem, as they say, is choice. Make yours carefully.