A partial solution to the format problem

One of my pet peeves is the format problem: there are too many mutually incompatible formats out there for you to choose from if you have some data you want to store and present to a large group of people. Since I don’t use sound or video (or even images) that much, the subset of this problem that I’m concerned with relates to text data.

The problem goes something like this: suppose I have a large-ish chunk of text that I want to send out to a lot of people. How should I do it? First off, we’re going to assume that we will be composing the text electronically on a computer, but we will have to hand it out both electronically and in print. Second, our goal is to have as few distinct copies of the text as possible so that we don’t have to go about editing a bunch of different files if we make a change. For the electronic copies, the end viewers should not have to install any special software to see what we send them.

The most common solution that the average computer user would use is to make a Word document. Like it or not, Word is a mostly universal format for exchanging documents. The chances that someone does not have Word installed are really quite slim. However, just because it’s popular doesn’t mean it’s good. Considering that I’m a Linux and Open Source enthusiast (and so are a lot of the people I communicate with), I can’t depend on them having Word, and I feel slightly guilty about using something proprietary. Also, even if Word were free, there are excellent arguments to be made against using word processors in general, and I agree with them.

Personally, I haven’t used Word for serious document creation for about 3 years now. However, the alternatives aren’t very easy to come up with. It’s taken me a few years, but I’ve finally come to a system that I can use full time. The main realization for me was that I create two main types of text documents: ones that will be printed and given out to professors and other students, and electronic documents that I put on the web and will not print myself. For a long time, I wanted a way to be able to do both in a single shot: I wanted a format where I could create good looking webpages as well as pretty print output. Unfortunately, I haven’t yet found anything that is quite so easy to use.

Instead I’ve settled for two partial solutions. For printing, the good news is that Latex is really the state of the art when it comes to preparing good looking print documents. I’ve never made documents in Word that looked as carefully put together as a Latex document. Sure there’s a learning curve, but it’s one that I’ll happily live with in exchange for great looking documents.

Technically Latex can be converted to HTML. But I’ve never done this because I’ve never really run a website in pure HTML. In the old days, I used Dreamweaver, and as I started blogging I used Blogger and now WordPress. It’s only recently that I started to keep a small static site in HTML. In the process of making this site, I’ve realized that the set of writings I print and distribute and the ones I put online are mostly independent of each other. On my website, for example, I have a page about my computers. That’s something I can’t see myself printing to give to someone else. At the same time, I’ve done a lot of writing this semester that I haven’t put online because it is in a state of flux and I’m not ready to share it with the world at large yet.

It doesn’t make sense for me to write a lot of Latex source for something that I could write using a plain-text markup and then convert to HTML. Also, the advanced typesetting used by Latex is lost in webpages. Lately I’ve started using Markdown to write my webpages in plain text and then generate HTML, which I can insert into templates. This is a very simple solution where the source is easy to read and create, but the output looks good and is simple to generate.
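
As a rough illustration of that workflow, here is a minimal Python sketch. It does not use a real Markdown implementation: `tiny_markdown_to_html` is a hypothetical toy that only understands `#` headings and blank-line-separated paragraphs, and `PAGE_TEMPLATE` is a made-up template, not the one my site uses. The point is only the shape of the pipeline: plain text in, an HTML fragment out, and the fragment dropped into a template.

```python
import html
from string import Template

# Hypothetical page template; $title and $body are filled in by render_page.
PAGE_TEMPLATE = Template(
    "<html><head><title>$title</title></head>\n"
    "<body>\n$body\n</body></html>"
)


def tiny_markdown_to_html(text):
    """Convert a tiny Markdown subset (headings and paragraphs) to HTML.

    This is a toy for illustration, not a replacement for a real converter.
    """
    blocks = []
    # Blocks are separated by blank lines.
    for block in text.strip().split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):
            # Count leading '#' characters to get the heading level (max 6).
            level = min(len(block) - len(block.lstrip("#")), 6)
            content = html.escape(block.lstrip("#").strip())
            blocks.append(f"<h{level}>{content}</h{level}>")
        else:
            # Join hard-wrapped lines into a single paragraph.
            content = html.escape(" ".join(block.split()))
            blocks.append(f"<p>{content}</p>")
    return "\n".join(blocks)


def render_page(title, markdown_text):
    """Insert the generated HTML fragment into the page template."""
    body = tiny_markdown_to_html(markdown_text)
    return PAGE_TEMPLATE.substitute(title=html.escape(title), body=body)


source = "# My computers\n\nA short page about\nthe machines I use."
print(render_page("My computers", source))
```

In practice the toy converter would be swapped for a real Markdown tool; the template step stays the same.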

What I’ve settled on is to use both Latex and Markdown/HTML, but independently of each other. Things that need to get printed, such as papers and later my theses, will be typeset in Latex where I’m guaranteed to have good looking output. Anything else that I put online (such as my site) will be Markdown automatically converted to HTML.

Of course, this isn’t a complete solution at all. On the one hand, some of the things I write in Latex, I might want to put online someday. Then there are things like this blog that are neither Markdown nor Latex, and that is something I haven’t talked about yet. But it’s a good start and it’s something that I find fairly comfortable using right now.

Too many formats

For most of last week, I’ve been evaluating various options for starting a public, personal wiki. I’ve looked at a number of solutions, both large-scale by wiki providers and homegrown using open source tools. I still haven’t made my decision and to tell the truth I am getting a little frustrated at this point. However, if there’s anything that I learned, it’s that there are way too many formats out there.

I’m not talking about large-scale industry standard formats like HD-DVD vs Blu-Ray or something on the level of paper vs digital. I’m talking about much simpler things like the variety of publishing formats and ad-hoc text formats floating around in cyberspace. The lingua franca of the internet is still HTML, at least if you want a simple website as opposed to something running on a content management system. However, if you’ve done any sort of web development, then you’ll know that building websites from scratch using HTML is not fun. It’s angle-bracket markup in the same family as XML, after all, and no one should have to write that by hand.

Even without straight HTML, there’s still a ton of formats to choose from. If you want others to read something you’ve written, what do you choose? You could use a word processor format like .doc or the newer OOXML (.docx), or if you’re more of an open source fan, OpenDocument might be your thing. But it’s a bit harder to spread an office file like that. You can’t just drop it onto a webpage unless you convert it to HTML first. You could email it to people, but that doesn’t scale and most people might not care enough to actually open it. The same arguments go for PDFs.

If you really want a lot of people to read what you’ve written, you want to provide it in a form that’s accessible via a plain browser, hopefully without plugins. You could use something like Scribd’s iPaper, but I think that’s more useful if you have a complicated PDF that you want to show without people having to actually download and open the PDF. It’s a bit overkill for normal text. But we already decided that writing HTML is bad, and so like the hackers we are, we’re going to find a way around it.

You could go the CMS-route and use something like WordPress (for blogs) or MediaWiki (for wikis) or Drupal (for something more versatile). These have the advantage of having a pretty WYSIWYG interface and a whole list of administrative features. But they require you to have your own server or web space or find a free host like WordPress.com. But if you’re someone like me, you want a simpler solution that’s more under your direct control and that you can add on to later as your needs increase. The good news and bad news is that there are a bunch of simpler, human-readable (and writeable) markup formats that can be translated to HTML with fairly good results.

The good news is that these formats are simple and the HTML conversion tools are mostly open source and of good quality. The bad news is that there is a multitude of such markups, all of which are mutually incompatible. The most popular seem to be Markdown, Textile and reStructuredText. All of them have their strengths and weaknesses and I don’t really like any of them. If you can pick one and use it right, you can have a good experience. But you can just as easily push against the boundaries of what they offer and end up being frustrated by their limitations.

The above are all markups that are meant to be translated to HTML at some point, but can be read by people directly. Though they do allow some form of structuring (in the form of HTML headings) and allow inline HTML too, they’re not really all that good for highly structured text, like when you’re trying to make outlines for scholarly papers. There are other text-based tools to do that and my personal favorite is Org-mode for Emacs. It’s a package for Emacs that turns it into a powerful outlining, note-taking and organization tool. It lets you create an outline as a series of headings and subheadings (nested many levels deep) along with plain text and normal lists. The different levels can then be hidden arbitrarily, letting you take a bird’s eye view or just focus on one part. Many people use it for GTD or some other productivity system. I prefer Google Docs and Tasks for that, but Org-mode is a note-taking tool unparalleled in its simplicity and ease of use.

Org-mode uses a simple, custom text format to actually store any notes you make. It’s also human readable, so you can easily copy/paste it into an email and share with others. But without actually using Org-mode, it’s hard to exploit the format to its full potential. It becomes even less ideal when it’s exported to some other format like HTML. The Org-mode concept of headers doesn’t neatly map onto HTML headings. Org-mode encourages header nesting, which looks terrible in HTML unless you carefully lay out a CSS stylesheet for it. You can use the headers more judiciously, preferring to use plain lists, but that defeats the purpose of Org-mode to some extent. It seems to me that you can’t have the best of both worlds.
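
To make the mismatch concrete, here is a small Python sketch of the naive mapping from Org-style headings to HTML headings. It is not how Org-mode’s own exporter works; `org_outline_to_html` is a hypothetical helper that only recognises lines starting with one or more `*` followed by a space. HTML stops at `<h6>`, so anything nested deeper has nowhere to go and gets flattened.

```python
import html
import re

# An Org-style heading: one or more '*' at the start of a line, then a space.
HEADING = re.compile(r"^(\*+) +(.*)$")


def org_outline_to_html(text):
    """Map Org-style headings onto HTML headings; other lines become paragraphs."""
    out = []
    for line in text.splitlines():
        if not line.strip():
            continue
        match = HEADING.match(line)
        if match:
            depth = len(match.group(1))
            # HTML only defines h1 through h6, so deeper levels collapse into h6.
            level = min(depth, 6)
            out.append(f"<h{level}>{html.escape(match.group(2))}</h{level}>")
        else:
            out.append(f"<p>{html.escape(line.strip())}</p>")
    return "\n".join(out)


outline = """\
* Paper
** Introduction
*** Motivation
Some notes on why this matters.
******* A very deep point
"""
print(org_outline_to_html(outline))
```

The seven-star heading comes out as an `<h6>`, indistinguishable from a six-star one, and nothing in the output can be folded away the way it can in the editor.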

At this point any self-respecting hacker will point out that I could just stop whining and start writing converters from one format to another. After all, they all convert to HTML, so it can’t be that hard to convert between them, right? It’s probably not, and the Pandoc system does it to some extent, but the problem isn’t with the formats themselves as much as with the tools that work on them. HTML is good for publishing normal documents, but if you have something with many levels of nesting such as an Org-mode document, you really need something that doesn’t show all the information at once. Unfortunately there isn’t an easy way to do this without resorting to JavaScript or something similar. HTML in its raw form is still a very static format, presentation-wise, and it doesn’t always scale well to complex sets of information.

I’m going to take a break for a moment and think about what an ideal system would be like. It would be based on a simple text format that was explicit about what things meant, so you didn’t need a reference. One of my complaints about Markdown is that anything indented 4 spaces gets treated as a block of preformatted text placed within <pre><code> tags. There’s no way you would guess this by looking at the plain text, and it also makes list nesting and using indentation to present your text very awkward. But I digress. While you could write this text in a plain editor, the preferred way would be in an editor that supported folding and searching and all the other editing niceties. The editor would actually double as the presentation so that there would not be any export or rendering step. And it will probably be on the web. The editor will be in a browser and the actual data will be on a remote server. However, unlike many web services today, export and import to a local form will be a core strength. This way you can pull the bare text out of the editor and send it to others who don’t use the same tools as you without having to go through some silly registration/sign-up step.
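
The Markdown indentation rule I complained about can be sketched in a few lines of Python. This is a toy and not a real Markdown parser: `render_indented_code` is a hypothetical function that implements only this one rule and ignores lists, tabs, and everything else the real syntax handles.

```python
import html


def render_indented_code(text):
    """Toy version of one Markdown rule: lines indented 4+ spaces become code.

    Every other non-blank line is emitted as its own paragraph.
    """
    out = []
    code_lines = []

    def flush_code():
        # Emit any collected code lines as a single <pre><code> block.
        if code_lines:
            body = html.escape("\n".join(code_lines))
            out.append(f"<pre><code>{body}\n</code></pre>")
            code_lines.clear()

    for line in text.splitlines():
        if line.startswith("    "):
            # Strip exactly one level (4 spaces) of indentation.
            code_lines.append(line[4:])
        else:
            flush_code()
            if line.strip():
                out.append(f"<p>{html.escape(line.strip())}</p>")
    flush_code()
    return "\n".join(out)


sample = "Shopping:\n\n    milk\n    eggs\n\nDone."
print(render_indented_code(sample))
```

Someone who indented "milk" and "eggs" just to set them off visually ends up with a code block, which is exactly the kind of surprise the plain text gives no hint of.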

I started this post thinking it would be about the markup, but now at the end it’s turned out to be about the tools as well. HTML is a great medium because it’s so easy to render and produce. However, our data needs are going beyond what a simple document-oriented format can easily support. I don’t think Org-mode or the like will ever become the de facto standard. But the fact that a lot of very smart people choose it over more ‘modern’ Web 2.0 stuff is a testament to the power of simple, easily editable formats. I hope we could standardize around a dual markup system that had a simple human readable form for quick writing and a more snazzy display form. I doubt that’s going to happen any time soon, so till then I’m looking for the perfect set of text tools to store my data and ultimately show it to the world. All this ties in directly to my efforts to have a wiki that’s easy to create, easy to back up and good to look at. More on that later.