Too many formats

For most of last week, I’ve been evaluating various options for starting a public, personal wiki. I’ve looked at a number of solutions, both large-scale by wiki providers and homegrown using open source tools. I still haven’t made my decision and to tell the truth I am getting a little frustrated at this point. However, if there’s anything that I learned, it’s that there are way too many formats out there.

I’m not talking about large-scale industry standard formats like HD-DVD vs Blu-Ray or something on the level of paper vs digital. I’m talking about much simpler things like the variety of publishing formats and ad-hoc text formats floating around in cyberspace. The lingua franca of the internet is still HTML, at least if you want a simple website as opposed to something running on a content management system. However, if you’ve done any sort of web development, then you’ll know that building websites from scratch using HTML is not fun. It’s XML after all, and no one should have to write XML by hand.

Even without straight HTML, there’s still a ton of formats to choose form. If you want others to read something you’ve written, what do you choose? You could use a word processor format like .doc or the newer OOXML (.docx) or if you’re more of an open source fan, OpenDocument might be your thing. But it’s a bit harder to spread an office file like that. You can’t just drop it onto a webpage unless you convert it to HTML first. You could email to people, but that doesn’t scale and most people might not care enough to actually open it. Same arguments go for PDFs.

If you really¬† want a lot of people to read what you’ve written, you want to provide in a form that’s accessible via a plain browser, hopefully without plugins. You could use something like Scribd‘s iPaper, but I think that’s more useful if you have a complicated PDF that you want to show without people having to actually download and open the PDF. It’s a bit overkill for normal text. But we alread decided that writing HTML is bad and so like the hackers we are, we’re going to find a way around it.

You could go the CMS-route and use something like WordPress (for blogs) or MediaWiki (for wikis) or Drupal (for something more versatile). These have the advantage of having a pretty WYSIWYG interface and a whole list of administrative features. But they require you to have your own server or web space or find a free host like WordPress.com. But if you’re someone like me, you want a simpler solution that’s more under your direct control and that you can add on to later as your needs increase. The good news and bad news is that there are a bunch of simpler, human-readable (and writeable) markup formats that can be translated to HTML with fairly good results.

The good news is that these formats are simple and the HTML conversion tools are mostly open source and of good quality. The bad news is that there are a multitude of such markups, all of which are mutually incompatible. The ones that seem most popular seem to be Markdown, Textile and reStructured Text. All of them have their strengths and weaknesses and I don’t really like any of them. If you can pick one and use it right, you can have a good experience. But you can just as easily push against the boundaries of what they offer and end up being frustrated by their limitations.

The above are all markups that are meant to be translated to HTML at some point, but can be read by people directly. Though they do allow some form of structuring (in the form of HTML headings) and allow inline HTML too, they’re not really all that good for highly structured text, like when you’re trying to make outlines for scholarly papers. There are other text-based tools to do that and my personal favorite is Org-mode for Emacs. It’s a package for Emacs that turns it into a powerful outlining, note-taking and organization tool. It lets you create an outline as a series of headings and subheadings (nested many levels deep) along with plain text and normal lists. The different levels can then be hidden arbitrarily letting you take a bird’s eye view or just focus on one part. Many people use it for GTD or some other productivity system. I prefer Google Docs and Tasks for that, but Org-mode is a note taking tool unparalleled in it’s simplicity and ease of use.

Org-mode uses a simple, custom text format to actually store any notes you make. It’s also human readable, so you can easily copy/paste it into an email and share with others. But without actually using Org-mode, it’s hard to exploit the format to it’s full potential. It becomes even less ideal when it’s exported to some other format like HTML. The org-mode concept of headers don’t neatly map onto HTML headings. Org-mode encourages header nesting which looks terrible in HTML unless you carefully lay out a CSS stylesheet for it. You can use the headers more judiciously preferring to use plain lists, but that defeats the purpose of Org-mode to some extent. It seems to me that you can’t have the bost of best worlds.

At this point any self-respecting hacker will point out that I could just stop whining and start writing converters from one format to another. After all, they all convert to HTML, it can’t be that hard to convert between them, right? It’s probably not and the Pandoc system does it to some extent, but the problem isn’t with the formats themselves as much as with the tools that work on them. HTML is good for publishing normal documents, but if you have something with many levels of nesting such as an Org-mode document, you really need something that doesn’t show all the information at once. Unfortunately there isn’t an easy way to do this without resorting to JavaScript or something similar. HTML in it’s raw form is still a very static format, presentation wise and it doesn’t always scale well to complex sets of information.

I’m going to take a break for a moment and think about what an ideal system would be like. It would be based on a simple text format that was explicit about what things meant, so you didn’t need a reference. One of my complaints about Markdown is that anything indenting 4 spaces gets treated as a block of unformatted HTML placed withing <pre><code> tags. There’s no way you would guess this by looking at the plain text and it also makes list nesting and using indentation to present your text very awkward. But I digress. While you could write this text in a plain editor, the preferred way would be in an editor that supported folding and searching and all the other editing niceties. The editor would actually double as the presentation so that there would be any export or rendering step. And it will probably be on the web. The editor will be in a browser and the actual data will be on a remote server. However, unlike many web services today, export and import to a local form will be a core strength. This way you can pull the bare text out of the editor and send to others who don’t use the same tools as you without having to go through some silly registration/sign-up step.

I started this post thinking it would be about the markup, but now at the end its turned out to be about the tools as well. HTML is a great medium because it’s so easy to render and produce. However, our data needs are going beyond what a simple document-oriented format can easily supported. I don’t think Org-mode or the like will ever become the de facto standard. But the fact that a lot of very smart people choose it over mode ‘modern’ Web 2.0 stuff is a testament to the power of simple, easily editable formats. I hope we could standardize around a dual markup system that had a simple human readable form for quick writing and a more snazzy display form. I doubt that’s going to happen any time soon, so till then I’m looking for the perfect set of text tools to store my data and ultimately show it to the world. All this ties in directly to my efforts to having a wiki that’s easy to create, easy to back up and good to look at. More on that later.

About these ads

5 thoughts on “Too many formats

  1. ikiwiki is a nice option if you go the Markdown way…
    also it can be extended by plugins, so if you have
    some kind of special data you could write an exporter
    plugin for that. Plus it stores the stuff in a VCS.

  2. Pingback: Interesting Emacs Links – 2009 Week 25 « A Curious Programmer

  3. Since you mention Org mode, I take it you have used (or taken a look at) muse-mode for emacs?

    The format is markdown (somewhat), but you don’t have the usual indentation problems. In addition, there are tags for raw code and examples. I hear it does Wikis too, although this is best maintained locally.

    Just a suggestion.

  4. @Abhishek: I actually have looked at ikiwiki. I’m thinking about trying it out on my personal server once I get back to college.

    @Karthik: I’ve heard of muse-mode but haven’t really looked at it. I’m coming to realize that HTML by itself may not be very suited for highly structured data. But for normal writing Markdown seems pretty good.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s