Using Regular Expressions and Python for text manipulation

The modern word processor, while adequate for most people’s uses, sometimes falls woefully short when it comes to manipulating large amounts of text in complex ways, especially when that text is not consistently formatted. More than once I’ve heard people pine for the good old days of plain text files and typesetters rather than today’s word processors, but only recently have I experienced such a situation first-hand.

I’m currently working on a project that involves transferring a large amount of data (several years’ worth) from Microsoft Word files into a spreadsheet. Unfortunately, the data is very badly formatted. While everything looks organized into tables and columns, the organization is only superficial. Someone evidently spent a long time manually hitting the tab key until everything looked lined up. OpenOffice.org Writer can easily convert text to a table, but because the tab spacing is not uniform, the result is thoroughly unusable. Without an easy way to get the data into a spreadsheet, I was faced with the truly horrifying prospect of manually cutting and pasting over 50 pages’ worth of data. Now, that’s a dull and repetitive task, right? And computers are good at dull, repetitive tasks. So let’s get the computer to do it. Since an existing program like OpenOffice couldn’t do it, I would just have to write my own.

Luckily, two extremely useful pieces of technology came to my aid: Python’s built-in file and text manipulation functions, and a little gem called regular expressions, which most of the world is sadly unaware of. Python makes it extremely easy to read from and write to a file, and it also has functions that let you break apart and join together strings of any length. My little problem didn’t so much involve joining strings as it did replacing certain parts of them. Specifically, I wanted each uneven run of tabs to be replaced with a single special character (in this case a |) so that OpenOffice.org Calc could then open the result as a data file with the columns separated by that character. When it comes to searching for and replacing patterns such as a run of tabs after a word, you can’t get much better than regular expressions.

Regular expressions are far too vast a topic to fully explain in a blog post, but the essentials are pretty simple. You describe the pattern you want to find in a special notation: a regular expression. I’m searching for one or more tabs in a row, which is written as \t+. The expression then goes to a regular expression engine, along with the text string in which you want to find the pattern. Depending on the engine in question, you can do a number of things, including counting occurrences of the pattern, breaking the string at those points, or replacing each match with something else. Python’s re module provides a number of functions for working with regular expressions. Once I use Python’s file reading functions to get one line of text at a time, all I need is a simple call to re.sub(r"\t+", "|", line), where the arguments are the pattern to look for, the replacement, and finally the string in which to search.
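To make the three operations concrete, here is a minimal sketch using Python’s re module. The sample line and the helper name tabs_to_pipe are made up for illustration; they are not taken from the actual data or from my script.

```python
import re

# One or more consecutive tab characters.
TAB_RUN = re.compile(r"\t+")


def tabs_to_pipe(line):
    """Replace every run of tabs in `line` with a single '|'."""
    return TAB_RUN.sub("|", line)


# Hypothetical sample line with uneven tab spacing between the columns.
sample = "Name\t\t\tDepartment\tYear\t\t2004"

# Replacing: each run of tabs collapses to one delimiter.
print(tabs_to_pipe(sample))        # Name|Department|Year|2004

# Breaking the string at the pattern instead of replacing it.
print(TAB_RUN.split(sample))       # ['Name', 'Department', 'Year', '2004']

# Counting how many runs of tabs the line contains.
print(len(TAB_RUN.findall(sample)))  # 3
```

Compiling the pattern once with re.compile is optional here; calling re.sub directly with the same arguments gives the same result.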

So, something as simple but powerful as regular expressions turns my 50-page copy-and-paste marathon into a simple case of saving the Word documents as plain text files, running them through a 10-line Python program (I do some more text manipulation, which is simple enough for Python’s string functions to handle) and then importing the result into Calc using the filter for CSV files. Combine this with piping in UNIX shells, and you can replace a completely manual process with a mostly automated job at the cost of only a small amount of mental effort and time.
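The program itself isn’t reproduced here, but a script along these lines might look roughly like the sketch below. Everything in it is an assumption made for illustration: the script name tabs_to_pipe.py, the choice to skip blank lines and trim trailing whitespace, and the UTF-8 encoding of the saved text files.

```python
import re
import sys

# One or more consecutive tab characters.
TAB_RUN = re.compile(r"\t+")


def convert(lines):
    """Yield one '|'-delimited row per non-blank input line."""
    for line in lines:
        line = line.rstrip()  # drop the newline and any trailing tabs or spaces
        if line:              # assumption: blank lines carry no data
            yield TAB_RUN.sub("|", line)


def main(argv):
    """Convert each file named on the command line and print the rows."""
    if len(argv) < 2:
        sys.stderr.write("usage: python tabs_to_pipe.py input.txt [more.txt ...]\n")
        return
    for path in argv[1:]:
        # Assumption: the Word documents were saved as UTF-8 plain text.
        with open(path, encoding="utf-8") as handle:
            for row in convert(handle):
                print(row)


if __name__ == "__main__":
    main(sys.argv)
```

Because the rows go to standard output, the script fits into a shell pipeline or a redirect, for example python tabs_to_pipe.py report.txt > report.csv (file names hypothetical). When importing into Calc, choose | as the separator.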

Finally some real work

I’ve been enjoying college life so much for the last two months that blogging was completely out of my mind. But now that the excitement has died down a bit, I think it’s time that I started blogging regularly. Though I’ve always been interested in computers and I’ve always loved programming, I never really had the chance to do any real work. Sure, I’ve done work for CS courses and tried my hand at making a personal website, but nothing really serious.

However, that’s changed as a direct result of coming to college. One of my class projects is to make a web biography of an alumnus of our college who went on to win a Nobel Prize. As it turns out, I’m the only one in the class (including the professor) who has even the faintest idea of how to go about setting up a website. So I get to be de facto webmaster. I only started really playing around with HTML and CSS earlier this year, so it’s good to have a real project to work on. Though I don’t intend to be a web designer for a living, it is one of those skills that every self-respecting computer scientist should possess. The project is still in the early stages, but I already have a template designed, and I’ve asked for space on the college servers to put it up and do proper testing. I’ll post a link here once it’s online.

Secondly, I’ve started working with a professor who is the Director of Institutional Research. Part of my job involves routinely searching the websites of other colleges for all sorts of information. Of course, Google is my friend, but manually pointing Google to websites and typing in search terms is dull, repetitive work, the sort of work that we have computers to do. So I need some way to search all those websites and see the results in one go. I had initially thought about using a Python program to query Google and then process the results into something readable. Unfortunately, the only way to query the Google engine is by using JavaScript, preferably embedded in a web page. This means that I can either attempt to create my own search engine using Python, or learn JavaScript to build upon Google. Both options would have been quite challenging and would have taught me a lot, but I decided to learn JavaScript. I wasn’t quite sure I was up to the task of writing a proper search algorithm, and though I certainly could have learnt what I needed to, that isn’t an investment I feel like making right now. And since we’re now living in the age of Web 2.0 and AJAX goodness, JavaScript would be a good skill to pick up.

Two new projects and quite a bit to learn. And my normal classes along with that. Throw blogging regularly into the mix and things might get nasty, but I’m ready for it. I promised myself that I would get the most out of my college years and it makes sense to start early. See you all tomorrow.