Using Regular Expressions and Python for text manipulation

The modern word processor, while often adequate for most people’s uses, sometimes falls woefully short when it comes to manipulating large amounts of text in complex ways. Especially when that text is not consistently formatted. I’ve heard more than once people craving for the good old days of text files and typesetters rather than today’s word processors, I’ve only recently had a first hand experience of such a situation.

I’m currently involved in a project which involves transferring a large amount (several years worth) of data from Microsoft Word files into spreadsheet. Unfortunately the data is very badly formatted. While everything looks organized, into tables and columns, the organization is only superficial. Someone evidently spent a long time manually hitting the tab key so that everything looked lined up. Even though OpenOffice.org Writer can easily convert text to a table, the fact that the tab spacing is not uniform, means that everything comes out thoroughly unusable. Not having an easy way to get the data into a spreadsheet, I was faced with the truly horrifying prospect of manually cutting and pasting over 50 pages worth of data. Now, that’s a dull and repetitive task, right? And computers are good at dull, repetitive task. So let’s get the computer to do it. Since an existing program like OpenOffice couldn’t do it, I would just have to write my own.

Luckily, two extremely useful pieces of technology came to my aid. Python’s built in file and text manipulation functions and a little gem called Regular Expressions which most of the world is sadly unaware of. Python makes it extremely easy to read and write to a file and it also has functions that allow you to break apart and join together strings of any length. My little problem didn’t so much involve joining strings, as it did replacing certain parts of them. Specifically, I wanted the uneven number of tabs to be replaced with a single special character (in this case a |) so that OpenOffice.org Calc could then open it as a data file with the columns separated by that character. When it comes to searching for and replacing certain patterns (a number of tabs after a word), you can’t get much better than regular expressions.

Regular Expressions are far too vast a topic to fully explain in a blog post. But the essentials are pretty simple. You specify a certain pattern which you want to find in a special way: a regular expression. I’m searching for a large number of tabs, one after the other, which is represented by \t+. Your expression then goes to a regular expression engine as well as the text string in which you want to find the pattern. Depending on the engine in questions you can do a number of things including counting occurrences of the string, breaking the string at those points, or replacing that pattern with something else. Python’s re module provides a number of functions to use with regular expressions. Once I utilize Python’s file reading functions to get a line of text at a time, all I need is a simple re.sub(“\t+”, ‘|’, string) function, where the arguments are the pattern to look for, the replacement and finally the string in which to search.

So, the use of something as simple, but powerful like regular expressions turns my 50-page copy paste marathon into a simple case of saving the Word documents as plain text files, running them through a 10-line Python program (I do some more text manipulation which is simple enough for Python’s string functions to handle) and then importing that into Calc using the filter for CSV files. combine this with piping in UNIX shells, and you can replace a completely manual process with mostly automated job at the cost of only a small amount of mental effort and time investment.