Using Regular Expressions and Python for text manipulation

The modern word processor, while often adequate for most people’s uses, sometimes falls woefully short when it comes to manipulating large amounts of text in complex ways. Especially when that text is not consistently formatted. I’ve heard more than once people craving for the good old days of text files and typesetters rather than today’s word processors, I’ve only recently had a first hand experience of such a situation.

I’m currently involved in a project which involves transferring a large amount (several years worth) of data from Microsoft Word files into spreadsheet. Unfortunately the data is very badly formatted. While everything looks organized, into tables and columns, the organization is only superficial. Someone evidently spent a long time manually hitting the tab key so that everything looked lined up. Even though OpenOffice.org Writer can easily convert text to a table, the fact that the tab spacing is not uniform, means that everything comes out thoroughly unusable. Not having an easy way to get the data into a spreadsheet, I was faced with the truly horrifying prospect of manually cutting and pasting over 50 pages worth of data. Now, that’s a dull and repetitive task, right? And computers are good at dull, repetitive task. So let’s get the computer to do it. Since an existing program like OpenOffice couldn’t do it, I would just have to write my own.

Luckily, two extremely useful pieces of technology came to my aid. Python’s built in file and text manipulation functions and a little gem called Regular Expressions which most of the world is sadly unaware of. Python makes it extremely easy to read and write to a file and it also has functions that allow you to break apart and join together strings of any length. My little problem didn’t so much involve joining strings, as it did replacing certain parts of them. Specifically, I wanted the uneven number of tabs to be replaced with a single special character (in this case a |) so that OpenOffice.org Calc could then open it as a data file with the columns separated by that character. When it comes to searching for and replacing certain patterns (a number of tabs after a word), you can’t get much better than regular expressions.

Regular Expressions are far too vast a topic to fully explain in a blog post. But the essentials are pretty simple. You specify a certain pattern which you want to find in a special way: a regular expression. I’m searching for a large number of tabs, one after the other, which is represented by \t+. Your expression then goes to a regular expression engine as well as the text string in which you want to find the pattern. Depending on the engine in questions you can do a number of things including counting occurrences of the string, breaking the string at those points, or replacing that pattern with something else. Python’s re module provides a number of functions to use with regular expressions. Once I utilize Python’s file reading functions to get a line of text at a time, all I need is a simple re.sub(“\t+”, ‘|’, string) function, where the arguments are the pattern to look for, the replacement and finally the string in which to search.

So, the use of something as simple, but powerful like regular expressions turns my 50-page copy paste marathon into a simple case of saving the Word documents as plain text files, running them through a 10-line Python program (I do some more text manipulation which is simple enough for Python’s string functions to handle) and then importing that into Calc using the filter for CSV files. combine this with piping in UNIX shells, and you can replace a completely manual process with mostly automated job at the cost of only a small amount of mental effort and time investment.

Advertisements

Published by

Shrutarshi Basu

Programmer, writer and engineer, currently working out of Cornell University in Ithaca, New York.

4 thoughts on “Using Regular Expressions and Python for text manipulation”

  1. I have a unix background and spent sometime learning how to use vi and vim a clone of the vi editor. What is good about these editors is that they are scriptable and have powerful regular expression capabilities that help with problems like yours. You can get most word processors to dump a plain text form of a document, open the text in vim, and then use vims search and replace functionality. The advantage over processing with Python would be that you get qfeedback of what your script has done in the editor , rather than having to print a string.

    All-in-all though Python and other dynamic languages make fine text manipulation languages but you get payback from adding a good text editor to your arsenal such as vim (www.vim.org).

    – Paddy.

  2. By what I have seen, Openoffice has regular expressions on its find and replace functions… just open the “advanced” options in the “find” dialog box

  3. Also, for those who want to extract just the text with
    minimal formatting from Word Documents with a lot of data,
    antiword is a good program to use. Just do antiword (name of doc file)
    and it dumps the text to the terminal.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s