Saturday, June 13, 2009

Automate text processing with string methods

Have you ever had the need to reorganize a text file? Say you have some file that is organized in the following way:

Variable1
20
Variable2
30
....

Say you would like to reorganize that into something more spreadsheet-like, such as

Variable1 20
Variable2 30
---

Easy enough by hand in Excel or OOo calc if there are a couple of values. What if your list contains 20.000 entries? Then a script is your hero. Let's make a command line tool to reorganize such files using Python. We want the script to work like this on the command line:

>> reorganize inputfile.txt outputfile.txt

Therefore, we need to import the sys.argv list (containing command line arguments). Next, we need to use file methods to read the contents of file A into a list that we can operate on. The start of our script is thus:

#!/usr/bin/env python

from sys import argv as filenames

inputfile=open(str(filenames[1]),'r')
inputlist=inputfile.readlines()
inputfile.close()

Now we have a Python list like this:

inputlist=['Variable1\n','20\n','Variable2\n','30\n']

where the '\n' thing means line break. We cannot just write this directly to a new file, we need to strip the '\n' from list elements with even index numbers (indexing starting at zero) and we need to add '\t' to the same list elements to get a tab-separated file easily. An easy way to do this is the following; make two lists, one holding the "names" and one holding the "values". While making these lists, remove all line breaks using the replace method of string objects and replace them with tab characters ('\t'). The code:

ColA=[]; ColB=[];
for k in range(0,len(inputlist)/2):
ColA.append(inputlist[2*k].replace('\n','\t')
ColB.append(inputlist[2*k+1]

Then, simply write to the output file:

outputfile=open(filenames[2],'w')
for k in range(0,len(ColA)):
outputfile.write(ColA[k]+ColB[k])

outputfile.close()

Putting these commands together yields a script that gives you the desired result... (This is a tool I use every day - text processing is so much simpler for a scripter than for the average Excel user). Note that it is possible to do this directly in a spreadsheet using macros. I might consider writing the solution to that one day to. Or not.. :-)

2 comments:

  1. your blog eats whitespace at the beginning, mate.

    Makes it hard to read, and guarantees that people copy/pasting from the site won't make it work.

    ReplyDelete
  2. Or you could do:
    cat test.txt | awk '{x=$1;getline;y=$1;print x" "y;}' > test2.txt

    ReplyDelete