Text processing in Python
Tips for beginners
What is Python?
Python is a freely available, very high-level, interpreted language developed by Guido van Rossum. It combines a clear syntax with powerful (but optional) object-oriented semantics. Python is widely available and highly portable.
Strings — immutable sequences
As in most higher-level programming languages, variable length strings are a basic type in Python. Python allocates memory to hold strings (or other values) "behind the scenes" where the programmer does not need to give much thought to it. Python also has several string-handling features that do not exist in other high-level languages.
In Python, strings are "immutable sequences." A program can refer to elements or subsequences of strings just as with any sequence, although strings, like tuples, cannot be modified "in place." Python refers to subsequences with the flexible "slice" operation, whose format resembles a range of rows or columns in a spreadsheet. The following interactive session illustrates the use of strings and slices:
Strings and slices
>>> s = "mary had a little lamb"
>>> s[0]          # index is zero-based
'm'
>>> s[0] = 'x'    # changing element in-place fails
Traceback (innermost last):
  File "<stdin>", line 1, in ?
TypeError: object doesn't support item assignment
>>> s[11:18]      # 'slice' a subsequence
'little '
>>> s[:4]         # empty slice-begin assumes zero
'mary'
>>> s[4]          # index 4 is not included in slice [:4]
' '
>>> s[5:-5]       # can use "from end" index with negatives
'had a little'
>>> s[:5]+s[5:]   # slice-begin & slice-end are complementary
'mary had a little lamb'
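The same slicing rules can be sketched as a short script. This is only an illustration and assumes a Python 3 interpreter (where print is a function) rather than the interpreter used in the session above; the variable names are made up for the example.

```python
s = "mary had a little lamb"

first = s[0]               # indexing is zero-based
word = s[:4]               # an omitted slice start defaults to 0
middle = s[5:-5]           # a negative index counts from the end
rebuilt = s[:5] + s[5:]    # the two halves recombine to the original

# Strings are immutable, so item assignment raises TypeError
try:
    s[0] = "x"
    assignment_failed = False
except TypeError:
    assignment_failed = True

print(first, word, middle, rebuilt == s, assignment_failed)
```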
Another powerful string operation is the simple in keyword. This offers two intuitive and useful constructs:
The in keyword
>>> s = "mary had a little lamb"
>>> for c in s[11:18]: print c,   # print each char in slice
...
l i t t l e
>>> if 'x' in s: print 'got x'    # test for char occurrence
...
>>> if 'y' in s: print 'got y'    # test for char occurrence
...
got y
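As a rough sketch of the two constructs, assuming Python 3, the iteration and membership forms of in can be written as follows. The substring test at the end is a later extension of the character test shown above; the variable names are illustrative.

```python
s = "mary had a little lamb"

# Iteration: visit each character of a slice in order
chars = [c for c in s[11:18]]

# Membership: test whether a character occurs in the string
has_x = "x" in s
has_y = "y" in s

# Modern Python also accepts a whole substring on the left of `in`
has_lamb = "lamb" in s

print(chars, has_x, has_y, has_lamb)
```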
There are several ways to compose string literals in Python. You can use single or double quotes as long as opening and closing tokens match, and there are other variations on quoting that are frequently useful. If your string contains line breaks or embedded quotes, triple-quoting is an easy way to define a string, as in the following example:
The use of triple quotes
>>> s2 = """Mary had a little lamb
... its fleece was white as snow
... and everywhere that Mary went
... the lamb was sure to go"""
>>> print s2
Mary had a little lamb
its fleece was white as snow
and everywhere that Mary went
the lamb was sure to go
Both single-quoted and triple-quoted strings may be preceded by the letter "r" to indicate that backslash escape sequences should not be interpreted by Python, which is especially convenient when writing regular expressions. For example:
>>> s3 = "this \n and \n that"
>>> print s3
this
 and
 that
>>> s4 = r"this \n and \n that"
>>> print s4
this \n and \n that
In "r-strings," the backslash that might otherwise compose an escaped character is treated as a regular backslash. This is further explained in a later discussion of regular expressions.
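One way to see the difference is to compare the two strings character by character. The following minimal sketch assumes Python 3; it shows that the raw string keeps each backslash and "n" as two separate characters, so it is two characters longer for every escape.

```python
s3 = "this \n and \n that"     # each \n is a single newline character
s4 = r"this \n and \n that"    # each \n is a backslash followed by 'n'

newlines_in_s3 = s3.count("\n")
backslashes_in_s4 = s4.count("\\")
extra_chars = len(s4) - len(s3)

print(newlines_in_s3, backslashes_in_s4, extra_chars)
```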
Files and string variables
.readline() and .readlines() are very similar. They are both used in constructs like:
fh = open('c:\\autoexec.bat')
for line in fh.readlines():
    print line
The difference between .readline() and .readlines() is that the latter, like .read() , reads an entire file at once. .readlines() automatically parses the file's contents into a list of lines that can be processed by Python's for ... in ... construct. .readline() , on the other hand, reads just a single line at a time, and is generally much slower than .readlines() . .readline() should be used only if there might not be enough memory to read the entire file at once.
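The contrast between the two methods can be sketched as follows, assuming Python 3. The temporary file and its three lines are invented so that the example is self-contained; they are not part of the original discussion.

```python
import os
import tempfile

# Create a small throwaway file (hypothetical contents)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first line\nsecond line\nthird line\n")
    path = tmp.name

# .readlines() reads the whole file at once into a list of lines
with open(path) as fh:
    all_lines = fh.readlines()

# .readline() returns one line per call, and '' at end of file
one_at_a_time = []
with open(path) as fh:
    while True:
        line = fh.readline()
        if not line:
            break
        one_at_a_time.append(line)

os.remove(path)
print(all_lines == one_at_a_time)
```

Both loops see the same lines; the difference is only in how much of the file is held in memory at once.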
If you are using a standard module that deals with files, you can turn a string into a "virtual file" by using the cStringIO module (the StringIO module can be used instead in cases where subclassing the module is required, but a beginner is unlikely to need to do this). For example:
The cStringIO module
>>> import cStringIO
>>> fh = cStringIO.StringIO()
>>> fh.write("mary had a little lamb")
>>> fh.getvalue()
'mary had a little lamb'
>>> fh.seek(5)
>>> fh.write('ATE')
>>> fh.getvalue()
'mary ATE a little lamb'
Keep in mind, however, that unlike a real file, a cStringIO "virtual file" is not persistent. It will be gone when the program finishes if you do not take steps to save it (such as writing it to a real file, or using the shelve module or a database).
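A minimal sketch of saving such a "virtual file" is shown here. It assumes Python 3, where the cStringIO module no longer exists and io.StringIO plays the same role; the temporary file used as the "real file" is purely illustrative.

```python
import io
import os
import tempfile

fh = io.StringIO()
fh.write("mary had a little lamb")
fh.seek(5)          # move to offset 5, the start of "had"
fh.write("ATE")     # overwrites three characters in place
contents = fh.getvalue()

# Persist the in-memory contents by copying them to a real file
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as real:
    real.write(contents)
    saved_path = real.name

# Read the file back to confirm that the text survived
with open(saved_path) as check:
    saved = check.read()
os.remove(saved_path)

print(saved)
```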
Standard module: string
A general rule-of-thumb is that if you can do a task using the string module, that is the right way to do it. In contrast to re (regular expressions), string functions are generally much faster, and in most cases they are easier to understand and maintain. Third-party Python modules, including some fast ones written in C, are available for specialized tasks, but portability and familiarity still suggest sticking with string whenever possible. There are exceptions, but not as many as you might think if you are used to other languages.
The string module contains several kinds of objects, such as functions and classes; it also defines a number of common constants as strings. For example:
Example 1 of string use
>>> import string
>>> string.whitespace
'\011\012\013\014\015 '
>>> string.uppercase
'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
Although you could write these constants by hand, the string versions more or less ensure that your constants will be correct for the national language and platform your Python script is running on.
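As an illustration of putting these constants to work, the sketch below classifies characters by membership in them. It assumes Python 3, where string.uppercase was replaced by string.ascii_uppercase (which covers only the ASCII letters, regardless of locale); the classify function is a hypothetical helper.

```python
import string

def classify(ch):
    """Name the category of a single character (hypothetical helper)."""
    if ch in string.whitespace:
        return "whitespace"
    if ch in string.ascii_uppercase:
        return "uppercase"
    if ch in string.digits:
        return "digit"
    return "other"

kinds = [classify(ch) for ch in "A 1\tz"]
print(kinds)
```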
string also includes functions that transform strings in common ways (which you can combine to form several uncommon transformations). For example:
Example 2 of string use
>>> import string
>>> s = "mary had a little lamb"
>>> string.capwords(s)
'Mary Had A Little Lamb'
>>> string.replace(s, 'little', 'ferocious')
'mary had a ferocious lamb'
There are many other transformations that are not specifically illustrated here; you can find the details in a Python manual.
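The idea of combining transformations can be sketched as below. This assumes Python 3, where string.capwords() is still a module function but replacement is done with the str.replace() method instead of string.replace().

```python
import string

s = "mary had a little lamb"

title = string.capwords(s)                  # capitalize each word
fierce = s.replace("little", "ferocious")   # substitute a substring

# Transformations compose: replace first, then capitalize the result
combined = string.capwords(s.replace("lamb", "wolf"))

print(title, fierce, combined, sep="\n")
```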
You can also use string functions to report string attributes such as length or position of a sub-string, for example:
Example 3 of string use
>>> import string
>>> s = "mary had a little lamb"
>>> string.find(s, 'had')
5
>>> string.count(s, 'a')
4
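In Python 3 the same reports are available as string methods. The short sketch below assumes that version and also shows the value returned by find() when the substring is absent.

```python
s = "mary had a little lamb"

length = len(s)               # total number of characters
position = s.find("had")      # index of the first occurrence
missing = s.find("zebra")     # -1 signals "not found"
occurrences = s.count("a")    # non-overlapping occurrences

print(length, position, missing, occurrences)
```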
Finally, string provides a very Pythonic oddball. The pair .split() and .join() provide a quick way to convert between strings and lists, something you will find remarkably useful. Usage is straightforward:
Example 4 of string use
>>> import string
>>> s = "mary had a little lamb"
>>> L = string.split(s)
>>> L
['mary', 'had', 'a', 'little', 'lamb']
>>> string.join(L, "-")
'mary-had-a-little-lamb'
Of course, in real life, you would be likely to do something else with a list besides .join() it right back together (probably something involving our familiar for ... in ... construct).
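A sketch of that more realistic pattern follows: split the string, process the words in a for loop, then join the results. It assumes Python 3, where split() and join() are string methods; the particular processing is invented for illustration.

```python
s = "mary had a little lamb"
words = s.split()            # split on runs of whitespace

# Do something with each word before putting the text back together
lengths = []
for w in words:
    lengths.append((w, len(w)))

longest = max(words, key=len)
rejoined = "-".join(w.upper() for w in words)

print(lengths, longest, rejoined)
```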
Standard module: re
The re module obsoletes the regex and regsub modules that you may see used in some older Python code. While there are still a few, limited advantages to regex , they are minor and it is not worth using in new code. The obsolete modules are likely to be dropped from future Python releases, and 1.6 will probably have an improved interface-compatible re module. So stick with re for regular expressions.
Regular expressions are complicated. One could write a book on this topic, and in fact, a number of people have! This article will try to capture the "gestalt" of regular expressions, and let the reader take it from there.
A regular expression is a concise way of describing a pattern that might occur in a text. Do certain characters occur? In a particular order? Are sub-patterns repeated a given number of times? Do other sub-patterns exclude a match? Conceptually, this is not unlike the way you would intuitively describe a pattern in a natural language. The trick is encoding this description in the compact syntax of regular expressions.
When approaching a regular expression, treat it as its own programming problem, even though only one or two lines of code may be involved; those lines effectively form a small program.
Start with the smallest pieces. At its lowest level, any regular expression will involve matching particular "character classes." The simplest character class is a single character, which is just included in the pattern as a literal. Frequently, you will want to match a class of characters. You can indicate a class by surrounding it in square brackets; within the brackets, you can have both a set of characters and character ranges indicated with a dash. You can also use a number of named character classes that will be accurate for your platform and national language. Some examples:
>>> import re
>>> s = "mary had a little lamb"
>>> if re.search("m", s): print "Match!"        # char literal
...
Match!
>>> if re.search("[@A-Z]", s): print "Match!"   # char class
...                     # match either at-sign or capital letter
...
>>> if re.search("\d", s): print "Match!"       # digits class
...
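The same three tests can be sketched in script form, assuming Python 3. The extra vowel class at the end is an added illustration of a bracketed set and does not come from the session above.

```python
import re

s = "mary had a little lamb"

# re.search() returns a match object or None
literal = re.search("m", s) is not None              # single literal char
at_or_capital = re.search("[@A-Z]", s) is not None   # set plus a range
digit = re.search(r"\d", s) is not None              # named digit class

# A bracketed set of characters; findall() lists every match
vowels = re.findall("[aeiou]", s)

print(literal, at_or_capital, digit, vowels)
```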
You can think of character classes as the "atoms" of regular expressions, and you will usually want to group those atoms into "molecules." You can do this with a combination of grouping and repetition. Grouping is indicated by parentheses: any sub-expression contained in parentheses is treated as if it were atomic for the purpose of further grouping or repetition. Repetition is indicated by one of several operators: "*" means "zero or more"; "+" means "one or more"; "?" means "zero or one". For example, look at the expression:
ABC([d-w]*\d\d?)+XYZ
For a string to match this expression, it must contain something that starts with "ABC" and ends with "XYZ" — but what must it have in the middle? The middle sub-expression is ([d-w]*\d\d?) followed by the "one or more" operator. So the middle of the string must consist of one (or two, or one thousand) things matching the sub-expression in parentheses. The string "ABCXYZ" will not match, because it does not have the requisite stuff in the middle.
Just what is this inner sub-expression? It starts with zero or more letters in the range d-w. It is important to notice that zero letters is a valid match, which may be counter-intuitive if you use the English word "some" to describe it. Next the string must have exactly one digit; then zero or one additional digits. (The first digit character class has no repetition operator, so it simply occurs once. The second digit character class has the "?" operator.) In short, this translates to "either one or two digits." Some strings matched by the regular expression are:
ABC1234567890XYZ
ABCd12e1f37g3XYZ
ABC1XYZ
These are a few expressions not matched by the regular expression (try to think through why these do not match):
ABC123456789dXYZ
ABCdefghijklmnopqrstuvwXYZ
ABcd12e1f37g3XYZ
ABC12345%67890XYZ
ABCD12E1F37G3XYZ
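The two lists can be checked mechanically. The sketch below assumes Python 3 and takes the expression under discussion to be ABC([d-w]*\d\d?)+XYZ, which is what the description of its three parts implies; the test strings are the ones listed above, plus "ABCXYZ" from the earlier paragraph.

```python
import re

# "ABC", then one or more groups of (letters d-w, one or two digits), then "XYZ"
pattern = re.compile(r"ABC([d-w]*\d\d?)+XYZ")

should_match = ["ABC1234567890XYZ", "ABCd12e1f37g3XYZ", "ABC1XYZ"]
should_fail = [
    "ABC123456789dXYZ",            # trailing 'd' has no digit after it
    "ABCdefghijklmnopqrstuvwXYZ",  # no digits at all
    "ABcd12e1f37g3XYZ",            # lowercase 'c' breaks the "ABC" prefix
    "ABC12345%67890XYZ",           # '%' fits neither class
    "ABCD12E1F37G3XYZ",            # capitals are outside the range d-w
    "ABCXYZ",                      # the middle group must occur at least once
]

matched = [bool(pattern.search(text)) for text in should_match]
rejected = [pattern.search(text) is None for text in should_fail]

print(matched, rejected)
```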
It takes a bit of practice to get used to creating and understanding regular expressions. Once you have mastered regular expressions, however, you will have a great deal of expressive power at your disposal. That said, it is often easy to jump into using a regular expression to solve a problem that could actually be solved using simpler (and faster) tools, such as string.
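As a small illustration of that last point, the sketch below (assuming Python 3) performs two simple tests both ways. The plain string operations give the same answers as the regular expressions and are arguably easier to read; the particular tests are chosen only for illustration.

```python
import re

s = "mary had a little lamb"

# Literal substring test: a plain string operation is enough
plain_contains = "little" in s
re_contains = re.search("little", s) is not None

# Prefix test: startswith() versus a match anchored at the start
plain_prefix = s.startswith("mary")
re_prefix = re.match("mary", s) is not None

print(plain_contains == re_contains, plain_prefix == re_prefix)
```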