Charming Python
Parsing with the SimpleParse module
A concise and readable EBNF-style wrapper
Like most programmers, I have frequently needed to identify parts and structures that exist inside textual documents: log files, configuration files, delimited data, and more free-form (but still semi-structured) report formats. All of these documents have their own "little languages" for what can occur within them.
The way I have programmed these informal parsing tasks has always been somewhat of a hodgepodge of custom state-machines, regular expressions, and context-driven string tests. The pattern in these programs was always, roughly, "read a bit of text, figure out if we can make something of it, maybe read a bit more text afterwards, keep trying."
Parsers of the formal variety distill descriptions of the parts and structures in documents into concise, clear, and declarative rules for how to identify what makes up a document. The declarative aspect is particularly interesting here. All my old ad hoc parsers were imperative in flavor: read some characters, make some decisions, accumulate some variables, rinse, repeat. As this column's installments on functional programming have observed, the recipe style of program flow is comparatively error-prone and difficult to maintain.
Formal parsers almost always use variants on Extended Backus-Naur Form (EBNF) to describe the "grammars" of the languages they handle. The tools we look at here do so, as does the popular compiler development tool YACC (and its variants). Basically, an EBNF grammar gives names to the parts you might find in a document; additionally, larger parts are frequently composed of smaller parts. The frequency and order in which small parts may occur in larger parts is specified by operators -- mostly the same symbols you see in regular expressions. In parser-talk, each named part in a grammar is called a "production."
Possibly without even knowing it, readers have already seen EBNF descriptions at work. For example, the familiar Python Language Reference defines what a floating point number looks like in Python:
EBNF-style description of floating point number
floatnumber:   pointfloat | exponentfloat
pointfloat:    [intpart] fraction | intpart "."
exponentfloat: (nonzerodigit digit* | pointfloat) exponent
intpart:       nonzerodigit digit* | "0"
fraction:      "." digit+
exponent:      ("e"|"E") ["+"|"-"] digit+
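To make the notation concrete, here is a rough transcription of those productions into Python regular expressions. This is purely an illustration (it is not how Python's tokenizer actually works), but it shows what each production accepts:

Testing the floating point productions with Python regexes

import re

# names mirror the EBNF productions above
intpart       = r"(?:[1-9]\d*|0)"
fraction      = r"\.\d+"
pointfloat    = r"(?:%s?%s|%s\.)" % (intpart, fraction, intpart)
exponent      = r"[eE][+-]?\d+"
exponentfloat = r"(?:[1-9]\d*|%s)%s" % (pointfloat, exponent)
floatnumber   = re.compile("(?:%s)|(?:%s)" % (exponentfloat, pointfloat))

for s in ('3.14', '0.5', '10.', '1e-10', '3.14E2', 'abc'):
    print(s, bool(floatnumber.fullmatch(s)))   # all but 'abc' match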
Or you might have seen an XML DTD element defined in an EBNF style. For example, the <body> of a developerWorks tutorial looks like:
EBNF-style description in a developerWorks DTD
<!ELEMENT body ((example-column | image-column)?, text-column) >
Spellings vary slightly, but the general notions of quantification, alternation, and sequencing exist in all EBNF-style language grammars.
Building tag lists with SimpleParse
SimpleParse is an interesting tool. To use this module, you need the underlying module mxTextTools, which implements a "tagging engine" in C. mxTextTools (see Related topics later in this article) is powerful, but rather difficult to use. Once SimpleParse is layered on top of mxTextTools, the work becomes a lot easier.
Using SimpleParse is really quite simple, because it removes the need to think about most of the complexity of mxTextTools. The first thing to do is create an EBNF-style grammar that describes the language you want to handle. The second step is to call mxTextTools to create a tag list that describes all the successful productions when the grammar is applied to the document. Finally, you actually do something with the tag list returned by mxTextTools.
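In code, the three steps look roughly like the following sketch (the full working program appears later in this article; 'para' is the top-level production of the grammar we develop below, and 'input.txt' is just a placeholder name):

Sketch of the three SimpleParse steps

from simpleparse import generator
from mx.TextTools import TextTools

source_text = open('input.txt').read()   # any "smart ASCII" document (hypothetical file)
decl = open('typographify.def').read()   # step 1: an EBNF-style grammar, as a string
parser = generator.buildParser(decl).parserbyname('para')   # step 2: build a parser...
taglist = TextTools.tag(source_text, parser)                # ...and tag the source
# step 3: walk 'taglist' and do something with each matched production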
For this article, the "language" we will parse is the set of markup codes used by "smart ASCII" to indicate things like boldface, module names, and book titles. This is the very same language that earlier installments identified using mxTextTools directly, and using regular expressions and state machines before that. The language is far simpler than a full programming language would be, but complicated enough to be representative.
We probably need to back up for one moment here. What the heck is a "tag list" that mxTextTools gives us? Basically, this is a nested structure that simply gives the character offsets where every production was matched in the source text. mxTextTools traverses a source text quickly, but it does not do anything to the source text itself (at least not when using the SimpleParse grammars). Let's look at an abridged tag list:
Tag list produced from SimpleParse grammar
(1,
 [('plain', 0, 15,
   [('word', 0, 4, [('alphanums', 0, 4, [])]),
    ('whitespace', 4, 5, []),
    ('word', 5, 10, [('alphanums', 5, 10, [])]),
    ('whitespace', 10, 11, []),
    ('word', 11, 14, [('alphanums', 11, 14, [])]),
    ('whitespace', 14, 15, [])]),
  ('markup', 15, 27,
  ...
 289)
The ellipses in the middle represent a bunch more matches. But the part we see says the following. The root production ("para") succeeds and ends at offset 289 (the length of the source text). The child production "plain" matches offsets 0 through 15. This "plain" child is itself composed of smaller productions. After the "plain" production, the "markup" production matches offsets 15 through 27. The details are left out, but this first "markup" is made of components, and additional productions succeed later in the source.
An EBNF-style grammar for "smart ASCII"
We have glanced at the tag list that SimpleParse + mxTextTools can give us. But we really need to look at the grammar that was used to generate this tag list. The grammar is where the real work happens. EBNF grammars are almost self-explanatory to read (although designing one does require a bit of thought and testing):
typographify.def
para        := (plain / markup)+
plain       := (word / whitespace / punctuation)+
whitespace  := [ \t\r\n]+
alphanums   := [a-zA-Z0-9]+
word        := alphanums, (wordpunct, alphanums)*, contraction?
wordpunct   := [-_]
contraction := "'", ('am'/'clock'/'d'/'ll'/'m'/'re'/'s'/'t'/'ve')
markup      := emph / strong / module / code / title
emph        := '-', plain, '-'
strong      := '*', plain, '*'
module      := '[', plain, ']'
code        := "'", plain, "'"
title       := '_', plain, '_'
punctuation := (safepunct / mdash)
mdash       := '--'
safepunct   := [!@#$%^&()+=|\{}:;<>,.?/"]
This grammar is almost exactly the way you would describe the "smart ASCII" language verbally, which is a nice sort of clarity. A paragraph consists of some plain text and some marked-up text. Plain text consists of some collection of words, whitespace, and punctuation. Marked-up text might be emphasized, or strongly emphasized, or module names, etc. Strongly emphasized text is surrounded by asterisks. And so on. A couple of features, like just what a "word" really is or just what a contraction can end with, take a bit of thought, but the syntax of EBNF doesn't get in the way.
In contrast, the same sort of rules can be described even more tersely using regular expressions. This is what the first version of the "smart ASCII" markup program did. But this terseness is much harder to write, and harder still to tweak later. The following code expresses largely (but not precisely) the same set of rules:
Python regexs for smart ASCII markup
# [module] names
re_mods =   r"""([\(\s'/">]|^)\[(.*?)\]([<\s\.\),:;'"?!/-])"""
# *strongly emphasize* words
re_strong = r"""([\(\s'/"]|^)\*(.*?)\*([\s\.\),:;'"?!/-])"""
# -emphasize- words
re_emph =   r"""([\(\s'/"]|^)-(.*?)-([\s\.\),:;'"?!/])"""
# _Book Title_ citations
re_title =  r"""([\(\s'/"]|^)_(.*?)_([\s\.\),:;'"?!/-])"""
# 'Function()' names
re_funcs =  r"""([\(\s/"]|^)'(.*?)'([\s\.\),:;"?!/-])"""
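For instance, applying one of these patterns takes a re.sub call; the three groups capture (leading context)(marked text)(trailing context), so the replacement has to re-emit groups 1 and 3 around the inserted tags. A small illustration:

Applying a smart ASCII regex

import re

sample = "this is *strongly emphasized* text."
# re_strong as defined above
# \1: leading context, \2: the marked words, \3: trailing context
print(re.sub(re_strong, r'\1<strong>\2</strong>\3', sample))
# -> this is <strong>strongly emphasized</strong> text.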
If you discover or invent some slightly new variant of the language, it is a lot easier to play with the EBNF grammar than with those regular expressions. Moreover, using mxTextTools will generally be even faster in performing the manipulations of the patterns.
Generating and using a tag list
For our sample program, we put the actual grammar in a separate file. For most purposes, this is a good organization to use. Changing the grammar is usually a different sort of task than changing the application logic; and the files reflect this. But the whole of what we do with the grammar is pass it as a string to a SimpleParse function, so in principle we could include it in the main application (or even dynamically generate it in some way).
Let's look at our entire (compact) tagging application:
typographify.py
import os
from sys import stdin, stdout, stderr
from simpleparse import generator
from mx.TextTools import TextTools
from typo_html import codes

input = stdin.read()
decl = open('typographify.def').read()
parser = generator.buildParser(decl).parserbyname('para')
taglist = TextTools.tag(input, parser)
for tag, beg, end, parts in taglist[1]:
    if tag == 'plain':
        stdout.write(input[beg:end])
    elif tag == 'markup':
        markup = parts[0]
        mtag, mbeg, mend = markup[:3]
        start, stop = codes.get(mtag, ('<!-- unknown -->', '<!-- / -->'))
        stdout.write(start + input[mbeg+1:mend-1] + stop)
stderr.write('parsed %s chars of %s\n' % (taglist[-1], len(input)))
Here is what it does. First we read in the grammar and create an mxTextTools parser from it. Next we apply the tag-table/parser to the input source to create a tag list. Finally, we loop through the tag list and emit some new marked-up text. The loop could, of course, do anything else desired with each production encountered.
For the particular grammar used for smart ASCII, everything in the source text is expected to fall into either a "plain" production or a "markup" production. Therefore, it suffices to loop across a single level in the tag list (except when we look exactly one level lower for the specific markup production, such as "title"). But a more free-form grammar -- such as occurs for most programming languages -- could easily recursively descend into the tag list, and look for production names at every level. For example, if the grammar were to allow nested markup codes, this recursive style would probably be used. You might enjoy the exercise of figuring out how to adjust the grammar (hint: remember that productions are allowed to be mutually recursive).
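A recursive walk needs nothing beyond the (production, begin, end, children) tuple shape we saw in the tag list above; a sketch of such a descent might look like:

Recursive descent over a tag list

def walk(matches, depth=0):
    # each match is (production-name, begin-offset, end-offset, children)
    for tag, beg, end, parts in matches:
        print('%s%s: %d-%d' % ('  ' * depth, tag, beg, end))
        walk(parts, depth + 1)      # children share the same shape

walk(taglist[1])    # taglist[1] holds the top-level matches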
The particular markup codes that go to the output live in yet another file, for organizational rather than essential reasons. A little trick of using a dictionary as a switch statement is used here (although the otherwise case remains too narrow in the example). The idea is just that we might in the future want to create multiple "output format" files for, say, HTML, DocBook, LaTeX, or others. The particular markup file used for the example looks like:
typo_html.py
codes = \
  { 'emph'   : ('<em>', '</em>'),
    'strong' : ('<strong>', '</strong>'),
    'module' : ('<em><code>', '</code></em>'),
    'code'   : ('<code>', '</code>'),
    'title'  : ('<cite>', '</cite>'),
  }
Extending this to other output formats is straightforward.
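For example, a hypothetical typo_docbook.py of the same shape (my illustration, not one of the article's actual downloads) might map the identical production names onto DocBook tags:

typo_docbook.py (hypothetical)

codes = \
  { 'emph'   : ('<emphasis>', '</emphasis>'),
    'strong' : ('<emphasis role="bold">', '</emphasis>'),
    'module' : ('<classname>', '</classname>'),
    'code'   : ('<literal>', '</literal>'),
    'title'  : ('<citetitle>', '</citetitle>'),
  }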
Conclusion
SimpleParse provides a concise and very readable EBNF-style wrapper around the underlying power and speed of the cryptic mxTextTools C module. Moreover, EBNF grammars are already familiar to many programmers, even if only in passing. I cannot prove anything about what is easier to understand -- intuitions differ -- but I can comment quantitatively on source length. The mxTypographify module that was manually developed earlier is the following size:
wc mxTypographify.py
199 776 7041 mxTypographify.py
Of these 199 lines, a fair number are comments. And 18 of those lines are a regular expression version of the markup function, included for timing comparisons. But what the program does is essentially identical to what typographify.py -- listed above -- does. In contrast, our SimpleParse program, including its support files, comes to:
wc typo*.def typo*.py
 19  79  645 typographify.def
 20  79  721 typographify.py
  6  25  205 typo_html.py
 45 183 1571 total
In other words, about one fourth as many lines. This version has fewer comments, but that is mostly because the EBNF grammar is fairly self-documenting. I would not want to emphasize LOC too strongly -- obviously, you can play games with minimizing or maximizing code length. But in a general way, one of the few real empirical results from studies of programmers is that kLOC/programmer-month is fairly close to constant across languages and libraries. Of course, the regular expression version is, in turn, one third as long as the SimpleParse version -- but I think the density of its expression makes it fragile to maintain and harder to write. On balance, I think SimpleParse wins among the approaches considered.
Related topics
- Read the previous installments of Charming Python.
- Read David's previous Charming Python column on developerWorks summarizing Python's text processing facilities: Charming Python: Text processing in Python.
- mxTextTools is now part of the larger eGenix package of extensions. For more information, read mxTextTools - Fast Text Manipulation Tools for Python.
- The reference module mxTypographify was built using mxTextTools directly. The SimpleParse version becomes much more readable: http://gnosis.cx/download/mxTypographify.py
- Check out Mike Fletcher's SimpleParse, along with a brief introduction to its usage.
- Visit John Aycock's Spark module home page. Spark is in many ways a more sophisticated parsing framework than is SimpleParse. A number of Python developers have recommended Spark, which has the additional virtue of being pure-Python (with a corresponding natural disadvantage in terms of speed).
- Get more information on the ISO 14977 standard for EBNF syntax.
- Browse more Linux resources on developerWorks.
- Browse more Open source resources on developerWorks.