Skip to main content

If you don't have an IBM ID and password, register here.

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. This profile includes the first name, last name, and display name you identified when you registered with developerWorks. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

Charming Python: A look at DParser for Python

Exploring another Python parser

David Mertz, Ph.D. (mertz@gnosis.cx), Developer, Gnosis Software, Inc.
David Mertz
David Mertz has been writing the developerWorks columns Charming Python and XML Matters since 2000. Check out his book Text Processing in Python. For more on David, see his personal Web page.

Summary:  Get an introduction to DParser, a simple but powerful tool for parsing, written by J. Plevyak. Then learn about DParser for Python, which gives Python programmers a seamless interface to DParser, and see how it compares to other parsers covered in previous installments. In a manner similar to Spark or PLY, grammar rules are input to DParser using Python function documentation strings.

Date:  28 Jul 2004
Level:  Introductory

Comments:  

There are many Python parser libraries available. I have discussed mx.TextTools, SimpleParse, and SPARK in this column and covered PLY in my book (see Resources for links to these documents). Offhand, I know of PyGgy, Yapps, PLEX, PyLR, PyParsing, and TPG, and I have vague recollections of reading announcements of a half dozen others. This is a category where users might be frustrated, not by the dearth, but by the glut of high quality libraries.

What distinguishes DParser from all the others? Well, like PLY and Spark, DParser for Python utilizes function documentation strings to indicate its productions. This style lets you put action code right inside a production for events that should occur when a particular grammar rule is fulfilled; the production-in-docstring style also lets you manipulate parse trees as parsing occurs. In contrast to PLY or Spark, DParser itself is written in C, and is thus likely to be considerably faster than pure-Python parsers. DParser for Python is a fairly thin wrapper around an underlying C library -- callbacks into Python take some extra time, but the basic parsing is at C speeds. For this article, however, I have not attempted any specific benchmarks. So exactly how fast or slow DParser is compared to other parsers is not something I can directly comment on.

In my own mind, I remain quite fond of SimpleParse's approach. SimpleParse is a wrapper to the fast mx.TextTools library (also written in C) that cleanly separates an EBNF grammar language from Python code. Usually, using SimpleParse means generating a parse tree in one function call, then traversing this tree in separate code. For very large parsed documents, such a two-step approach might be inefficient, but I find it easier to understand code written this way.

Still, enough of my readers have recommended DParser for Python that it warrants a look, despite my preference for an isolated EBNF definition. By the way, as you will see in the examples, DParser uses no separate tokenization pass, just straight parsing. You can control white space recognition (that separates parse symbols) by defining the reserved d_whitespace() function; this enables you to tokenize however you like.

Finding a longest production

As a first example of a DParser for Python program, I created a grammar that looks for several patterns that are sub-productions of one another. This grammar handles a problem similar to the "dangling else" issue for many parsers. Specifically, how do you know when to stop looking for larger productions? (For example, is an "if" followed by an "else"?) My grammar looks at phrases that might contain words starting with "a," with "b," and with "c," in that order. All non-included words are just part of the "head" or "tail" of the phrase. A few examples illustrate this. First, the program itself:


Listing 1. Parser abc.py
#!/usr/bin/env python2.3
"Identify sequence of a-, b-, c- words"
#
#-- The grammar
def d_phrase(t, s):
    'phrase : words ( ABC | AB | A ) words'
    print "Head:", ''.join(s[0])
    print t[1][0]+":", ''.join(s[1])
    print "Tail:", ''.join(s[2])
def d_words(t):
    'words : word*'
def d_word(t):
    'word : "[a-z]+" '
def d_A(t):
    '''A : "a[a-z]*" '''
    return 'A'
def d_AB(t):
    '''AB : A "b[a-z]*" '''
    return 'AB'
def d_ABC(t):
    '''ABC : AB "c[a-z]*" '''
    return 'ABC'
#
#-- Parse STDIN
from dparser import Parser
from sys import argv, stdin
phrase, arg = stdin.read(), argv[-1]
Parser().parse(phrase,
               print_debug_info=(arg=='--debug'))

Let's run the parser against a few phrases to illustrate:


Listing 2. Simple to parse phrases
$ echo -n "alpha" | ./abc.py
Head:
A: alpha
Tail:
echo -n "xavier alpha beta charlie will" | ./abc.py
Head: xavier
ABC: alpha beta charlie
Tail: will
$ echo -n "mable delta xavier bruce" | ./abc.py
Traceback (most recent call last): [...]
dparser.SyntaxError:
syntax error, line:1
mable delta xavier bruce[syntax error]

So far, so good, apparently. My grammar finds an ABC when it is available, but settles for an A, or AB, when that is all there is to find.

But in truth, my grammar has a lot of problems when it encounters ambiguous phrases. In most cases, it appears that DParser will fall into an infinite loop when it cannot decide how to parse a phrase (probably the worst outcome; at least a traceback or reported error would tell you what went wrong). Sometimes (on my Mac OSX machine, at least), it produces a "Bus error" instead. I do not like either of those cases.


Addressing ambiguity

Since all of the terminal productions have the same priority, the parser cannot decide how to parse something like:


Listing 3. Trying to parse an ambiguous phrase
$ echo -n "alex bruce alice benny carl" | ./abc.py

Is this an AB followed by words? Is it words followed by ABC? For that matter, is it all words (made of five word productions), and should it raise a dparser.SyntaxError? I wind up with a "Bus error" or a frozen task, rather than a parse. In the prior examples, the ambiguity happened to work out because of the eagerness of each production; once an ABC was found, the leading and trailing words fell into place.

It is quite confusing, actually, to understand exactly why the prior grammar works in the cases that it does -- more confusing, in a way, than understanding why it sometimes does not work.

Let's say that our desire in parsing a phrase is to find an ABC production whenever one exists, even if some other production (that is, an AB) is satisfied first in left-to-right traversal. We can do that by elevating the priority of the ABC terminal production:


Listing 4. Revised d_ABC() production function for abc2.py
def d_ABC(t):
    'ABC : AB "c[a-z]*" $term 1'
    return 'ABC'

If a priority is not specified, a production is priority 0. Otherwise, any positive or negative integer may be used to rank productions. Now we can run:


Listing 5. Successfully finding late ABC
$ echo -n "alex bruce alice benny carl" | ./abc2.py
Head: alex bruce
ABC: alice benny carl
Tail:

Notice that alternatives within the (ABC|AB|A) alternation are all tried before the parser looks for the trailing words. So this succeeds without any priority specification:


Listing 6. No ambiguity issue between A and AB
$ echo -n "alex alice benny" | ./abc.py    Head: alex
AB: alice benny
Tail:

I found some difficult-to-explain anomalies in the behavior of DParser in the face of ambiguity. For example, adding a trailing word that is definitely not an A convinces the parser to work -- but only if ran with debugging information!


Listing 7. Erratic behavior in face of ambiguity
$ echo -n "alex bruce alice benny carl dave" | ./abc.py
[...process freezes...]
$ echo -n "alex bruce alice benny carl dave" | ./abc.py --debug
[...debugging trace of speculative and final productions...]
Head: alex bruce
ABC: alice benny carl
Tail: dave

The priority specification in abc2.py resolves the parse in either case.

It is rather subtle and hard to understand exactly when ambiguities are resolved. Basically, productions are fulfilled as they are seen, left-to-right, and each one tries to grab as many words as it can, left-to-right. Backtracking only happens when something is unambiguously wrong in a lookahead. Roughly, anyway.


A little on debugging

One thing I like about DParser is its option to display debugging information. Seeing this does not necessarily make it obvious how to create a correct grammar, but at least it provides nice insight into the actions a parser takes when processing a particular phrase. For example:


Listing 8. Showing a trace of speculative productions
#------- Showing a trace of speculative productions
$ echo -n "alex alice benny carl dave" | ./abc2.py --debug
               d_words ???:
                   d_A ???:     alex
                d_word ???:     alex
               d_words ???:
              d_phrase ???:     alex
               d_words ???:     alex
                   d_A ???:     alice
                d_word ???:     alice
               d_words ???:
               d_words ???:     alice
              d_phrase ???:     alex alice
              d_phrase ???:     alex alice
               d_words ???:     alex alice
                d_word ???:     benny
                  d_AB ???:     alice benny
               d_words ???:     benny
               d_words ???:     alice benny
               d_words ???:
              d_phrase ???:     alex alice benny
              d_phrase ???:     alex alice benny
              d_phrase ???:     alex alice benny
               d_words ???:     alex alice benny
                d_word    :     alex
               d_words    :     alex
                   d_A    :     alice
                  d_AB    :     alice benny
                 d_ABC ???:     alice benny carl
               d_words ???:
              d_phrase ???:     alex alice benny carl
                 d_ABC    :     alice benny carl
                d_word ???:     dave
               d_words ???:     dave
              d_phrase ???:     alex alice benny carl dave
                d_word    :     dave
               d_words    :     dave
              d_phrase    :     alex alice benny carl dave
Head: alex
ABC: alice benny carl
Tail: dave

The productions followed by question marks are speculative attempts; those without are final productions actually followed. Related to this, DParser gives you the capability to take actions differentially when a production is entered either speculatively or on a final parse. By default, the actions in a function body are only performed on a final parse. But you may specify either of two extra arguments to a production to handle speculative parses. (There are also a number of other optional arguments not discussed in this article.)


Listing 9. Actions during speculative parsing
def d_prod1(t, spec_only):
    'prod1 : this that+ other?'
    print "Speculative parse of prod1"
def d_prod2(t, spec):
    'prod2: spam* eggs toast'
    if spec:
        print "Speculative parse of prod2"
    else:
        print "Final parse of prod2"

Of course, all the information displayed by my speculative productions is also displayed (in slightly different form) by using the print_debug_info argument to dparser.Parser.parse(). But you could also decide to take other actions -- trigger external events, for example.


More on priorities

The earlier use of an assigned priority for the ABC production was a little bit of a hack, I confess. But in cases of straightforward ambiguity, fine tuning terminal priorities is a great tool. Let me give another grammar with a straightforward ambiguity:


Listing 10. Term-for-term ambiguous grammar, ibm.py
def d_phrase(t, s):
    'phrase : word ( vowel | caps | threeletter ) word'
    print "Head:", ''.join(s[0])
    print t[1][0]+":", ''.join(s[1])
    print "Tail:", ''.join(s[2])
def d_word(t): 'word : "[A-Za-z]+" '
def d_vowel(t):
    'vowel : "[AEIOUaeiou][A-Za-z]*"'
    return 'VOWEL'
def d_caps(t):
    'caps : "[A-Z]+"'
    return 'CAPS'
def d_threeletter(t):
    'threeletter : "[A-Za-z][A-Za-z][A-Za-z]"'
    return '3LETT'
#-- Parse STDIN
from dparser import Parser
from sys import stdin
Parser().parse(stdin.read())

The productions for vowel, caps, and threeletter are not necessarily distinct; they can all grab an overlapping set of words. For example:


Listing 11. When DParser detects ambiguity gracefully
$ echo -n "Read IBM developerWorks" | ./ibm.py
Traceback (most recent call last): [...]
dparser.AmbiguityException:  [...]

Of course, you might get lucky with a particular phrase:


Listing 12. Fortuitous avoidance of parse ambiguity
$ echo -n "Read GNOSIS website" | ./ibm.py
Head: Read
CAPS: GNOSIS
Tail: website

Rather than hope for good luck, let's explicitly indicate a priority between productions:


Listing 13. Resolving ambiguous term, ibm2.py
def d_vowel(t):
    'vowel : "[AEIOUaeiou][A-Za-z]*" $term 3'
    return 'VOWEL'
def d_caps(t):
    'caps : "[A-Z]+" $term 2'
    return 'CAPS'
def d_threeletter(t):
    'threeletter : "[A-Za-z][A-Za-z][A-Za-z]" $term 1'
    return '3LETT'

Now every phrase would like to identify middle word types in a specific order (only of ones possible, of course):


Listing 14. Disambiguated parse results
$ echo -n "Read IBM developerWorks" | ./ibm2.py
Head: Read
VOWEL: IBM
Tail: developerWorks
$ echo -n "Read XYZ journal" | ./ibm2.py
Head: Read
CAPS: XYZ
Tail: journal


Making a decision

Despite receiving recommendations from several readers, I am only lukewarm about DParser. It has quite a few powerful switches and options to productions that I have not discussed -- for example, specifying associativity. Overall, the DParser language is quite robust, and I strongly suspect that DParser for Python will run significantly faster than most pure-Python parsers.

However, I still cannot work up a great deal of enthusiasm for the function docstring style of parsers. Obviously, many excellent Python programmers disagree with me on this. But I also found some of the parse results a little bit mystifying: Why succeed in debug mode, but not standard mode? Exactly when does ambiguity come up? Developing a grammar with any parsing tool produces similar surprises, but I found DParser's somehow particularly surprising; SimpleParse, for example, surprised me less. Probably if I knew more about the ins-and-outs of the underlying parsing algorithms, it would make more sense; in my relative ignorance, I am probably a lot like 95%+ of my readers, though. There are people who know more about parsing than I do; but in truth, most programmers still know less.


Resources

  • Check out the DParser homepage on SourceForge.

  • Documentation of the specifics of the Python binding are available. Many of the advanced features, however, are best documented in the DParser library documentation -- the Python wrapper is fairly thin.

  • Several earlier Charming Python installments have looked at other parsers for Python. Spark and PLY are similar to DParser in using docstrings of functions to describe productions. mx.TextTools is a lower-level state machine, and SimpleParse makes use of a non-Python EBNF syntax to describe grammars (and translates it to fast mx.TextTools operations behind the scenes).

  • Browse all of the Charming Python columns on developerWorks.

  • David's Text Processing in Python (Addison Wesley, 2003) discusses both SimpleParse and mx.TextTools (borrowing from my corresponding developerWorks articles), but adds a discussion of PLY. You can read the book online, or buy it at fine bookstores.

  • Find more resources for Linux developers in the developerWorks Linux zone.

  • Browse for books on these and other technical topics.

  • Develop and test your Linux applications using the latest IBM tools and middleware with a developerWorks Subscription. You get IBM software from WebSphere, DB2, Lotus, Rational, and Tivoli, and a license to use the software for 12 months, all for less money than you might think.

  • Download no-charge trial versions of selected developerWorks Subscription products that run on Linux, including WebSphere Studio Site Developer, WebSphere SDK for Web services, WebSphere Application Server, DB2 Universal Database Personal Developers Edition, Tivoli Access Manager, and Lotus Domino Server, from the Speed-start your Linux app section of developerWorks. For an even speedier start, help yourself to a product-by-product collection of how-to articles and tech support.

About the author

David Mertz

David Mertz has been writing the developerWorks columns Charming Python and XML Matters since 2000. Check out his book Text Processing in Python. For more on David, see his personal Web page.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in

If you don't have an IBM ID and password, register here.


Forgot your IBM ID?


Forgot your password?
Change your password


By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. This profile includes the first name, last name, and display name you identified when you registered with developerWorks. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)


By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Linux
ArticleID=15037
ArticleTitle=Charming Python: A look at DParser for Python
publish-date=07282004
author1-email=mertz@gnosis.cx
author1-email-cc=tomyoung@us.ibm.com

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

For articles in technology zones (such as Java technology, Linux, Open source, XML), Popular tags shows the top tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), Popular tags shows the top tags for just that product zone.

For articles in technology zones (such as Java technology, Linux, Open source, XML), My tags shows your tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), My tags shows your tags for just that product zone.

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).