Parsing with Perl modules

The right tools simplify the tasks and extend the grammars

One of Perl's main goals is parsing text. This tutorial discusses CPAN modules for text parsing, and shows how you can use them easily in your own programs. Analyzing code comments, adapting existing lex grammars, and many other tasks can be easy with the right tools. Teodor shows examples of each one, with an eye to real-world programming.

Teodor Zlatanov (tzz@iglou.com)Gold Software Systems

Teodor Zlatanov graduated with an M.S. in Computer Engineering from Boston University in 1999. He has worked as a programmer since 1992, using Perl, Java, C, and C++. His interests are in open source work on text parsing, 3-tier client-server database architectures, UNIX system administration, CORBA, and project management. He can be contacted at tzz@iglou.com.



01 April 2000

Also available in Japanese

Perl is an excellent language for text analysis. The built-in operators make text searching, replacing, and pattern matching effortless. When programmers learn Perl, they frequently try to write their own routines to parse text or data. Fortunately, CPAN (Comprehensive Perl Archive Network; see Resources) has a large collection of modules, some of which relieve you from the problems of text and data analysis.

Using Perl modules for parsing, lexing, and analysis

Parse::RecDescent, by Damian Conway, is a powerful tool for lexing and parsing text. Lingua::EN::Fathom, by Kim Ryan, can analyze a file or a block of text, and produce various statistics about its input. See Resources for both tools.

The downside of Parse::RecDescent is that it is moderately slow because it uses extensible grammar rules and on-the-fly lexing and parsing. If the module is used badly, its performance degrades. The upside is that Parse::RecDescent shines in lexing and parsing. The next section shows a lex grammar ported to Parse::RecDescent's grammar. lex will always do its job better than any other tool. The chef.pl script (see Resources), which runs slower than a compiled C program using the chef.l lex grammar, can do much more. Make sure you know your tools and use the appropriate one for the job.


Adapting existing lex grammars

The Swedish Chef lex grammar, by John Hagerman, is a great example of a simple text filter. It's also a lot of fun, having amused many computer science and engineering students on the eve of their finals. I will show an example of porting the chef.l grammar to Perl using the Parse::RecDescent module (not the ideal choice for this task -- the Parse::Lex module would be a better one). This section is intended only as an introduction to the rules of building a Parse::RecDescent syntax and will include actions, remembering the state, rejecting productions, and lexing text. Remember to try the chef.pl script yourself -- you may just find yourself addicted.

The chef.pl script is an almost exact copy of the chef.l lex grammar. The $niw variable is set to 0 at startup, because many rules test it to see if they should be accepted or rejected. $niw stands for "not in word", and it is set to 1 when the parser is inside a word. The directive to Parse::RecDescent will reject a rule if the variable named in the directive is non-zero. So keep in mind that $niw = 0 means that you are not inside a word.

The skip variable was set to '' (empty string), so all input, including spaces, goes to the token directive. Also, the chef rule ends on \z, which is the end of the string. Usually, \Z is used, but that can also match a newline in Perl, and those may also be in the input.

The chef rule: The grammar begins with the chef rule. The chef rule matches a number of tokens, up to the \z end of string. Those two elements of the chef rule are called "productions." Any rule must be made up of productions. An action can be part of a production; it is marked by braces {} and contains Perl code. It doesn't match anything -- it is simply executed.

The token rule: The token rule can match any of a number or sequences, somewhat arbitrarily, which I named to match the chef.l grammar. I'll explain a few examples, so that the grammar correspondence is clear.

The basic grammar definitions of a word/non-word character:
chef.pl: WC: /[A-Za-z']/
chef.pl: NW: /[^A-Za-z']/ 
chef.l:  WC  [A-Za-z']
chef.l:  NW  [^A-Za-z']

The an rule: The simplest rules do not depend on anything. The an rule is a good example: Whenever it sees 'an', it prints 'un'. Also, it sets $niw to 1 (remember, that means you are inside a word).

chef.pl: an: /an/ { $niw = 1; print 'un' }	
chef.l:  "an"		{ BEGIN INW; printf("un"); }

The ax rule: The next more complex rule is the ax rule. It says, if an 'a' shows up, and is followed by a word character WC, print 'e'. The ...WC production syntax means that a word character must follow the a, but will not be consumed in the match. Thus, 'aan' produces 'eun' with the an and ax rules. The rule sets $niw to 1 (inside a word).

chef.pl: ax: /a/ ...WC { $niw = 1; print "e" } 
chef.l:  "a"/{WC}	{ BEGIN INW; printf("e"); }

The en rule: The en rule works exactly like the ax rule, but with a NW (non-word) production anticipated to follow. This means that the 'en' must be at the end of a word.

chef.pl: en: /en/ ...NW { $niw = 1; print "ee" }
chef.l:  "en"/{NW}	{ BEGIN INW; printf("ee"); }

The ew rule: The ew rule succeeds only if you are inside a word. That's why you reject it if $niw is 0.

chef.pl: ew:  /ew/ { $niw = 1; print "oo" }
chef.l:  "ew"	{ BEGIN INW; printf("oo"); }

The i rule: The i rule will succeeds only if you are inside a word, and have not seen another i yet. It augments $i_seen to 1, and $i_seen is only set back to 0 if a non-word character or a newline are seen.

chef.pl: i:   /i/ { $niw=1;$i_seen=1; print "ee" }
chef.l:  "i"	{ BEGIN INW; printf(i_seen++ ? "i" : "ee"); }

The end of sentence rule: The end of sentence markers [.!?] in any quantity, followed by space, will be printed, followed by the famous (or infamous, take your pick) "Bork Bork Bork!" message. The actual behavior deviates slightly from the original chef filter, only because I like it better that way (one can never have enough Bork messages). The $item[1] syntax means that the spaces won't be matched, since they are known as $item[2] to Parse::RecDescent.

chef.pl: end_of_sentence: /[.?!]+/ /\s+/
         { $niw = 0; $i_seen = 0;
           print $item[1] . "\nBork Bork Bork!\n" }
chef.l:  [.!?]$
         { BEGIN NIW;
           i_seen = 0;
           printf("%c\nBork Bork Bork!",
           yytext[0]);
         }

Extending your grammar dynamically

The extend-grammar.pl script, which evolved from the demo_selfmod.pl script supplied with the Parse::RecDescent module, demonstrates an extensible grammar. Initially, the grammar consists of only one proper name, Ted (modesty aside, it's a great name). The name rule matches 1 to n word characters (according to Perl's \w syntax). The do_you_know rule matches the words "do you know" separated by any number of spaces, and in any combination of upper and lower case. This is where the Perl i match modifier comes in handy, so you can express simple concepts in a simple way.

proper_name: /Ted/
  name: /\w+/
  do_you_know: /do/i /you/i /know/i

The extend-grammar process rule: The process rule can consist of a query or a definition. Unless it finds one or the other, it triggers a message in the main loop.

  process: query | definition

... and later ...

while (<>)
{
 $parse->process($_)
   or print "Enter a query (do you know ...)"
            "or a definition (... exists)\n";
}

The extend-grammar query rules: The query rule consists of a do_you_know production, followed by a name or a proper name. For a name, the action is to print a message that it is unknown. For a proper name (as defined by the proper_name rule), the action is to print out a message that it is known. Two rules of the same name are equivalent to one rule with two alternative productions, so:

query: do_you_know proper_name
           { print "I know " .
                   $item{proper_name} .
                   ",sure!\n"
           }
query: do_you_know name
           { print $item{name} .
                   " does not exist in my little world",
                   ", sorry.\n"
           }

is equivalent to:

query: do_you_know proper_name
           { print "I know " .
               $item{proper_name} .
               ",sure!\n"
           }
       | do_you_know name
           { print $item{name} .
               " does not exist in my little world",
               ", sorry.\n"
           }

The extend-grammar definition rule: The heart of this extensible grammar is the definition rule. If a name is followed by 'exists', then the action will extend the parser with a new rule for proper_name. You can modify the very grammar you are executing, while it's running.

definition: name /exists/i
         {
            $thisparser->Extend("proper_name: '$item{name}'");
            print "\"$item{name}\" is now a valid proper name\n";
         }

$thisparser is a reference to the parser that's executing the grammar actions. In addition to Extend, you can also use a Replace method (think of #ifdef statements in C, for example) to change a rule's contents.

To play with extend-grammar.pl (see the full listing in Resources), just run it and type either "do you know" or "exists". Anything else is not matched by the parser's process rule, and so is rejected. The proper_name rule begins with 'Ted', a known proper name.


Analyzing C/C++ source code comments for readability

The stat-comments.pl script (see the listing in Resources) uses the demo_decomment.pl script's grammar for parsing C/C++ code to extract comments. The demo_decomment.pl script comes with the Parse::RecDescent module.

In addition, the stat-comments.pl script uses the Lingua::EN::Fathom module to analyze the comments parsed out by the Parse::RecDescent grammar.

First, stat-comments.pl creates the grammar (in the $Grammar variable, in a BEGIN block placed at the end of the script). The grammar's program rule returns a hash reference with the code, comments, and strings available inside it as text. For the rest of the grammar, see the Parse::RecDescent documentation. (The C parsing grammar also works with the Java program's source code.)

Then, stat-comments.pl reads in either the text input, or the sample data provided at the end of the script, and checks to see if comments were obtained at all. Setting $/ to undef has the effect of reading all the input into $text at once, newlines included.

my $text = @ARGV ? <> : ;
my $parts = $parser->program($text) or die "malformed C program";
# only work with comments of length > 0
die "No comments found in input" unless length $parts->{comments};

Next, stat-comments.pl converts comment marks to periods, so comments are separated by periods. This is done only for neatness, and won't have much of an effect on the final statistics. Finally, a Lingua::EN::Fathom object is created, and a wrapped block of text (see Text::Wrap) is passed to it for analysis. The report is then printed out.

$parts->{comments} =~ s#//#. #g;
$parts->{comments} =~ s#/\*#. #g;
$parts->{comments} =~ s#\*/#. #g;

# we can now evaluate the comments (stored in $parts->{comments})
my $fathom = new Lingua::EN::Fathom; 
$fathom->analyse_block(wrap('', '', $parts->{comments}));

# voila, the readability report!
print($fathom->report);

The potential applications of the stat-comments.pl script are in code quality control. It is a well known fact that well documented programs are easier to maintain, and many organizations find this highly desirable. The stat-comments.pl script won't distinguish between good and bad comments. It does, however, tell the code manager if comments from a programmer are unusually terse, unusually verbose, or difficult for a nonprogrammer to understand. Try it for yourself: Run stat-comments.pl with and without the long comment block that begins with "We should raise...", and see the difference in statistics. Clearly, software project managers can use these statistics to their advantage.

The stat-comments.pl script is a valuable software project management tool, but only if properly used. Just like function point counting, lines of code per day, and many other statistics available to a software project manager, the statistics produced by stat-comments.pl must be seen as a complement to the real world, not the essence thereof. Many good programmers comment badly, and many bad programmers produce excellent comments. What is important is understanding their modus operandi, and recognizing patterns and changes in it.


Final notes

The Perl modules available today can make any parsing task easier. In addition to Parse::RecDescent, there's also Parse::Lex, Parse::CLex, and Parse::Yapp, available from CPAN. Check them out, and see which one is most appropriate for your situation. Parse::RecDescent is the most flexible and powerful module in this list.

For an example of Parse::RecDescent put to great use, see the CPAN RFC::RFC822::Address by Abigail. Without Parse::RecDescent, syntactic validation of an RFC 822 e-mail address is almost impossible, even with the power of Perl's regular expressions.

Text analysis is difficult if you try to create your own tools. The CPAN Lingua modules can give you power and flexibility in any text analysis task. Visit the CPAN Web site (see Resources) for many other Lingua modules.

This article shows how Perl code can be effectively reused. The demonstration scripts that come with the Parse::RecDescent module provided the basis for two of the three scripts presented here. Thanks go to Damian Conway for the Parse::RecDescent module, Helmut Jarausch for the demo_decomment.pl script, Kim Ryan for the excellent Lingua::EN::Fathom, and John Hagerman for the original chef filter grammar.

Resources

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Linux on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Linux
ArticleID=11002
ArticleTitle=Parsing with Perl modules
publish-date=04012000