Cultured Perl: Perl 6 grammars and regular expressions

Comparing the Perl 6 grammars and regular expressions with the Perl 5 Parse::RecDescent module

Teodor Zlatanov (tzz@bu.edu), Programmer, Gold Software Systems
Teodor Zlatanov graduated with an M.S. in computer engineering from Boston University in 1999. He has worked as a programmer since 1992, using Perl, Java™, C, and C++. His interests are in open source work, Perl, text parsing, three-tier client-server database architectures, and UNIX system administration. Suggestions and corrections are welcome; contact Ted at tzz@bu.edu.

Summary:  Perl 6 is finally coming within reach. In this article, Ted gives you a tour of the grammars and regular expressions of the Perl 6 language, comparing them with the currently available Parse::RecDescent module for Perl 5. Find out what will be new with Perl 6 regular expressions and how to make use of the new, powerful incarnation of the Perl scripting language.

Date:  02 Nov 2004
Level:  Intermediate

For any Perl programmer, the Perl 6 project is a hot topic. Perl has always been an evolving language, and Perl 6 has definitely evolved from Perl 5 in almost every way imaginable (but you can still tell they come from the same camel). Perl 6 will run on top of Parrot, a versatile virtual machine that will be able to load and interpret not only Perl 6 bytecode, but many other languages as well.

Don't let the future tense above worry you. If you've ever seen a building being built over several months, you know that after the foundation is dug, the metal skeleton seems to stand forever. Workers come and go, and there's activity, but the facade is the same old ugly rusty metal. Then suddenly, over a few days, the building is finished. The Perl 6 project is at that long interim phase, rusty metal showing and workers hidden deep inside. If you want an inside look at the project's progress, check out the latest Parrot release and the weekly Perl 6 updates (see Resources for a link).

This article gives you a tour of the grammars and regular expressions of the Perl 6 language, comparing them with the currently available Parse::RecDescent module for Perl 5. Previous knowledge of Perl 5, some familiarity with Parse::RecDescent, and experience with lexing and parsing will help you greatly through this article, but it is written for any Perl programmer interested in the Perl 6 grammars and regular expressions.

Perl 6 regular expressions and grammars overview

One thing needs to be stated right away: Perl 6 will support Perl 5 regular expressions by using the :p5 modifier. This will be a blessing for those who are not interested or not willing to convert to the Perl 6 regular expressions. Furthermore, the Perl 6 regular expressions can be, but don't have to be, radically different from their Perl 5 counterparts.

Perl 6 regular expressions can be reused when necessary. Reusing regular expressions to match a single word is ridiculous; reusing them when parsing a configuration file is almost a necessity (depending on the complexity of the configuration syntax, how often it changes, and so on).

The Regexp::Common module (see Resources) already packages reusable regular expressions for Perl 5, but they have to be hidden behind a module interface because Perl 5 has no built-in way to name a pattern and call it from other patterns. Perl 6 allows this reuse from the start.

While you can write dense, illegible regular expressions in Perl 6 just as in Perl 5, insignificant whitespace and comments are turned on by default; so, whereas in Perl 5 you can match "hello there" with /hello there/ itself, in Perl 6 you'd have to ask for /hello <sp> there/ instead. This allows for clear separation of terms in the regular expression.
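Perl 5 programmers can preview this behavior with the /x modifier, which makes whitespace inside a pattern insignificant, much as Perl 6 does by default. The following is a minimal Perl 5 sketch rather than Perl 6 code; the helper name matches_with is made up for the illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: report whether $text matches the compiled pattern $re.
sub matches_with {
    my ($re, $text) = @_;
    return $text =~ $re ? 1 : 0;
}

my $literal  = qr/hello there/;        # Perl 5 default: the space is literal
my $extended = qr/hello there/x;       # /x: the space in the pattern is ignored
my $explicit = qr/hello [ ] there/x;   # /x: a space has to be asked for explicitly

print matches_with($literal,  "hello there"), "\n";  # 1
print matches_with($extended, "hello there"), "\n";  # 0: the pattern now means "hellothere"
print matches_with($extended, "hellothere"),  "\n";  # 1
print matches_with($explicit, "hello there"), "\n";  # 1
```

The [ ] character class plays roughly the role that <sp> plays in the Perl 6 example: it is the explicit request for a space.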

More importantly, Perl 6 regular expressions, when used in rules inside grammars, are by necessity less dense. The programmer will (I hope, and so does Larry Wall) recognize that Listing 2 is far easier to understand and maintain than Listing 1:


Listing 1. Regular expressions without grammar
# note this is just a language example, not an accurate name matcher
# Perl 6 <[A-Z]> is equivalent to the Perl 5 [A-Z]
# Perl 6 :w modifier surrounds all tokens with "automagic" whitespace,
# which basically means it will match what most people would call
# "words"
$name = m:w/ <[A-Z]><[a-z]>+ <[A-Z]><[a-z]>+ /;


Listing 2. Regular expressions as rules in a grammar
# note this is just a language example, not an accurate name matcher
grammar English
{
 rule name :w { <singlename> <singlename> };
 rule singlename { <[A-Z]><[a-z]>+ };
};

Not only is Listing 2 far more readable, it is also easier to maintain. For instance, Perl 6 comes with the <upper> and <lower> rules already defined, which will make things easier:


Listing 3. Regular expressions as rules in a grammar, improved
# note this is just a language example, not an accurate name matcher
grammar Names
{
 rule name :w { <singlename> <singlename> };
 rule singlename { <upper><lower>+ };
};

Voila! We have just reused code when we used <upper> and <lower>. Furthermore, we can now handle Unicode names as well, whereas before we were strictly limited to names that begin with A through Z. Code reuse is a wonderful thing.
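To see the difference between the two singlename rules today, here is a rough Perl 5 approximation. It assumes that the Unicode properties \p{Lu} and \p{Ll} are close enough to <upper> and <lower> for the purpose of the demonstration; the sub names are invented for this sketch.

```perl
use strict;
use warnings;
use utf8;   # the source code below contains non-ASCII string literals

# Perl 5 approximations of the two singlename rules.
my $ascii_name   = qr/^[A-Z][a-z]+$/;      # like <[A-Z]><[a-z]>+
my $unicode_name = qr/^\p{Lu}\p{Ll}+$/;    # roughly like <upper><lower>+

sub is_ascii_name   { return $_[0] =~ $ascii_name   ? 1 : 0 }
sub is_unicode_name { return $_[0] =~ $unicode_name ? 1 : 0 }

binmode STDOUT, ':encoding(UTF-8)';
for my $word ('Teodor', 'Émile', 'Łukasz') {
    printf "%s: ascii=%d unicode=%d\n",
        $word, is_ascii_name($word), is_unicode_name($word);
}
```

Only the first name passes the A-through-Z version, while all three pass the property-based one.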

Further maintenance will almost certainly be necessary to handle, for instance, hyphenated names or names with more than two parts (such as Don Quixote de la Mancha). Again, notice how easy it is to isolate the changes to a single rule, or to create a new rule if necessary.

Grammars are a fairly simple concept. They are packages with a private namespace and private subroutines; each subroutine is called a rule. Grammars can inherit from other grammars. That allows the programmer both to reuse other people's code and to write reusable code. The value of such reuse is obvious from the success of the CPAN archive for Perl modules. The Perl 6 grammars use regular expressions in rules, and then those rules can be used inside other rules.
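The package-with-subroutines idea can be imitated loosely in Perl 5 (5.14 or later for the package block syntax). This is only an analogy, not how Perl 6 implements grammars: each "rule" below is a class method that returns a compiled pattern, and the package names NamesSketch and HyphenatedNamesSketch are invented for the example.

```perl
use strict;
use warnings;

# Sketch only: a Perl 5 class standing in for a grammar, with each
# "rule" written as a method that returns a compiled pattern.
package NamesSketch {
    sub singlename { return qr/[A-Z][a-z]+/ }

    sub name {
        my ($class) = @_;
        my $single = $class->singlename;   # one rule reused inside another
        return qr/^$single\s+$single$/;
    }
}

# A derived "grammar" overrides one rule and inherits the rest.
package HyphenatedNamesSketch {
    use parent -norequire, 'NamesSketch';

    sub singlename { return qr/[A-Z][a-z]+(?:-[A-Z][a-z]+)*/ }
}

package main;

for my $grammar ('NamesSketch', 'HyphenatedNamesSketch') {
    my $rule = $grammar->name;
    for my $input ('Larry Wall', 'Jean-Luc Picard') {
        printf "%-22s %-16s %s\n", $grammar, $input,
            ($input =~ $rule ? 'match' : 'no match');
    }
}
```

The derived class changes only singlename, yet its inherited name rule picks up the new definition, which is the kind of isolated change the grammar approach is meant to encourage.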


Comparing Parse::RecDescent with Perl 6 grammars

Those familiar with Parse::RecDescent know it's a powerful tool. It's a Perl 5 module that can produce very powerful grammars with just a little code. How similar are those grammars to Perl 6 grammars? Well, the author of Parse::RecDescent is Damian Conway, who is also heavily involved with Perl 6. It's hardly surprising that a number of ideas that were shown to work well with Parse::RecDescent made it into Perl 6. Some of the syntax slithered through as well.

Parse::RecDescent (P::RD hereafter) uses the new() module method to create a new grammar. Each P::RD grammar becomes an object blessed into the P::RD class, with each rule in the grammar as a method you can use to cause actions to happen. P::RD grammars can associate an action with every rule as an integral part of the parsing process. This is very important. In Perl 5 regular expressions, matching is an event unto itself, and actions are roadkill on the way to the Big City, using an extended syntax that is proven to confuse cats at a distance. This difference makes P::RD much more effective than Perl 5 regular expressions at making something happen when a match is detected.

Perl 6 grammars learned the P::RD lesson that actions are useful, and now actions are first-class citizens. Everywhere a match can be found, an action (code block) can be executed. Even the contents of the thing being matched can be modified! Furthermore, the syntax for those actions is as simple as it is in P::RD.


Listing 4. Parse::RecDescent grammar with actions
# small extract from my cfperl.pl program's global parser

use Parse::RecDescent;

my $parse_global = Parse::RecDescent->new(q{
  input:  blank | comment | class | section

  comment: /^\s*/ '#' { 1; }
  blank: /^\s*$/ { 1; }

  section: /\w+/ ':'
   { $::current_section = $item[1];
     $::current_classes = 'any'; 1;
   }

  class: compound_class '::'
   { $::current_classes = $item{compound_class}; 1; }

  compound_class: /[-!.|\w]+/
});

$parse_global->input("TEXT GOES HERE");

The preceding grammar has a single top-level rule, input, which matches a blank, comment, class, or section. Each of those rules has its own definition, which is either standalone, built on another rule, or both.

Note the actions enclosed in { } braces, like a normal code block. For a section, the actions set the global variable $current_section to the section just matched and reset the $current_classes global variable. For a class, the action sets the global $current_classes variable to the item matched.
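For readers who have not used P::RD, the effect of those actions can be spelled out in plain Perl 5. This is a hand-written approximation under stated assumptions, not the cfperl.pl code: the sub name parse_global_line is hypothetical, and the whitespace skipping that P::RD performs between tokens is imitated with explicit \s* in each pattern.

```perl
use strict;
use warnings;

# Assumed state variables mirroring the globals used by the grammar's actions.
our $current_section = '';
our $current_classes = 'any';

# Classify one line of input and run the matching "action".
# Alternatives are tried in the same order as in the input rule.
sub parse_global_line {
    my ($line) = @_;

    return 'blank'   if $line =~ /^\s*$/;
    return 'comment' if $line =~ /^\s*#/;

    if ($line =~ /^\s*([-!.|\w]+)\s*::/) {   # class: compound_class '::'
        $current_classes = $1;
        return 'class';
    }
    if ($line =~ /^\s*(\w+)\s*:/) {          # section: /\w+/ ':'
        $current_section = $1;
        $current_classes = 'any';            # a new section resets the classes
        return 'section';
    }
    return;                                  # no rule matched
}

for my $line ('# a comment', 'control:', 'linux|solaris::', '') {
    my $kind = parse_global_line($line) // 'no match';
    print "$kind (section=$current_section, classes=$current_classes)\n";
}
```

The grammar version keeps each pattern next to its action, which is exactly what this flat chain of if statements makes harder as the rule count grows.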

How would this look in Perl 6?


Listing 5. Perl 6 translation of Listing 4 grammar
# this may be buggy - it's certainly untested
# every input is known to be one line, without newline characters
grammar Global
{
 rule input { <blank> | <comment> | <class> | <section> }

 rule blank { ^^ \s* $$ }

 rule comment { ^^ \s* \# }

 rule section
 { (\w+) \s* \:
  {
   $::current_section = $1;
   $::current_classes = 'any';
  }
 }

 rule class { (<compound_class>) \s* \:\:
  {
   $::current_classes = $1;
  }
 }

 rule compound_class { <[-!.|\w]>+ }
}


Perl 5 regular expressions

If you are very familiar with Perl 5 regular expressions, skip this section.

The Perl 5 regular expressions are familiar to any Perl 5 programmer. They are marked by the m// operator when matching (the m is optional when / is the delimiter), and by the s/// operator when matching and replacing. The / character can be replaced by other delimiters, and there are special operators that look like regular expressions, but are not (tr///, for example). The point of Perl 5 regular expressions is to say either "look for this," or "look for this and replace it with that."

Seems simple, doesn't it? Well, it usually is. There are modifiers to the regular expressions, both inside and outside the expression: case-insensitivity, lookahead, multiple matches, ignoring whitespace, number of matches, and so on. You can even precompile a regular expression for speed increases.
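A few of those modifiers in action, as a small Perl 5 sketch. The helper count_matches and the sample sentence are invented for the illustration; qr// is the precompilation operator.

```perl
use strict;
use warnings;

# Hypothetical helper: count how many times $re matches in $text (uses /g).
sub count_matches {
    my ($re, $text) = @_;
    my @hits = $text =~ /$re/g;
    return scalar @hits;
}

my $text = "Color me blue; my favorite color is not a colour.";

my $any_color = qr/colou?r/i;        # precompiled, case-insensitive
my $color_me  = qr/color(?= me)/i;   # lookahead: "color" only before " me"

print "case-insensitive matches: ", count_matches($any_color, $text), "\n";  # 3
print "followed by ' me':        ", count_matches($color_me,  $text), "\n";  # 1
print "case-sensitive 'color':   ", count_matches(qr/color/,  $text), "\n";  # 1
```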

Let's ignore the more complex options and just look at the basic syntax of a regular expression. Here are some examples:


Listing 6. Perl 5 regular expression examples
# 1: look for "color"
# matches "green and red are colors"
m/color/

# 2: look for "color" at the beginning of the line
# matches "color me blue" but not "this is my color"
m/^color/

# 3: look for "frog" followed by anything, followed by "jump"
# matches "the frog jumped" but not "jump, you frog"
m/frog.*jump/

# 4: look for a numeric digit, then 1 or more whitespace characters, then another digit
# matches "671 2" but not "numbers 444, 222"
m/\d\s+\d/

# 5: save the first number seen (multiple digits)
# matches "46755332" and captures it in $1; does not match "how do you do?"
m/(\d+)/

# 6: replace "wall" with "plaster" everywhere
s/wall/plaster/g

# 7: replace the FIRST number seen with N
s/\d+/N/

The first thing you notice is that these regular expressions are all on one line. Perl 5 regular expressions can span multiple lines and contain comments, but most programmers don't bother with those things.
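For comparison, here is example 4 from Listing 6 written across several lines with the /x modifier, which permits whitespace and comments inside the pattern. The variable and sub names are made up for this sketch.

```perl
use strict;
use warnings;

# Example 4 from Listing 6, spread out with /x so each piece can be commented.
my $digit_gap_digit = qr/
    \d      # a numeric digit
    \s+     # one or more whitespace characters
    \d      # another digit
/x;

# Hypothetical helper: true if the pattern is found anywhere in the text.
sub has_digit_gap { return $_[0] =~ $digit_gap_digit ? 1 : 0 }

for my $input ('671 2', 'numbers 444, 222') {
    printf "%-18s %s\n", $input, (has_digit_gap($input) ? 'match' : 'no match');
}
```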

Even with multiple lines and comments, it would be generous to call Perl 5's regular expressions dense -- they are more akin to line noise to a beginning Perl programmer. But in that density hides a wealth of information. Damian Conway calls Perl 5 regular expressions "arcane, baroque, inconsistent, and obscure," which shows he is biased in their favor. In my opinion, the density of Perl 5's regular expressions is one of the reasons why the language is so powerful, while at the same time fairly hard to maintain. The challenge for Perl 6 is to preserve the power but streamline the syntax.

If it seems like I'm harping on readability, it's for a good reason: new Perl programmers are intimidated by Perl 5 regular expressions. I have seen it many times in various newsgroups and mailing lists. When a language feature scares its users, it's time for a change.

Perl 5 regular expressions lack more than readability -- they lack structure and reusability. These are attributes commonly associated with higher-level programming constructs, and we'll see how Perl 6's regular expressions address these three problems. The main one, however, is readability.


Perl 6 parsing and lexing

Perl 5 offers no built-in parsing facilities, but it does offer lexing facilities through regular expressions.

A brief tutorial on parsing and lexing is in order. Lexing, to give one definition, is the act of breaking up input into meaningful words (also called lexing tokens or lexemes). It can be more or less than that, depending on the particular implementation, but the general idea is that given some program as plain text, lexing can tell us the purpose and boundaries of each meaningful word in that text.

Regular expressions can encapsulate a range of lexing patterns, from simple fixed fields to fairly complex nested patterns. While lexing does not have to be done through regular expressions, they tend to be a handy vehicle through the jungle of programming and data languages. If humans wrote programs and data in fixed-format text, lexing would be very easy -- but we don't.

When lexing is finished, a meaningless stream of data has been transformed into a sequence of words meaningful to the parser. The parser is the software that takes those meaningful words and builds a parse tree out of them by recognizing their type and purpose.

This is actually a very simple idea in terms of our own knowledge of language. The lexer is the part of our language knowledge that says "this is a sentence; this is punctuation; twenty-three is a single word." The parser is the language knowledge that says "this sentence contains a verb, a subject, a few adjectives, and some pronouns." When parsing is done, the meaningless (to a computer) stream of data becomes something a computer can understand.

Here is an example of lexing and parsing a sentence. It's entirely made up, so please don't try to overanalyze it, but instead observe the separation of duties at each analysis layer.


Listing 7. Lexing and parsing
Sentence:
The sky is blue, wow!

Lexer: [The] [sky] [is] [blue] [%%comma%%] [wow] [%%exclamation%%]

(Note how the punctuation was inserted with special symbols.)

Parser:

Declarative Sentence =
 Specific_Subject + Verb + Adjective + Optional_Exclamation =
  [The sky]         [is]     [blue]          [, wow!]

The parser in this made-up example does not care about the exact contents of the exclamation, because it's tangential to the sentence. It cares very much, however, about whether a subject is specific ("the sky") or not ("sky") because the semantic content of the sentence is changed radically by that difference.
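The same separation of duties can be sketched in a few lines of Perl 5. Everything here is as made up as the listing itself: the sub names lex_sentence and parse_sentence, the tiny vocabulary ("the" as the only article, "is" as the only verb), and the shape of the returned hash are all assumptions for illustration.

```perl
use strict;
use warnings;

# Made-up punctuation symbols, following the listing's %%name%% convention.
my %punctuation = (',' => '%%comma%%', '!' => '%%exclamation%%');

# Lexer: split the text into word and punctuation tokens.
sub lex_sentence {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /\G\s*(?:(\w+)|([,!]))/gc) {
        push @tokens, defined($1) ? $1 : $punctuation{$2};
    }
    return @tokens;
}

# Parser: recognize  Specific_Subject Verb Adjective [Optional_Exclamation].
sub parse_sentence {
    my @tokens = @_;
    return unless @tokens >= 4;
    my ($article, $noun, $verb, $adjective, @rest) = @tokens;
    return unless lc($article) eq 'the';   # "the sky" is specific, "sky" is not
    return unless $verb eq 'is';
    return {
        subject     => "$article $noun",
        verb        => $verb,
        adjective   => $adjective,
        exclamation => [@rest],            # contents are kept but not inspected
    };
}

my @tokens = lex_sentence('The sky is blue, wow!');
print join(' ', map { "[$_]" } @tokens), "\n";

my $tree = parse_sentence(@tokens);
print "subject=[$tree->{subject}] verb=[$tree->{verb}] adjective=[$tree->{adjective}]\n";
```

The lexer knows nothing about sentence structure, and the parser never looks at raw characters; each layer can change without disturbing the other.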

The Parse::RecDescent module employs Perl 5's regular expressions to do lexing seamlessly (whenever a P::RD rule does not use other rules to do matching, it's a safe bet it's a lexer rule), and builds parsing facilities on top. Thus, P::RD on top of Perl 5 is a powerful parser and lexer combination.

As I already mentioned, the ties between Perl 5, P::RD, and Perl 6 run deep. It's no surprise that parsing and lexing in Perl 6 is done in a very similar way to the Perl 5 and P::RD combination described above. The Perl 6 regular expressions have been beefed up and can include other regular expressions as well, which makes them reusable even without grammars. For lexing purposes, this means that the lexer definition of an integer number, for example, can be used to build the lexer definition of a real number and a fraction.
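Perl 5 can approximate this kind of building-block reuse by interpolating qr// objects into larger patterns, though without named rules. A minimal sketch follows; the token definitions are simplified assumptions (for instance, no exponents), and classify_number is a hypothetical helper.

```perl
use strict;
use warnings;

# Build larger token patterns out of a smaller one.
my $integer  = qr/[+-]?\d+/;
my $real     = qr/$integer\.\d+/;        # an integer followed by a fractional part
my $fraction = qr{$integer/[1-9]\d*};    # an integer over a non-zero denominator

# Return the name of the first token type that matches the whole string.
sub classify_number {
    my ($text) = @_;
    return 'real'     if $text =~ /^$real$/;
    return 'fraction' if $text =~ /^$fraction$/;
    return 'integer'  if $text =~ /^$integer$/;
    return 'unknown';
}

print "$_ is ", classify_number($_), "\n" for qw(42 -3.14 22/7 1/0 abc);
```

A change to the integer definition (say, allowing digit separators) would flow into the other two automatically, which is the point of the reuse.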

On top of the Perl 6 regular expressions, and tightly integrated with them, are the Perl 6 grammars. Like P::RD, those grammars use simple rules to do lexing, and then use more complex rules to parse the lexed input. The Perl 6 grammars and regular expressions are, therefore, a complete solution to the lexing and parsing needs of the Perl community, and are based on the proven approach to lexing and parsing seen in Perl 5 when the P::RD module is used.


Other Perl 6 regular expression and grammar tidbits

There are other interesting features in the Perl 6 regular expressions and grammars, which Perl 5 and P::RD do not offer.

Commit directives are very useful for optimizing parsing. A commit directive says that once a given point in the parsing process is reached, nothing except the current rule could match, so there is no reason to back up and try alternatives. For example, if the only word that could follow "color" in a language was "blue," the grammar would specify (in pseudocode) color commit() blue and would not bother parsing "color red." Perl 6 grammars have commit directives that specify that the current word, alternative, grammar rule, or everything up to the match operator is not to be backtracked after it matches. This fine-grained control is not in P::RD, which offers only one level of commit. For large grammars, this is a very useful feature. If this seems confusing, just remember that Perl 6 lets you decide when it's time to commit, and at what level, to the current match.

There is a counterpart to the commit directives, called fail. fail lets Perl 6 grammar rules fail because of a logical condition. A rule that uses fail to filter out invalid days of the month (without taking into account the actual month) would be rule date {:w <month> (\d+): { fail if $1 > 31 } }. The : after (\d+) tells Perl 6 that if "32" fails, it should not backtrack and try "3".
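The backtracking being prevented here can be observed in Perl 5, where the atomic group (?>...) is a rough counterpart: once it has matched, it gives nothing back. The sketch below uses it in place of the Perl 6 colon and performs the range check after the match instead of inside the pattern; parse_date is a hypothetical helper, and "month" is simply any word.

```perl
use strict;
use warnings;

# Backtracking: \d+ gives back a digit so that the trailing \d can match.
my ($greedy) = '32' =~ /^(\d+)\d$/;        # $greedy is "3"

# Atomic group: once (?>\d+) has matched "32" it gives nothing back, so the
# trailing \d has nothing left to match and the whole match fails.
my ($atomic) = '32' =~ /^((?>\d+))\d$/;    # $atomic is undef

# Hypothetical helper: accept "<month> <day>" only when the day is at most 31.
# The range check plays the role of the fail directive.
sub parse_date {
    my ($text) = @_;
    my ($month, $day) = $text =~ /^\s*(\w+)\s+((?>\d+))\s*$/ or return;
    return if $day > 31;
    return ($month, $day);
}

print "greedy: ", $greedy // 'undef', "\n";
print "atomic: ", $atomic // 'undef', "\n";
for my $input ('May 31', 'May 32') {
    my @date = parse_date($input);
    print "$input: ", (@date ? 'accepted' : 'rejected'), "\n";
}
```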

Perl 6 grammars and regular expressions allow non-capturing groups, which group without saving their result. In Perl 5, the regular expression m/(a)(b|c)(d)/ will return "a", then "b" or "c", and then "d". What if we don't care about "b" and "c"? In Perl 5, you can't group an alternation without capturing it unless you use the somewhat esoteric (?:b|c) construct. In Perl 6, the non-capturing brackets [b|c] can be used so that "b" and "c" are not saved. This can be a valuable optimization aid.
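For reference, this is what the two Perl 5 forms return. The helper name captures is invented for the sketch.

```perl
use strict;
use warnings;

# Return the list of captured substrings, or an empty list on failure.
sub captures {
    my ($re, $text) = @_;
    my @found = $text =~ $re;
    return @found;
}

my @with_capture    = captures(qr/(a)(b|c)(d)/,   'abd');   # ('a', 'b', 'd')
my @without_capture = captures(qr/(a)(?:b|c)(d)/, 'acd');   # ('a', 'd')

print "capturing:     @with_capture\n";
print "non-capturing: @without_capture\n";
```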

Perl 6 regular expressions can easily specify "get the Nth time this happens," which was not easily possible in Perl 5.
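In Perl 5, getting the Nth occurrence typically means looping with /g and counting by hand, roughly as in this sketch (nth_match is a hypothetical helper, not a built-in):

```perl
use strict;
use warnings;

# Return the Nth (1-based) match of $re in $text, or undef if there are fewer.
sub nth_match {
    my ($re, $text, $n) = @_;
    my $seen = 0;
    while ($text =~ /($re)/g) {
        return $1 if ++$seen == $n;
    }
    return;
}

print nth_match(qr/\d+/, 'numbers 444, 222 and 7', 2), "\n";   # prints 222
```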

Perl 6 Unicode support is much better than that in Perl 5 when it comes to regular expressions.

Temporary grammar-local variables can be defined and set in grammar actions. These are hypothetical variables that will only have a value if the match succeeds.

And much more...


Conclusion

Check out the online resources for Perl 6 listed in the Resources section. This matters because Perl 6 is still a project in progress.

I hope those who already use Parse::RecDescent in Perl 5 are convinced that the regular expression and grammar jump to Perl 6 will be far less traumatic than would seem warranted by the major version number change. For those who don't use P::RD right now, I hope the article has shown that it's a useful tool worth learning. Even if your interest is more in Perl 6, P::RD has so much in common with Perl 6's grammar facility that learning it is worthwhile.

Finally, I hope you are as excited about the features of Perl 6 as I am, and that your interest leads you to observe or even contribute to the Perl 6 project. Perl 6 is a community rewrite of Perl 5, written by people like you and me -- so I hope you'll join that community if you haven't already.

Thanks go to Damian Conway and Luke Palmer for kindly looking over this article.


Resources
