Building custom language parsers

Solving common parsing problems with ANTLR

Arpan Sen (arpan@syncad.com), Lead Engineer, Synapti Computer Aided Design Pvt Ltd
Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems. You can reach him at arpansen@gmail.com.
(An IBM developerWorks Contributing Author)

Summary:  There are certain things about ANTLR that, if understood, help in faster debugging and provide a fuller appreciation of how the tool works. Learn how to use ANTLR to create smarter parsing solutions.

Date:  11 Mar 2008
Level:  Intermediate

Compilers

Before delving into the sources, a brief review of compiler fundamentals is in order.

Compiler fundamentals

The parser always looks for the next token to match the grammar rules, and the lexer keeps supplying tokens until it hits the end of the input stream. Internally, the ANTLR classes CharScanner (from which TLexer is derived) and TokenStreamSelector both derive from the TokenStream class, and each defines its own version of the nextToken method, which returns the next token from the input stream. The TParser class does not care about the input stream and has no knowledge that the input stream has been switched; it simply keeps calling the nextToken method of its associated TokenStream object (which could be a TLexer or a TokenStreamSelector).
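The relationship described above can be illustrated with a minimal sketch. It does not use the ANTLR headers: TokenStream here is a simplified stand-in for the ANTLR base class, and Token, WordLexer, and collectTokens are hypothetical names introduced only for this illustration. The point is that the parser-like function sees nothing but the base-class interface.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for illustration; these are NOT the ANTLR classes.
struct Token {
    bool eof;
    std::string text;
};

// Plays the role of ANTLR's TokenStream base class.
class TokenStream {
public:
    virtual ~TokenStream() = default;
    virtual Token nextToken() = 0;
};

// Plays the role of TLexer: hands out pre-split words, then an EOF token.
class WordLexer : public TokenStream {
public:
    explicit WordLexer(std::vector<std::string> words)
        : words_(std::move(words)) {}

    Token nextToken() override {
        if (pos_ < words_.size()) return Token{false, words_[pos_++]};
        return Token{true, ""};
    }

private:
    std::vector<std::string> words_;
    std::size_t pos_ = 0;
};

// Plays the role of TParser: it pulls tokens through the base-class
// interface and never learns which concrete stream produced them.
std::vector<std::string> collectTokens(TokenStream& in) {
    std::vector<std::string> seen;
    for (Token t = in.nextToken(); !t.eof; t = in.nextToken())
        seen.push_back(t.text);
    return seen;
}
```

Because collectTokens takes a TokenStream reference, any class derived from TokenStream, whether a lexer or a selector that forwards to a lexer, could be passed in without changing the function.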


Understand the code

With this understanding, take another look at the code in Listing 4. At the beginning, the parser is initialized with a TokenStreamSelector object, which in turn is initialized with a TLexer object that's created to parse the first source file. On encountering an INCLUDE directive, a new TLexer object is created to parse the included file. The previous TLexer (and its input stream) is pushed onto a stack maintained by the TokenStreamSelector object, and the newly created TLexer is attached to the TokenStreamSelector to provide the next token. (The push method does both.)
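The stack behavior of the selector can be sketched as follows. This is a simplified model, not the ANTLR implementation: Selector stands in for TokenStreamSelector, WordLexer stands in for TLexer, and demoPush is a hypothetical driver. In this sketch the switch back to the first stream is done by hand with pop; the automatic switch at end of input is a separate step.

```cpp
#include <cstddef>
#include <stack>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for illustration; these are NOT the ANTLR classes.
struct Token { bool eof; std::string text; };

class TokenStream {
public:
    virtual ~TokenStream() = default;
    virtual Token nextToken() = 0;
};

// Stand-in for TLexer: hands out pre-split words, then an EOF token.
class WordLexer : public TokenStream {
public:
    explicit WordLexer(std::vector<std::string> words)
        : words_(std::move(words)) {}
    Token nextToken() override {
        if (pos_ < words_.size()) return Token{false, words_[pos_++]};
        return Token{true, ""};
    }
private:
    std::vector<std::string> words_;
    std::size_t pos_ = 0;
};

// Stand-in for TokenStreamSelector: forwards nextToken to the current
// stream and remembers suspended streams on a stack.
class Selector : public TokenStream {
public:
    void select(TokenStream* s) { input_ = s; }
    // Suspend the current stream and switch to a new one.
    void push(TokenStream* s) { suspended_.push(input_); input_ = s; }
    // Resume the most recently suspended stream.
    void pop() { input_ = suspended_.top(); suspended_.pop(); }
    Token nextToken() override { return input_->nextToken(); }
private:
    TokenStream* input_ = nullptr;
    std::stack<TokenStream*> suspended_;
};

// Hypothetical driver: read one token from the main stream, switch to an
// included stream as an INCLUDE directive would, then switch back by hand.
std::vector<std::string> demoPush() {
    WordLexer mainLexer(std::vector<std::string>{"a", "b"});
    WordLexer included(std::vector<std::string>{"x"});
    Selector sel;
    sel.select(&mainLexer);

    std::vector<std::string> out;
    out.push_back(sel.nextToken().text);  // "a" from the main stream
    sel.push(&included);                  // INCLUDE seen: switch streams
    out.push_back(sel.nextToken().text);  // "x" from the included stream
    sel.pop();                            // manual switch back
    out.push_back(sel.nextToken().text);  // "b" from the main stream
    return out;
}
```

The caller of nextToken never changes; only the stream the selector forwards to does.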

The uponEOF method

The final piece of the puzzle is the uponEOF method that you've defined as part of the lexer. If you look at the generated ANTLR code, you'll see that you're redefining the uponEOF method for the TLexer class; the method has an empty body in the CharScanner class from which TLexer is derived. It is called from the nextToken method of the TLexer class when the lexer hits the end of an input stream. (Check out the ANTLR-generated code to understand this in further detail.)

So, what does uponEOF do? It switches back to the previous input stream by calling the pop method when the lexer hits the end of the current input stream. Is that all? No. Remember that this method is called from within the lexer, which in turn is called by the parser to provide the next token. So, you must arrange for TokenStreamSelector::nextToken to return the next token from the switched stream.

For this purpose, uponEOF then calls the retry method on the TokenStreamSelector object, which internally throws a TokenStreamRetryException that gets caught in TokenStreamSelector::nextToken. Here's the call stack:

TokenStreamSelector::nextToken
TLexer::nextToken
TLexer::uponEOF

Here's the code for TokenStreamSelector::nextToken:

RefToken TokenStreamSelector::nextToken()
{
  // keep looking for a token until you don't
  // get a retry exception
  for (;;) {
    try {
      return input->nextToken();
    }
    catch (TokenStreamRetryException& /*r*/) {
      // just retry "forever"
    }
  }
}

You don't need to do anything to the nextToken method; your responsibility is only to properly define the uponEOF method for TLexer, create the stream-selector object, and attach it to the parser. The beauty of this approach is that, by the time the loop retries, input in the above code refers to the switched stream, so the parser is seamlessly provided the next token.
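To see the pop-and-retry mechanism end to end, here is a self-contained sketch that mimics it without the ANTLR headers. All names are stand-ins: RetryException plays the role of TokenStreamRetryException, Selector plays the role of TokenStreamSelector, and IncludeLexer plays the role of TLexer. The hasSuspended helper and the demoInclude driver are hypothetical additions for this illustration and are not part of ANTLR.

```cpp
#include <cstddef>
#include <stack>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-ins for illustration; these are NOT the ANTLR classes.
struct Token { bool eof; std::string text; };
struct RetryException {};  // stand-in for TokenStreamRetryException

class TokenStream {
public:
    virtual ~TokenStream() = default;
    virtual Token nextToken() = 0;
};

// Stand-in for TokenStreamSelector.
class Selector : public TokenStream {
public:
    void select(TokenStream* s) { input_ = s; }
    void push(TokenStream* s) { suspended_.push(input_); input_ = s; }
    void pop() { input_ = suspended_.top(); suspended_.pop(); }
    bool hasSuspended() const { return !suspended_.empty(); }  // hypothetical helper
    void retry() { throw RetryException{}; }

    // Same shape as the loop above: ask the current stream for a token,
    // and ask again whenever a retry is signaled.
    Token nextToken() override {
        for (;;) {
            try {
                return input_->nextToken();
            } catch (const RetryException&) {
                // input_ now points at the resumed stream; loop again
            }
        }
    }

private:
    TokenStream* input_ = nullptr;
    std::stack<TokenStream*> suspended_;
};

// Stand-in for TLexer with a user-defined uponEOF.
class IncludeLexer : public TokenStream {
public:
    IncludeLexer(std::vector<std::string> words, Selector& sel)
        : words_(std::move(words)), sel_(sel) {}

    Token nextToken() override {
        if (pos_ < words_.size()) return Token{false, words_[pos_++]};
        uponEOF();               // may throw to resume the including stream
        return Token{true, ""};  // real end of input
    }

private:
    // At end of input, resume the including stream if there is one.
    void uponEOF() {
        if (sel_.hasSuspended()) {
            sel_.pop();
            sel_.retry();  // throws; unwinds to Selector::nextToken
        }
    }

    std::vector<std::string> words_;
    std::size_t pos_ = 0;
    Selector& sel_;
};

// Hypothetical driver: the main stream yields "a", an include is pushed,
// and the remaining tokens are read through the selector alone.
std::vector<std::string> demoInclude() {
    Selector sel;
    IncludeLexer mainLexer(std::vector<std::string>{"a", "b"}, sel);
    IncludeLexer included(std::vector<std::string>{"x", "y"}, sel);
    sel.select(&mainLexer);

    std::vector<std::string> out;
    out.push_back(sel.nextToken().text);  // "a"
    sel.push(&included);                  // as if INCLUDE were seen
    for (Token t = sel.nextToken(); !t.eof; t = sel.nextToken())
        out.push_back(t.text);
    return out;
}
```

In this model, demoInclude collects a, x, y, b in that order: when the included stream runs out, uponEOF pops the selector and throws, the selector's loop catches the exception, and the next request goes to the resumed main stream. The reading loop itself contains no stream-switching logic, which is the property the tutorial highlights for the parser.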
