Building custom language parsers

Solving common parsing problems with ANTLR

Arpan Sen (arpan@syncad.com), Lead Engineer, Synapti Computer Aided Design Pvt Ltd
Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems. You can reach him at arpansen@gmail.com.
(An IBM developerWorks Contributing Author)

Summary:  There are certain things about ANTLR that, if understood, help in faster debugging and provide a fuller appreciation of how the tool works. Learn how to use ANTLR to create smarter parsing solutions.

Date:  11 Mar 2008
Level:  Intermediate

Common lexer classes

This section briefly looks into some of the more common exception classes typically used in a lexer.

Exceptions for rogue or unexpected characters

The antlr::MismatchedCharException exception (derived from antlr::RecognitionException) is thrown when the CharScanner::match() method encounters a character other than the one it expects. Consider this snippet:

int a#;
#include "incl.h" 
int c;

Because the lexer did not expect # at that position, it throws a MismatchedCharException. An error message such as this appears:

test.c:1:6: expecting semicolon, found '#;
#include "incl.h"
err
int c;

If the lexer finds an unexpected character while it is deciding on the token type, it throws the antlr::NoViableAltForCharException exception (derived from antlr::RecognitionException). You can verify this with the input int 5a,b;. The definition for ID in your grammar does not include digits, so no lexer rule can begin with 5.
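
The difference between the two lexer exceptions can be illustrated with a small stand-alone model. The sketch below does not use the ANTLR runtime: ToyLexer, MismatchedChar, NoViableAltForChar, and classify are hypothetical names, and the '#' rule that spells out "#include" is an assumption made for illustration. The lexer throws NoViableAltForChar when no rule can start with the current character, and MismatchedChar when it has already committed to a rule and the next character is not the expected one.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the two ANTLR lexer exceptions (not the real classes).
struct MismatchedChar : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct NoViableAltForChar : std::runtime_error {
    using std::runtime_error::runtime_error;
};

class ToyLexer {
public:
    explicit ToyLexer(std::string input) : in_(std::move(input)) {}

    // Returns the token texts, or throws one of the two exceptions above.
    std::vector<std::string> tokenize() {
        std::vector<std::string> out;
        while (pos_ < in_.size()) {
            const unsigned char c = static_cast<unsigned char>(in_[pos_]);
            if (std::isspace(c)) { ++pos_; continue; }
            if (std::isalpha(c)) {
                out.push_back(id());
            } else if (c == ',' || c == ';') {
                out.push_back(std::string(1, in_[pos_]));
                ++pos_;
            } else if (c == '#') {
                out.push_back(include());      // committed to the '#' rule
            } else {
                // No rule can start with this character.
                throw NoViableAltForChar(std::string("unexpected char: ") + in_[pos_]);
            }
        }
        return out;
    }

private:
    // Inside a rule: the next character must be `expected`.
    void match(char expected) {
        if (pos_ >= in_.size() || in_[pos_] != expected)
            throw MismatchedChar(std::string("expecting '") + expected + "'");
        ++pos_;
    }

    // ID: a letter followed by letters or digits.
    std::string id() {
        assert(pos_ < in_.size());             // caller has seen a letter here
        const std::size_t start = pos_;
        while (pos_ < in_.size() &&
               std::isalnum(static_cast<unsigned char>(in_[pos_]))) ++pos_;
        return in_.substr(start, pos_ - start);
    }

    // Assumed rule: '#' must be followed by the rest of "#include".
    std::string include() {
        for (char c : std::string("#include")) match(c);
        return "#include";
    }

    std::string in_;
    std::size_t pos_ = 0;
};

// Reports which kind of failure (if any) an input produces.
std::string classify(const std::string& input) {
    try {
        ToyLexer(input).tokenize();
        return "ok";
    } catch (const MismatchedChar&) {
        return "mismatched";
    } catch (const NoViableAltForChar&) {
        return "no-viable-alt";
    }
}
```

In this model, classify("int 5a,b;") reports a no-viable-alternative failure because nothing starts with a digit, while classify("int a#;") reports a mismatch because the '#' rule has started and the character after it is not the expected one.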


Why bother with the exceptions?

So, why bother with the exceptions when ANTLR's default scheme already handles them well? One typical reason is that you might need a more tool-specific message than the one ANTLR provides. For example, when you report errors while parsing a language, it is a good idea to cite the section of the language standard that the user input has violated.

However, to understand the subtler reason why it sometimes makes sense to override the default scheme, you must look at the generated code. Using the code from Listing 1 and Listing 2, revisit the error file you used earlier to trigger NoViableAltException in the parser, along with its output when default ANTLR exception handling is on:

Error File: 
err
int a,b;
#include "incl.h"
int c;

Output: 
test.c:1:1: unexpected token: err
decl x
decl z
decl y

Notice that no messages appear for the declarations of a, b, and c. To understand why, look at the generated code for startRule:

void TParser::startRule() {
   try {      // for error handling
     ...        // usual parser code
   }
   catch (ANTLR_USE_NAMESPACE(antlr)RecognitionException& ex) {
      reportError(ex);
      consume();
      consumeUntil(_tokenSet_0);
   }
}

The interesting part is the catch block. All three methods (reportError, consume, and consumeUntil) are defined in the Parser class from which TParser is derived. The reportError method prints the message, such as "unexpected token" or "unexpected end of file." The consume method does what its name suggests: it consumes the current token (in this case, the ID token created for err).

Finally, the consumeUntil method keeps discarding tokens until it reaches EOF. While the tokens are being discarded, the lexer hits the INCLUDE token, which in turn prints the declarations for x, z, and y. Clearly, you need a way to resume grammatical checks on the input stream after skipping the token for err.
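
The effect of the stop set passed to consumeUntil can be seen in a small stand-alone model. This is a sketch and not ANTLR's implementation: TokenCursor, sampleTokens, and remainingAfterRecovery are hypothetical names, the token list is a hand-built approximation of the error file with the #include line left out, and the assumption (taken from the description above) is that the default stop set holds only EOF.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

enum class TokType { Int, Id, Comma, Semi, Eof };
struct Tok { TokType type; std::string text; };

// Hypothetical cursor over a token vector that must end with an Eof token.
struct TokenCursor {
    std::vector<Tok> toks;
    std::size_t pos = 0;

    TokType la() const {
        assert(pos < toks.size());
        return toks[pos].type;
    }

    // Skip exactly one token, but never step past Eof.
    void consume() {
        if (la() != TokType::Eof) ++pos;
    }

    // Skip tokens until the lookahead is in `stop` or Eof is reached.
    void consumeUntil(const std::set<TokType>& stop) {
        while (la() != TokType::Eof && stop.count(la()) == 0) consume();
    }
};

// Tokens for:  err  int a,b;  int c;
std::vector<Tok> sampleTokens() {
    return {
        {TokType::Id, "err"},
        {TokType::Int, "int"}, {TokType::Id, "a"}, {TokType::Comma, ","},
        {TokType::Id, "b"}, {TokType::Semi, ";"},
        {TokType::Int, "int"}, {TokType::Id, "c"}, {TokType::Semi, ";"},
        {TokType::Eof, ""}
    };
}

// Runs the recovery steps and returns how many tokens (excluding Eof)
// are still available for parsing afterwards.
std::size_t remainingAfterRecovery(const std::set<TokType>& stop) {
    TokenCursor c;
    c.toks = sampleTokens();
    c.consume();           // drop the offending token (err)
    c.consumeUntil(stop);  // resynchronize on the stop set
    return c.toks.size() - 1 - c.pos;
}
```

With a stop set that holds only Eof, nothing is left for the parser after recovery, which matches the missing messages for a, b, and c. A narrower stop set, such as one holding the int keyword, leaves both declarations intact.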

Turn the defaultErrorHandler option off, then add the following exception handler to startRule:

startRule  :  ( decl )+
  ;
exception
catch [ANTLR_USE_NAMESPACE(antlr)NoViableAltException& e]
    {
    reportError(e);
    consume();
    consumeUntil(INT); // keep looking till you find an INT
    startRule(); // re-run the rule 
    }

With this snippet, the output now looks a lot more reasonable:

test.c:1:1: unexpected token: err
decl a
decl b
decl x
decl z
decl y
decl c

So, what have you done here? After you consume the error token, you keep skipping tokens until you find an int declaration. When you find one, you re-run startRule to parse the rest of the input stream. Typically, user-defined error handlers are variants of this strategy, and their purpose is to provide a more verbose error recovery.
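
The resynchronize-and-re-run strategy can be traced in a stand-alone toy parser. This is a minimal sketch and not ANTLR-generated code: ToyParser, NoViableAlt, and parseSample are hypothetical names, the grammar is assumed to be decl : INT ID (COMMA ID)* SEMI, the token list leaves out the #include line, and every parse failure is reported through the single NoViableAlt type for brevity. The end-of-input check before the re-run is an addition in this sketch to keep the recursion finite.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

enum class TokType { Int, Id, Comma, Semi, Eof };
struct Tok { TokType type; std::string text; };

// Hypothetical stand-in for antlr::NoViableAltException.
struct NoViableAlt : std::runtime_error {
    using std::runtime_error::runtime_error;
};

class ToyParser {
public:
    std::vector<std::string> log;  // collected messages, in order

    explicit ToyParser(std::vector<Tok> toks) : toks_(std::move(toks)) {
        assert(!toks_.empty() && toks_.back().type == TokType::Eof);
    }

    // startRule : ( decl )+ ;  with a handler modeled on the one in the text
    void startRule() {
        try {
            if (la() != TokType::Int)
                throw NoViableAlt("unexpected token: " + toks_[pos_].text);
            while (la() == TokType::Int) decl();
        } catch (const NoViableAlt& e) {
            log.push_back(e.what());                // reportError(e)
            consume();                              // drop the offending token
            consumeUntil(TokType::Int);             // look for the next INT
            if (la() != TokType::Eof) startRule();  // re-run the rule
        }
    }

private:
    TokType la() const { return toks_[pos_].type; }

    void consume() {
        if (la() != TokType::Eof) ++pos_;           // never step past Eof
    }

    void consumeUntil(TokType stop) {
        while (la() != TokType::Eof && la() != stop) consume();
    }

    // Matches one token of type t and returns its text.
    std::string expect(TokType t) {
        if (la() != t)
            throw NoViableAlt("unexpected token: " + toks_[pos_].text);
        std::string text = toks_[pos_].text;
        consume();
        return text;
    }

    // decl : INT ID ( COMMA ID )* SEMI ;
    void decl() {
        expect(TokType::Int);
        log.push_back("decl " + expect(TokType::Id));
        while (la() == TokType::Comma) {
            consume();
            log.push_back("decl " + expect(TokType::Id));
        }
        expect(TokType::Semi);
    }

    std::vector<Tok> toks_;
    std::size_t pos_ = 0;
};

// Parses the tokens for:  err  int a,b;  int c;
std::vector<std::string> parseSample() {
    ToyParser p({
        {TokType::Id, "err"},
        {TokType::Int, "int"}, {TokType::Id, "a"}, {TokType::Comma, ","},
        {TokType::Id, "b"}, {TokType::Semi, ";"},
        {TokType::Int, "int"}, {TokType::Id, "c"}, {TokType::Semi, ";"},
        {TokType::Eof, ""}
    });
    p.startRule();
    return p.log;
}
```

Calling parseSample() yields the error message for err followed by the declarations of a, b, and c, because the handler stops skipping at the first int keyword instead of running to the end of the input.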
