Building custom language parsers

Solving common parsing problems with ANTLR

Arpan Sen (arpan@syncad.com), Lead Engineer, Synapti Computer Aided Design Pvt Ltd
Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems. You can reach him at arpansen@gmail.com.
(An IBM developerWorks Contributing Author)

Summary:  There are certain things about ANTLR that, if understood, help in faster debugging and provide a fuller appreciation of how the tool works. Learn how to use ANTLR to create smarter parsing solutions.

Date:  11 Mar 2008
Level:  Intermediate

Error handling

The next topic is the error-handling strategy in the parser.

Recover from errors in user code

Error handling can easily be deemed the differentiator between an amateur and a professional compiler developer. Users typically expect a compiler to parse the entire input stream as opposed to exiting on the first occurrence of an error in the user code. For the compiler, this means that it must recover from the error after it encounters an erroneous token, and the parser must keep consuming the tokens until it reaches a known state.
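
The idea of consuming tokens until the parser reaches a known state can be sketched in plain C++ without ANTLR. Everything in this sketch is an assumption made for illustration: the Tok enumeration, the consumeUntil and parseAll helpers, and the toy rule "int ID ;" are hypothetical, and the semicolon is simply assumed to be a safe synchronization point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical token kinds for a toy declaration language (not ANTLR types).
enum class Tok { Int, Id, Semi, Eof };

struct ParseSummary {
    int declarations = 0;  // well-formed "int ID ;" declarations seen
    int errors = 0;        // syntax errors reported
};

// Advance pos until it points at a token of kind sync, or at Eof.
// This is the "keep consuming until a known state" step.
std::size_t consumeUntil(const std::vector<Tok>& toks, std::size_t pos, Tok sync) {
    while (toks[pos] != Tok::Eof && toks[pos] != sync) ++pos;
    return pos;
}

// Parse a sequence of "int ID ;" declarations. On a bad token, count the
// error, resynchronize on the next ';', and continue instead of exiting.
ParseSummary parseAll(const std::vector<Tok>& toks) {
    assert(!toks.empty() && toks.back() == Tok::Eof);  // stream must end in Eof
    const Tok expected[] = {Tok::Int, Tok::Id, Tok::Semi};
    ParseSummary out;
    std::size_t pos = 0;
    while (toks[pos] != Tok::Eof) {
        bool ok = true;
        for (Tok want : expected) {
            if (toks[pos] != want) { ok = false; break; }
            ++pos;
        }
        if (ok) { ++out.declarations; continue; }
        ++out.errors;
        pos = consumeUntil(toks, pos, Tok::Semi);
        if (toks[pos] == Tok::Semi) ++pos;  // step past the sync token
    }
    return out;
}
```

Because the loop resumes after each synchronization point, a single pass over the input can report several independent errors while still recognizing the declarations that are well formed.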

ANTLR has several built-in exceptions to ease the programmer's burden. But before that, here's the general form of the exception handler.

rule :  <..grammar rules..> ; 
          exception [label] 
          catch [antlr::ANTLRException& e] {
             // do the needed error handling
          }

The syntax resembles C++ exception handling. You can attach an exception handler to a grammar rule as a whole, to an individual alternative of a rule, or to a labeled element. The design follows from the generated code: ANTLR implements each grammar rule as a method of the generated C++ parser class. Each such method contains a try...catch block that implements the exception handling, and the handler code is copied almost verbatim from the grammar file.


Manipulate the exception class hierarchy

By default, exception handling is turned on when ANTLR generates code: the generated code contains try...catch blocks and the code needed to handle exceptions as and when they are thrown. If you want a custom exception handler, you can create one by using and extending the available exception class hierarchy. To turn default error handling off, add defaultErrorHandler=false; to the options section of the parser:

class TParser extends Parser;
options { 
  k=2;
  defaultErrorHandler=false;
}

Note that irrespective of whether the defaultErrorHandler option is present, the code for I/O exceptions (TokenStreamIOException) is always generated. If defaultErrorHandler is false and the grammar specifies no exception handlers of its own, the exception propagates all the way back to the calling code. You must then provide try...catch blocks in the calling routine, as in the following snippet:

int main(int argc,char** argv)
{
  try {
    … // usual code to set up lexer/parser
    parser->startRule();
  }
  catch (ANTLR_USE_NAMESPACE(antlr)RecognitionException& e) {
    // report or recover from the syntax error
  }
  catch (ANTLR_USE_NAMESPACE(antlr)TokenStreamException& e) {
    // report or recover from the token-stream error
  }
  return 0;
}

All ANTLR exceptions are derived from the ANTLRException class. The class hierarchy is shown in Figure 1.


Figure 1. ANTLR exception hierarchy
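
One practical consequence of a single-rooted hierarchy is that a handler for a base class also catches every class derived from it, so the order of catch blocks matters. The sketch below uses stand-in classes in a hypothetical toy namespace that mirror only the exception names mentioned in this section; they are not the real ANTLR classes, and caughtBy and checkHandlers are illustrative helpers.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Stand-in classes that mirror only the names used in this section.
// A real program would include the ANTLR headers instead.
namespace toy {
struct ANTLRException : std::runtime_error {
    explicit ANTLRException(const std::string& msg) : std::runtime_error(msg) {}
};
struct RecognitionException : ANTLRException {
    using ANTLRException::ANTLRException;
};
struct MismatchedTokenException : RecognitionException {
    using RecognitionException::RecognitionException;
};
struct NoViableAltException : RecognitionException {
    using RecognitionException::RecognitionException;
};
struct TokenStreamException : ANTLRException {
    using ANTLRException::ANTLRException;
};
struct TokenStreamIOException : TokenStreamException {
    using TokenStreamException::TokenStreamException;
};
struct TokenStreamRetryException : TokenStreamException {
    using TokenStreamException::TokenStreamException;
};
}  // namespace toy

// Throw an exception of type E and report which handler catches it.
// Handlers are tried top to bottom, so the most general one goes last.
template <typename E>
std::string caughtBy() {
    try {
        throw E("simulated failure");
    } catch (const toy::RecognitionException&) {
        return "RecognitionException";
    } catch (const toy::TokenStreamException&) {
        return "TokenStreamException";
    } catch (const toy::ANTLRException&) {
        return "ANTLRException";
    }
}

// Expected outcomes, written as assertions.
void checkHandlers() {
    assert(caughtBy<toy::MismatchedTokenException>() == "RecognitionException");
    assert(caughtBy<toy::NoViableAltException>() == "RecognitionException");
    assert(caughtBy<toy::TokenStreamIOException>() == "TokenStreamException");
    assert(caughtBy<toy::TokenStreamRetryException>() == "TokenStreamException");
}
```

If the handler for the root class came first, it would intercept every exception and the more specific handlers would never run.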

One of the first things to understand about ANTLR exception handling is that the mechanism is not restricted to the parser. The lexer uses the same scheme; earlier in this tutorial, you made good use of TokenStreamRetryException. The parser and its grammar rules use the RecognitionException and TokenStreamException classes, while the lexer uses all three variants of the exception. The sections that follow describe the two most common exceptions that the parser uses in the default mode.

antlr::RecognitionException::MismatchedTokenException

The antlr::RecognitionException::MismatchedTokenException exception is thrown when the parser finds a different token from what was expected. Consider the grammar from Listing 1, and now consider the following faulty input:

int err
int a,b;
#include "incl.h"
int c;

Note: Assume that incl.h exists in the same directory and has a single line containing int x, z, y.

Instead of encountering a semicolon (the token SEMI), the parser encounters an integer declaration (the INT token from the lexer) after it has processed the token for err. This is clearly a case of a mismatched token. Accordingly, you get the following ANTLR output:

decl err
test.c:2:1: expecting SEMI, found 'int'
decl x
decl z
decl y
decl c

You can make this error message slightly more descriptive by setting the paraphrase option on the lexer rule for the token. Here's the lexer rule for SEMI rewritten:

SEMI
  options { 
    paraphrase="semicolon";
  }
    :  ';' ;

The output would now read:

decl err
test.c:2:1: expecting semicolon, found 'int'
decl x
decl z
decl y
decl c

antlr::RecognitionException::NoViableAltException

The antlr::RecognitionException::NoViableAltException exception is thrown when the parser must choose among the alternatives of a grammar rule and the next token does not begin any of them. It resembles the mismatched-token exception in that the parser has hit an unexpected token. The difference is one of position: NoViableAltException is thrown when the unexpected token is the first in the series of tokens that the rule would match, while MismatchedTokenException is thrown when a later token in the series is not the expected one. Consider a small variant of the errant input discussed earlier:
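
The distinction can be illustrated with a small hand-written recursive-descent parser. This is only a sketch under stated assumptions: ToyParser, the Tok enumeration, the NoViableAlt and MismatchedToken types, and the two toy rules in the comments are hypothetical stand-ins, not ANTLR-generated code and not the grammar from Listing 1.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical token kinds and exception types, for illustration only.
enum class Tok { Int, Long, Id, Comma, Semi, Eof };

struct NoViableAlt : std::runtime_error {
    NoViableAlt() : std::runtime_error("unexpected token") {}
};
struct MismatchedToken : std::runtime_error {
    MismatchedToken() : std::runtime_error("mismatched token") {}
};

class ToyParser {
public:
    explicit ToyParser(std::vector<Tok> toks) : toks_(std::move(toks)) {
        assert(!toks_.empty() && toks_.back() == Tok::Eof);  // must end in Eof
    }

    // startRule : ( decl )* EOF ;
    void startRule() {
        while (la() != Tok::Eof) decl();
    }

private:
    Tok la() const { return toks_[pos_]; }  // one token of lookahead

    // A specific token is required here; anything else is a mismatch.
    void match(Tok want) {
        if (la() != want) throw MismatchedToken();
        ++pos_;
    }

    // decl : ( INT | LONG ) ID ( COMMA ID )* SEMI ;
    void decl() {
        // Decision point: the lookahead must begin one of the alternatives.
        if (la() != Tok::Int && la() != Tok::Long) throw NoViableAlt();
        ++pos_;
        match(Tok::Id);
        while (la() == Tok::Comma) {
            match(Tok::Comma);
            match(Tok::Id);
        }
        match(Tok::Semi);
    }

    std::vector<Tok> toks_;
    std::size_t pos_ = 0;
};

// Run the parser and name the first error it raises, or "ok" if none.
std::string firstError(const std::vector<Tok>& toks) {
    try {
        ToyParser parser(toks);
        parser.startRule();
        return "ok";
    } catch (const NoViableAlt&) {
        return "NoViableAlt";
    } catch (const MismatchedToken&) {
        return "MismatchedToken";
    }
}
```

In this sketch, a token sequence that starts with an identifier fails at the decision point in decl, whereas a sequence such as Int, Id, Int fails inside match because a specific token (the semicolon) was required at that position.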

err
int a,b;
#include "incl.h"
int c;

In this case, the parser's startRule was expecting a token of type INT or LONG but received a token of neither type. No alternative of the rule is viable, so the NoViableAltException exception is thrown. The error message printed is:

test.c:1:1: unexpected token: err
decl x
decl z
decl y

Similarly, an empty input file also results in NoViableAltException being thrown, because the EOF token matches none of the alternatives in startRule. The message this time is:

test.c:1:1: unexpected end of file
