Building custom language parsers

Solving common parsing problems with ANTLR

There are certain things about ANTLR that, if understood, help in faster debugging and provide a fuller appreciation of how the tool works. Learn how to use ANTLR to create smarter parsing solutions.

Arpan Sen, Lead Engineer, Synapti Computer Aided Design Pvt Ltd

Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems. You can reach him at arpansen@gmail.com.



11 March 2008

Before you start

Learn what to expect from this tutorial and how to get the most out of it.

About this tutorial

If you're in the business of developing a parser or compiler (considered by most a somewhat black art, indeed), there are several technical issues you must address. In recent times, ANother Tool for Language Recognition (ANTLR) has been gaining a lot of traction as the tool of choice for creating language parsers. This tutorial takes a close look at some of the typical issues you would encounter while creating a custom parser and shows how to solve those problems using ANTLR.

Objectives

In this tutorial, you learn how to create custom language parsers with the help of ANTLR. In addition, you learn how to address common issues that arise in compiler and parser creation.

Prerequisites

To fully appreciate this tutorial, you must have some familiarity with both language parsing and with ANTLR. All code in this tutorial has been tested with ANTLR version 2.7.2 and compiled with the GNU Compiler Collection (GCC) version 3.4.4.

System requirements

To run the examples in this tutorial, you need the following elements:


Getting the facts about ANTLR

Get the background information you need to work with ANTLR.

Essential ANTLR facts

There are certain things about ANTLR that, if understood, help in faster debugging and provide a fuller appreciation of how the tool works. With this knowledge, you'll also be able to design smarter parsing solutions. Here are two absolutely essential ANTLR facts that you must know.

LL(k) parsing and the look-ahead strategy

ANTLR is an LL(k) parser—that is, it follows a top-down parsing algorithm with k tokens of look-ahead, parsing the input stream from left to right. How do you decide how many tokens of look-ahead are needed in ANTLR? Consider a simple grammar that has to support the greater-than (>) and greater-than-or-equal-to (>=) operators. The lexer must decide which token to create for the parser when it hits the > character in the input stream. There are two ways to handle this situation:

  • Using the $setType predefined ANTLR function:
    tokens { GTE; } // imaginary token; no lexer rule matches it directly
    GT : '>' ('=' {$setType(GTE);})? ;

    The trick here is to make the lexer check for the next character when it hits >. If the next character is =, the action changes the token's type to GTE using the setType routine before the token is handed to the parser.
  • Using k=2; in the options section of the lexer: This specifies that the lexer may look at the next two characters of the input stream to determine which alternative to choose. The extra look-ahead is used only when it is actually necessary for disambiguation; for tokens that are unambiguous after a single character, it is not consulted. However, exercise caution in deciding the look-ahead value: Generally, a larger k value results in a slower lexer.

Code generation

ANTLR takes in a grammar file (typically a file with a .g file extension) and generates C++ or Java™ code. A casual look into the generated code reveals that it has created separate classes for both the lexer and the parser. Every grammar rule is now implemented as a C++ or Java method that is part of the parser class, and every lexical identifier is likewise a method of the lexer class. Consider the following code snippet, which illustrates this fact for a single grammar rule and lexical identifier:

options {
  language="Cpp"; // implies that code is generated in C++
  namespace="preprocessor";
}
class TParser extends Parser;
declaration : INT ID;

class TLexer extends Lexer;
options { k=2;} 
tokens { 
  INT="int";
}

SEMI : ';' ;
ID :  ('a'..'z')+ ;

Two classes are generated for the above snippet: TParser and TLexer. The generated code has a method called TParser::declaration corresponding to the grammar rule declaration; likewise, the TLexer::mSEMI is generated for the lexer rule SEMI. Let's look into the generated code for TLexer:

void TLexer::mSEMI(bool _createToken) {
        int _ttype; ANTLR_USE_NAMESPACE(antlr)RefToken _token; int _begin=text.length();
        _ttype = SEMI;
        int _saveIndex;

        match(static_cast<unsigned char>(';'));
        // .. some more ANTLR processing .. 
        _returnToken = _token;
        _saveIndex=0;
}

Likewise, for the TParser::declaration:

void TParser::declaration() 
  {
  try {      // for error handling
     match(INT);
     match(ID);
  }
  catch (ANTLR_USE_NAMESPACE(antlr)RecognitionException& ex) {
       reportError(ex);
       consume();
       consumeUntil(_tokenSet_0);
      }
}

Run the above example, and look into the generated sources. In the TLexer::mSEMI method, the match routine is of particular significance. The code for match is predefined inside ANTLR sources; it's a part of the CharScanner class from which TLexer is derived in the generated code. Simple enough to understand, it just validates for the existence of a specific character in the input stream. Here's the code for match from the ANTLR sources:

virtual void match(int c) // Part of CharScanner class
  {
  int la_1 = LA(1);
  if ( la_1 != c )
    throw MismatchedCharException(la_1, c, false, this);
  consume();
  }

(Note the exception that this code is throwing in case of a mismatch. This concept will be used later in discussing the error-handling strategies.)

The code for TParser::declaration is fairly self-explanatory: You match the two token types and, if the match fails, throw an exception. Note that INT and ID, originally defined in the lexer, correspond to integer token types in the generated code. Here's the code for the match routine in the parser. Note that the LA (look-ahead) method in the parser looks at the token type, not at any character or character stream:

void Parser::match(int t) // t -> token type 
{
  if ( DEBUG_PARSER )  {
    traceIndent();
    ANTLR_USE_NAMESPACE(std)cout << "enter match(" << t << ") with LA(1)="
        << LA(1) << ANTLR_USE_NAMESPACE(std)endl;
  }
  if ( LA(1)!=t ) {
    if ( DEBUG_PARSER ) {
      traceIndent();
      ANTLR_USE_NAMESPACE(std)cout << "token mismatch: " << LA(1) 
          << "!=" << t << ANTLR_USE_NAMESPACE(std)endl;
      }
     throw MismatchedTokenException(getTokenNames(), getNumTokens(),
         LT(1), t, false, getFilename());
  } else {
     // mark token as consumed -- fetch next token deferred until LA/LT
     consume();
  }
}

These code snippets also go a long way toward affirming what you already know: At the lexer level, the code deals with individual characters, while at the parser level, you're dealing with tokens and their types. It's not important for you to have a full understanding of what the above code is doing, but the overall idea helps you understand how error handlers in ANTLR work.


Processing include files

Learn how to work with include files in ANTLR.

Include file processing

It is an extremely common occurrence in computer programs for source files to include other source or header files. Software programming languages such as C and C++ have the #include directive to include another file; even hardware description languages like Verilog have the `include directive, which behaves identically.

Note, however, that the language standard makes no special provision for supporting the include construct: The grammar for the language is always built on the assumption that the parser will receive a continuous token stream from the lexer. This assumption makes life tough for the compiler developer. Evidently, there must be a separate preprocessing step prior to actual parsing that flattens out the included file contents before the flattened file is provided as input to the lexer. Included files can in turn be nested. For example, in C++, a .cpp file could include a header file, which in turn could include a bunch of other headers, all of which may come from different folders.

Using ANTLR, there's no need for a separate preprocessing step. Here's a very small subset of C++ code as a potential grammar. It allows only int and long declarations and #include directives. Before providing the grammar file for this small subset, I should mention that your grammar would probably be different from a similar grammar file for, say, flex/bison or lex/yacc, which is based on preprocessing. More specifically, your grammar would have a lexical token for the include directive, and actions would be taken on encountering the directive. Listing 1 shows the bare-bones grammar file.

Listing 1. Basic grammar file
header { 
#include <string>
#include <iostream>
}

options {
  language = Cpp;
}

class TParser extends Parser;

startRule: ( declaration )+ ;
declaration: type a:ID
                   { ANTLR_USE_NAMESPACE(std)cout << "decl " 
                       << a->getText() << ANTLR_USE_NAMESPACE(std)endl; }
                   (COMMA b:ID
                   { ANTLR_USE_NAMESPACE(std)cout << "decl " 
                       << b->getText() << ANTLR_USE_NAMESPACE(std)endl; }
                   )* SEMI; 
type: INT | LONG; 

{
#include <fstream>
#include <iostream>
#include "TParser.hpp"
}

class TLexer extends Lexer;

options {
  charVocabulary = '\3'..'\377';
  k=2;
}

tokens {
  INT="int";
  LONG="long";
}

SEMI: ';';
COMMA: ',';
ID: ('a'..'z')+; 
INCLUDE: "#include" (WS)? STRING;
STRING: '"'! ( ~'"' )* '"'! ;
WS : ( ' '
     | '\t'
     | '\f'
     | ( options {generateAmbigWarnings=false;}
       : "\r\n"  // DOS
       | '\r'    // Macintosh
       | '\n'    // UNIX
       )
       { newline(); }
     )+
     { $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP); } ;

Strategies for handling the INCLUDE directive

The easiest thing to do is to create new instances of the lexer and parser classes, TLexer and TParser, on encountering the INCLUDE directive, feed the new lexer with the included file, and then pass the lexer to the new parser. The logic here is very simple: The included file's syntax is supposed to be the same as that of the file currently being parsed (C headers and source files share the same Backus-Naur Form [BNF] grammar). So, simply creating a new lexer-parser combination that parses the included file and then returns control to the current parser should do the trick. This automatically handles nested include files as well (that is, included files themselves containing #include directives).

The code in Listing 2 is self-explanatory. There is an issue with this code, however. Can you spot it?

Listing 2. Individual lexer and parser per input stream
INCLUDE :  "#include" (WS)? f:STRING
  {
  ANTLR_USING_NAMESPACE(std)
  // create lexer to handle include
  string name = f->getText();
  ifstream* input = new ifstream(name.c_str());
  if (!*input) {
    cerr << "cannot find file " << name << endl;
  }

  TLexer inclexer(*input);
  TParser incParser(inclexer);
  inclexer.setFilename(name);
  incParser.setFilename(name);

  incParser.startRule();
  $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
  };

The code in Listing 2 works well, but the problem is that for every included file, a new lexer-parser pair is created, and that's never a good use of the available memory. In the worst case, N nested #include directives result in N live copies of TParser and TLexer in memory, a bad memory footprint. Fortunately, there's a smarter way of tackling this issue in ANTLR: the built-in TokenStreamSelector class. Consider the code in Listing 3.

Listing 3. Include file processing using the TokenStreamSelector class
#include <main.h> 
INCLUDE  :  "#include" (WS)? f:STRING
    {
    ANTLR_USING_NAMESPACE(std)
    // create lexer to handle include
    string name = f->getText();
    ifstream* input = new ifstream(name.c_str());
    if (!*input) {
        cerr << "cannot find file " << name << endl;
    }
    TLexer* sublexer = new TLexer(*input);
    // make sure errors are reported in right file
    sublexer->setFilename(name);
    parser->setFilename(name);

    // push the previous lexer stream and make sublexer current
    selector.push(sublexer);
    // ignore this token, re-look for token in the switched stream
    selector.retry(); // throws TokenStreamRetryException
    }
    ;

Contents of main.h: 
#ifndef _MAIN_HPP
#define _MAIN_HPP

#include "antlr/TokenStreamSelector.hpp"

class TParser;
class TLexer;

// Define a selector that can handle nested include files.
// These variables are public so the parser/lexer can see them.
extern ANTLR_USE_NAMESPACE(antlr)TokenStreamSelector selector;
extern TParser* parser;
extern TLexer* mainLexer;

#endif //_MAIN_HPP

So, what's happening here? Instead of attaching the lexer with the parser directly, as you did earlier, you've attached an object of type TokenStreamSelector to the parser. In turn, the TokenStreamSelector object has been initialized with the lexer object.

How does all this add up? The parser is always looking for the next token, and it doesn't matter who provides it. It's possible to attach multiple lexer streams to the TokenStreamSelector class, and you can always switch to the desired input stream using the select method defined as part of TokenStreamSelector. For every include directive, you create a new TLexer object and switch the input stream accordingly; on hitting the end of file (EOF) for the stream, you switch back to the previous stream. Listing 4 shows the source code.

Listing 4. The uponEOF method in the lexer and the main.cpp source listing
class TLexer extends Lexer;
…
options {
  charVocabulary = '\3'..'\377';
  k=2;
}

{
public:
    void uponEOF()  {
        if ( selector.getCurrentStream() != mainLexer ) {
            selector.pop(); // return to old lexer/stream
            selector.retry();
        }
        else {
             ANTLR_USE_NAMESPACE(std)cout << "Hit EOF of main file\n"  ;
        }
    }
}

main.cpp listing: 
#include <iostream>
#include "main.h"
#include "TLexer.hpp"
#include "TParser.hpp"

ANTLR_USING_NAMESPACE(std)
ANTLR_USING_NAMESPACE(antlr)
TokenStreamSelector selector;
TParser* parser;
TLexer* mainLexer;

int main(int argc,char** argv)
{
   try {
       std::ifstream inputstream("test.c", std::ifstream::in);
       mainLexer = new TLexer(inputstream);
       // notify selector about starting lexer; name for convenience
       selector.addInputStream(mainLexer, "main");
       selector.select("main"); // start with main lexer

       // Create parser attached to selector
       parser = new TParser(selector);
       parser->setFilename("test.c");
       parser->startRule();
  } 
  catch (exception& e) {
    cerr << "exception: " << e.what() << endl;
  }
  return 0;
}

A bit of explanation is in order, but before that, let's analyze the impact of the code. Even in the worst-case scenario of N nested #include directives, you have N copies of TLexer and only one copy of TParser. That's a significant savings of memory.


Compilers

Before delving into the sources, a bit of the fundamentals of compilers is in order.

Compiler fundamentals

The parser always looks for the next token to match the grammar rules, and the lexer keeps supplying tokens until it hits the end of the input stream. Internally, the ANTLR classes CharScanner (from which TLexer is derived) and TokenStreamSelector are both derived from the TokenStream class, and each defines its own version of the nextToken method, which keeps returning the next token from the input stream. The TParser class doesn't really care about the input stream: It has absolutely no sense that the input stream has switched, and it keeps calling the nextToken method of the associated TokenStream object (which could be a TLexer or a TokenStreamSelector).

Understand the code

With this understanding, take another look at the code in Listing 4. At the beginning, the parser is initialized with a TokenStreamSelector object, which in turn is initialized with a TLexer object created to parse the first source file. On encountering an INCLUDE directive, a new TLexer object is created to parse the included file, the previous lexer (and its input stream) is pushed onto a stack maintained inside the TokenStreamSelector object, and the newly created TLexer is attached to the TokenStreamSelector to provide the next token. (The push method does all of this.)

The uponEOF method

The final piece of the puzzle is the uponEOF method that you've defined as part of the lexer. If you look into the generated ANTLR code, you'll see that you're redefining the uponEOF method for the TLexer class; it has an empty body in the CharScanner class from which TLexer is derived. This method is called from the nextToken method of the TLexer class when it hits the end of an input stream. (Check out the ANTLR-generated code to understand this in further detail.)

So, what does uponEOF do? It simply switches back to the previous input stream by calling the pop method when it hits the end of the current input stream. Is that all? Nope: Remember that this method is being called from within the lexer, which in turn is being called by the parser to provide the next token. So, you must arrange for TokenStreamSelector::nextToken to return the next token from the switched stream.

For this purpose, uponEOF calls the selector's retry method, which internally throws a TokenStreamRetryException that gets caught in TokenStreamSelector::nextToken. Here's the call stack:

TokenStreamSelector::nextToken
TLexer::nextToken
TLexer::uponEOF

Here's the code for TokenStreamSelector::nextToken:

RefToken TokenStreamSelector::nextToken()
{
  // keep looking for a token until you don't
  // get a retry exception
  for (;;) {
    try {
      return input->nextToken();
    }
    catch (TokenStreamRetryException& /*r*/) {
      // just retry "forever"
    }
  }
}

You don't need to do anything to the nextToken method; your only responsibilities are to properly define the uponEOF method for TLexer, create the stream-selector object, and attach it to the parser. The beauty of this approach is that input in the above code now refers to the switched stream, and the parser is seamlessly provided the next token.


Optimizing your include file processing

In this section, learn how to optimize the performance of include file processing.

Performance optimization

The TokenStreamSelector object scheme works well, but you're still creating a new lexer for each include directive. It would be worth the effort to get the same thing done using a single lexer-parser combination, making your code extremely lean.

Note: The technique described here is specific to ANTLR version 2.7.2. Future versions of ANTLR might have internal data structures modified, and this method won't work. However, understanding this scheme will give you a better grasp of how ANTLR works internally: You can always adapt this method to future generations of the ANTLR tool.

Every ANTLR lexer maintains internal fields for the file name, the current line number, the column number, and the input stream (a std::ifstream). To reuse the same lexer across files, you must save this data when you hit an include directive, reset the internal lexer fields, and continue processing with the included file. On encountering EOF for the included file, you switch back to the previous file by restoring the saved data onto the lexer. Clearly, you need a structure that holds these four fields, plus a stack of such structures. You also need a global input stream pointer to keep track of the current input stream. Here's the initial code:

#include <stack>
#include <string>
#include <fstream>

struct LexerState { 
  int line, column;
  std::string filename;
  std::ifstream* input;
  LexerState() : line(0), column(0), input(0) { }
  LexerState(int lineNo, int colNo, std::string file, std::ifstream* in) : 
    line(lineNo), column(colNo), filename(file), input(in) { }
};

std::stack<LexerState> LexerStateStack;
std::ifstream* gCurrentInputStream = 0;

Now, on encountering the INCLUDE token in the lexer, perform the following steps:

  1. Populate a LexerState object from the current lexer.
  2. Push this LexerState object into the LexerStateStack.
  3. Reset the internal fields of the lexer.
  4. Change the gCurrentInputStream to point to the switched stream.

Listing 5 shows this process.

Listing 5. Switching input streams
INCLUDE :  "#include" (WS)? f:STRING
  {
  ANTLR_USING_NAMESPACE(std)
  string name = f->getText();
  std::ifstream* input = new std::ifstream(name.c_str());
  if (!*input) {
    cerr << "cannot find file " << name << endl;
  }
  // store the current input state
  LexerState state(this->getLine(), this->getColumn(), this->getFilename(),
          gCurrentInputStream);
  LexerStateStack.push(state);
        
  // reset the input state
  ((antlr::LexerInputState*)(this->getInputState()))->reset();
  this->setFilename(name);
  this->getInputState()->initialize(*input, name.c_str());
  gCurrentInputStream = input;
  parser->setFilename(name);
  };

Is this sufficient? Unfortunately, no.

Proper parser functioning

You're still missing out on an important premise: The parser is not aware of the stream switch. You must do two things for the parser to function properly, even after the stream is switched:

  • The INCLUDE token must not be passed to the parser. The parser has no grammar rule to handle an INCLUDE token. You need a way of skipping this token and moving on to the next token in the switched stream.
  • When you hit the end of an included file, switch back to the previous stream in a way that is opaque to the parser. You already know that the lexer's nextToken method is the only thing the parser sees, so that's the method to tweak: After encountering the INCLUDE token and performing the stream switch, nextToken must be called again so that the next token comes from the switched stream. Note that directly modifying the generated nextToken method is a bad idea, because the method is overwritten every time you run ANTLR on the grammar file. It's best, then, to derive a class from TLexer and override the nextToken method there.

To solve problem 1, add the following piece of code inside the lexer rule for INCLUDE:

...
parser->setFilename(name); // from previous figure
$setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);

This code makes sure that the lexer's nextToken method in the generated code does not return the INCLUDE token to the parser but instead looks again into the input stream for the next token. In effect, it skips the INCLUDE token.

To solve problem 2, you derive a new class—MLexer—from the existing TLexer, then override the uponEOF and nextToken methods accordingly, as shown in Listing 6.

Listing 6. The MLexer class
class MLexer : public TLexer 
  {
  public: 
    MLexer(std::ifstream& in) : TLexer(in) { }
    void uponEOF() {
      if ( !LexerStateStack.empty() ) {
        LexerState state = LexerStateStack.top();
        LexerStateStack.pop(); 
        this->getInputState()->initialize(*state.input, state.filename.c_str());
        this->setLine(state.line);
        this->setColumn(state.column);
        gCurrentInputStream = state.input;
        throw ANTLR_USE_NAMESPACE(antlr)TokenStreamRetryException();
        }
      else {
        ANTLR_USE_NAMESPACE(std)cout << "Hit EOF of main file" <<
          ANTLR_USE_NAMESPACE(std)endl;
      }
    }
 RefToken nextToken() {
   // keep looking for a token until you don't get a retry exception
      for (;;) {
        try {
          return TLexer::nextToken();
        }
        catch (TokenStreamRetryException& /*r*/) {
          // just retry "forever"
        }
      }
    }
  };

Note that the uponEOF method uses an exception to return control to the MLexer::nextToken method. The retry loop lives in MLexer because the generated TLexer::nextToken method doesn't catch TokenStreamRetryException itself.

Modify the main routine

You must also modify the main routine. Instead of creating a TLexer object, you now create an object of type MLexer. The rest of the sources and the grammar file remain unchanged. Listing 7 shows the main routine.

Listing 7. The modified main method with the MLexer class
int main(int argc,char** argv)
{
  try {
    std::ifstream inputstream("test.c", std::ifstream::in);
    MLexer* mainLexer = new MLexer(inputstream);
    mainLexer->setFilename("test.c");
    
    parser = new TParser(*mainLexer);
    parser->setFilename("test.c");
    gCurrentInputStream = &inputstream;
    parser->startRule();
  }
  catch(exception& e) {
    cerr << "exception: " << e.what() << endl;
  }
return 0;
}

Run the code in Listing 7 inside a good debugger. Observe in particular how the nextToken method works.


Error handling

The next topic is the error-handling strategy in the parser.

Recover from errors in user code

Error handling can easily be deemed the differentiator between an amateur and a professional compiler developer. Users typically expect a compiler to parse the entire input stream as opposed to exiting on the first occurrence of an error in the user code. For the compiler, this means that it must recover from the error after it encounters an erroneous token, and the parser must keep consuming the tokens until it reaches a known state.

ANTLR has several built-in exceptions to ease the programmer's burden. Before looking at those, here's the general form of an exception handler in the grammar file:

rule :  <..grammar rules..> ; 
        exception [label] 
        catch [antlr::ANTLRException& e] {
           // do the needed error handling
        }

The syntax is similar to the C++ exception-handling mechanism. Exception handlers can be attached to a grammar rule as a whole, to individual alternatives of a rule, or to a labeled element. It's easy enough to understand why this exception-handling strategy is designed the way it is: Each grammar rule is implemented as a method of the parser class in the generated C++ code. Inside each such method is a try...catch block implementing the exception-handling functionality; the handler code is copied almost verbatim from the grammar file.

Manipulate the exception class hierarchy

By default, exception handling is turned on when ANTLR generates code: The generated code contains try...catch blocks and the code necessary to handle exceptions as and when they are thrown. If you want to create a custom exception handler, you can do so by using and extending the available exception class hierarchy. To turn default error handling off, add defaultErrorHandler=false; to the options section of the parser:

class TParser extends Parser;
options { 
  k=2;
  defaultErrorHandler=false;
}

Note that irrespective of whether the defaultErrorHandler option is present, the code for I/O exceptions (TokenStreamIOException) is always generated. If the defaultErrorHandler option is false and no error-handling strategy is provided in the parser, the exception is propagated all the way back to the calling code. You must then provide try...catch blocks in the calling routine, as in the following snippet:

int main(int argc,char** argv)
{
  try {
    … // usual code to set up lexer/parser
    parser->startRule();
  }
  catch(ANTLR_USE_NAMESPACE(antlr)RecognitionException& e) {
    // do the needful 
  }
  catch(ANTLR_USE_NAMESPACE(antlr)TokenStreamException& e) { 
    // do the needful
  }
return 0;
}

All ANTLR exceptions are derived from the ANTLRException class. The class hierarchy is shown in Figure 1.

Figure 1. ANTLR exception hierarchy

One of the first things to understand about ANTLR exception handling is that the exception-handling mechanism is not restricted to the parser. The lexer uses the same scheme, and earlier in this tutorial you used TokenStreamRetryException in good measure. The parser and its grammar rules use the RecognitionException and TokenStreamException classes, while the lexer uses all three branches of the hierarchy. The sections that follow provide a basic description of the two most common exceptions that the parser uses in the default mode.

antlr::MismatchedTokenException

The antlr::MismatchedTokenException exception (derived from RecognitionException) is thrown when the parser finds a token different from what was expected. Consider the grammar from Listing 1, and now consider the following faulty input:

int err
int a,b;
#include "incl.h"
int c;

Note: Assume that incl.h exists in the same directory and has a single line containing int x,z,y;.

Instead of encountering a semicolon (the token SEMI), the parser encounters an integer declaration (the INT token from the lexer) after it has processed the token for err. This is clearly a case of a mismatched token. Accordingly, you get the following ANTLR output:

decl err
test.c:2:1: expecting SEMI, found 'int'
decl x
decl z
decl y
decl c

It's possible to make the above error message slightly more verbose by using the paraphrase option. Here's the lexer rule for SEMI rewritten:

SEMI
  options { 
    paraphrase="semicolon";
  }
    :  ';' ;

The output would now read:

decl err
test.c:2:1: expecting semicolon, found 'int'
decl x
decl z
decl y
decl c

antlr::NoViableAltException

The antlr::NoViableAltException exception (also derived from RecognitionException) is thrown when the parser finds an unexpected token while deciding among the alternatives of a grammar rule. This exception is similar to the mismatched-token exception in the sense that an unexpected token is hit upon in the parser; however, NoViableAltException is thrown when the unexpected token appears at a decision point (that is, as the first token of a series), while the mismatched exception is thrown when any later token in the stream is not what was expected. Consider a small variant of the errant input discussed earlier:

err
int a,b;
#include "incl.h"
int c;

In this case, the parser's startRule was expecting a token of type INT or LONG but got a token of neither type. Clearly, no alternative of the grammar rule is viable; hence, the NoViableAltException exception is thrown. The error message printed is:

test.c:1:1: unexpected token: err
decl x
decl z
decl y

Again, if an empty file is provided as the input, this too results in NoViableAltException being thrown, because there's no match for the EOF token while looking for alternatives in startRule. The message that comes this time around is:

test.c:1:1: unexpected end of file

Common lexer classes

This section briefly looks into some of the more common exception classes typically used in a lexer.

Exceptions for rogue or unexpected characters

The antlr::MismatchedCharException exception (derived from RecognitionException) is thrown when the CharScanner::match() method hits upon a rogue character. Consider this snippet:

int a#;
#include "incl.h" 
int c;

Because the lexer didn't expect the #, it throws a MismatchedCharException. An error message such as this appears:

test.c:1:6: expecting semicolon, found '#'
#include "incl.h"
err
int c;

If the lexer finds an unexpected character while trying to decide which token type to match, it throws the antlr::NoViableAltForCharException exception. For the input int 5a,b;, you can verify that this is indeed the case, because the definition for ID in your grammar does not allow a leading digit and no other lexer rule matches numbers.
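For reference, a conventional ANTLR 2 lexer rule for ID looks something like the sketch below (the exact rule in the tutorial's grammar may differ). Because no alternative begins with a digit, the 5 in int 5a,b; leaves the lexer with no viable alternative to try:

```antlr
ID
  options { paraphrase = "an identifier"; }
  :  ('a'..'z' | 'A'..'Z' | '_')
     ('a'..'z' | 'A'..'Z' | '_' | '0'..'9')*
  ;
```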

Why bother with the exceptions?

So, why bother yourself with the exceptions when ANTLR already handles them well in the default scheme of things? One typical reason is that you might need a more tool-specific message than the one ANTLR provides. For example, when parsing a language and reporting errors, it's always a good idea to cite the section of the language standard that the user's input has violated.

However, to understand the subtler reason why it sometimes makes sense to override the default scheme, you must look into the generated code. Using the code from Listing 1 and Listing 2, revisit the errant input used earlier for NoViableAltException from the parser and its output when default ANTLR exception handling is on:

Error File: 
err
int a,b;
#include "incl.h"
int c;

Output: 
test.c:1:1: unexpected token: err
decl x
decl z
decl y

Notice that no messages appear for the declarations of a, b, and c. To understand why, look at the generated code for startRule:

void TParser::startRule() {
   try {      // for error handling
     ...        // usual parser code
   }
   catch (ANTLR_USE_NAMESPACE(antlr)RecognitionException& ex) {
      reportError(ex);
      consume();
      consumeUntil(_tokenSet_0);
   }
}

The interesting part is the catch block. The three methods reportError, consume, and consumeUntil are all defined in the Parser class from which TParser is derived. The reportError method prints the diagnostic: "unexpected token" or "unexpected end of file." The consume method does what its name suggests: it consumes the current token (for example, the ID token created for err).

Finally, the consumeUntil method keeps discarding tokens until it reaches EOF. It's during this discarding that the lexer hits the INCLUDE token, which in turn prints the declarations for x, z, and y. Clearly, you need a way to restore grammatical checking of the input stream after the token for err has been consumed.

Turn the defaultErrorHandler option off (set defaultErrorHandler=false in the parser's options section), then attach the following exception handler to startRule:

startRule  :  ( decl )+
  ;
exception
catch [ANTLR_USE_NAMESPACE(antlr)NoViableAltException& e]
    {
    reportError(e);
    consume();
    consumeUntil(INT); // keep looking till you find an INT
    startRule(); // re-run the rule 
    }

With this snippet, the output now looks a lot more reasonable:

test.c:1:1: unexpected token: err
decl a
decl b
decl x
decl z
decl y
decl c

So, what have you done here? After consuming the error token, you keep skipping tokens until the next int declaration; when you find one, you re-run startRule to parse the rest of the input stream. Typical user-defined error handlers are variants of this strategy: the extra work buys you finer-grained error recovery and more informative messages.


Summary

This tutorial covered both include file processing and error handling in ANTLR version 2.7.2. Note that future versions of ANTLR might address the same issues in a different manner; as such, you should strictly adhere to the documentation of the ANTLR version you're using.

Resources

Get products and technologies

  • cpp: For a complete C++ preprocessor using ANTLR and Java as the implementation language, try cpp.
  • ANTLR graphical user interface: Check out Placid Systems' commercial ANTLR GUI.
  • IBM trial software: Build your next development project with software for download directly from developerWorks.
