Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

developerWorks Community:

  • Close [x]

Building custom language parsers

Solving common parsing problems with ANTLR

Arpan Sen (arpan@syncad.com), Lead Engineer, Synapti Computer Aided Design Pvt Ltd
Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems. You can reach him at arpansen@gmail.com.
(An IBM developerWorks Contributing Author)

Summary:  There are certain things about ANTLR that, if understood, help in faster debugging and provide a fuller appreciation of how the tool works. Learn how to use ANTLR to create smarter parsing solutions.

Date:  11 Mar 2008
Level:  Intermediate PDF:  A4 and Letter (86 KB)Get Adobe® Reader®

Activity:  17525 views
Comments:  

Getting the facts about ANTLR

Get the background information you need to work with ANTLR.

Essential ANTLR facts

There are certain things about ANTLR that, if understood, help in faster debugging and provide a fuller appreciation of how the tool works. With this knowledge, you'll also be able to design smarter parsing solutions. Here are two absolutely essential ANTLR facts that you must know.

LL(k) parsing and the look-ahead strategy

ANTLR is an LL(k) parser—that is, it follows a top-down parsing algorithm, with k tokens of look-ahead parsing the input stream from left to right. How do you decide how many tokens of look-ahead are needed for parsing in ANTLR? Consider a simple grammatical rule that has to support the greater than (>) and greater-than-or-equal-to (>=) operators. The lexer must decide on which token to create for the parser when it hits upon the > character in the input stream. There are two ways to handle this situation:

  • Using the $setType predefined ANTLR function:
    GT : '>' ('=' {$setType(GTE);})? ; 
    GTE : ">=";
    

    The trick here is to make the lexer check for the next character when it hits >. If the next character is =, create and return a token of type GTE using the setType routine.
  • Using k=2; in the Options section of the lexer: This specifies that the lexer will look at the next two characters of the input stream to determine which alternative to choose. The extra look-ahead is only used when actually necessary for disambiguation. (For example, in cases such as > and <, the extra look-ahead is not used.) However, exercise caution in deciding the look-ahead value: Generally, a larger k value results in a slower lexer.

Code generation

ANTLR takes in a grammar file (typically a file with a .g file extension) and generates C++ or Java™ code. A casual look into the generated code reveals that it has created separate classes for both the lexer and the parser. Every grammar rule is now implemented as a C++ or Java method that is part of the parser class, and every lexical identifier is likewise a method of the lexer class. Consider the following code snippet, which illustrates this fact for a single grammar rule and lexical identifier:

options {
  language="Cpp"; // implies that code is generated in C++
  namespace="preprocessor";
}
class TParser extends Parser;
declaration : INT ID;

class TLexer extends Lexer;
options { k=2;} 
tokens { 
  INT="int";
}

SEMI : ';' ;
ID :  ('a'..'z')+

Two classes are generated for the above snippet: TParser and TLexer. The generated code has a method called TParser::declaration corresponding to the grammar rule declaration; likewise, the TLexer::mSEMI is generated for the lexer rule SEMI. Let's look into the generated code for TLexer:

void TLexer::mSEMI(bool _createToken) {
        int _ttype; ANTLR_USE_NAMESPACE(antlr)RefToken _token; int _begin=text.length();
        _ttype = SEMI;
        int _saveIndex;

        match(static_cast<unsigned char>(';'));
        // .. some more ANTLR processing .. 
        _returnToken = _token;
        _saveIndex=0;
}

Likewise, for the TParser::declaration:

void TParser::declaration() 
  {
  try {      // for error handling
     match(INT);
     match(ID);
  }
  catch (ANTLR_USE_NAMESPACE(antlr)RecognitionException& ex) {
       reportError(ex);
       consume();
       consumeUntil(_tokenSet_0);
      }
}

Run the above example, and look into the generated sources. In the TLexer::mSEMI method, the match routine is of particular significance. The code for match is predefined inside ANTLR sources; it's a part of the CharScanner class from which TLexer is derived in the generated code. Simple enough to understand, it just validates for the existence of a specific character in the input stream. Here's the code for match from the ANTLR sources:

virtual void match(int c) // Part of CharScanner class
  {
  int la_1 = LA(1);
  if ( la_1 != c )
    throw MismatchedCharException(la_1, c, false, this);
  consume();
  }

(Note the exception that this code is throwing in case of a mismatch. This concept will be used later in discussing the error-handling strategies.)

The code for TParser::declaration is fairly self-defining: You match the two token types and, if the match is negative, throw an exception. Note that INT and ID, originally defined in the lexer, correspond to token types of type integer in the generated code. Here's the code for the match routine for the parser. Note that the LA (look-ahead) method in the parser looks for the token type and not any character or character stream:

void Parser::match(int t) // t -> token type 
{
  if ( DEBUG_PARSER )  {
    traceIndent();
   ANTLR_USE_NAMESPACE(std)cout << "enter match(" << t << ") with    
   LA(1)=" << LA(1) << ANTLR_USE_NAMESPACE(std)endl;
  }
  if ( LA(1)!=t ) {
    if ( DEBUG_PARSER ) {
      traceIndent();
      ANTLR_USE_NAMESPACE(std)cout << "token mismatch: " << LA(1) 
          << "!=" << t << ANTLR_USE_NAMESPACE(std)endl;
      }
     throw MismatchedTokenException(getTokenNames(), getNumTokens(),   
     LT(1), t, false, getFilename());
  } else {
     // mark token as consumed -- fetch next token deferred until LA/LT
     consume();
  }
}

These code snippets also go a long way toward affirming what you already know: At the lexer level, the code deals with individual characters, while at the parser level, you're dealing with tokens and their types. It's not important for you to have a full understanding of what the above code is doing, but the overall idea helps you understand how error handlers in ANTLR work.

2 of 10 | Previous | Next

Comments



static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=AIX and UNIX
ArticleID=293295
TutorialTitle=Building custom language parsers
publish-date=03112008
author1-email=arpan@syncad.com
author1-email-cc=