The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

All information submitted is secure.

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

All information submitted is secure.

# Building custom language parsers

Solving common parsing problems with ANTLR

Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems. You can reach him at arpansen@gmail.com.
(An IBM developerWorks Contributing Author)

Summary:  There are certain things about ANTLR that, if understood, help in faster debugging and provide a fuller appreciation of how the tool works. Learn how to use ANTLR to create smarter parsing solutions.

Date:  11 Mar 2008

Activity:  17525 views

Get the background information you need to work with ANTLR.

Essential ANTLR facts

There are certain things about ANTLR that, if understood, help in faster debugging and provide a fuller appreciation of how the tool works. With this knowledge, you'll also be able to design smarter parsing solutions. Here are two absolutely essential ANTLR facts that you must know.

LL(k) parsing and the look-ahead strategy

ANTLR is an LL(k) parser—that is, it follows a top-down parsing algorithm, with k tokens of look-ahead parsing the input stream from left to right. How do you decide how many tokens of look-ahead are needed for parsing in ANTLR? Consider a simple grammatical rule that has to support the greater than (`>`) and greater-than-or-equal-to (`>=`) operators. The lexer must decide on which token to create for the parser when it hits upon the `>` character in the input stream. There are two ways to handle this situation:

• Using the `\$setType` predefined ANTLR function:  ```GT : '>' ('=' {\$setType(GTE);})? ; GTE : ">="; ```

The trick here is to make the lexer check for the next character when it hits `>`. If the next character is `=`, create and return a token of type GTE using the `setType` routine.
• Using `k=2;` in the `Options` section of the lexer: This specifies that the lexer will look at the next two characters of the input stream to determine which alternative to choose. The extra look-ahead is only used when actually necessary for disambiguation. (For example, in cases such as `>` and `<`, the extra look-ahead is not used.) However, exercise caution in deciding the look-ahead value: Generally, a larger k value results in a slower lexer.

Code generation

ANTLR takes in a grammar file (typically a file with a .g file extension) and generates `C++` or Java™ code. A casual look into the generated code reveals that it has created separate classes for both the lexer and the parser. Every grammar rule is now implemented as a `C++` or Java method that is part of the parser class, and every lexical identifier is likewise a method of the lexer class. Consider the following code snippet, which illustrates this fact for a single grammar rule and lexical identifier:

 ```options { language="Cpp"; // implies that code is generated in C++ namespace="preprocessor"; } class TParser extends Parser; declaration : INT ID; class TLexer extends Lexer; options { k=2;} tokens { INT="int"; } SEMI : ';' ; ID : ('a'..'z')+ ```

Two classes are generated for the above snippet: `TParser` and `TLexer`. The generated code has a method called `TParser::declaration` corresponding to the grammar rule declaration; likewise, the `TLexer::mSEMI` is generated for the lexer rule `SEMI`. Let's look into the generated code for `TLexer`:

 ```void TLexer::mSEMI(bool _createToken) { int _ttype; ANTLR_USE_NAMESPACE(antlr)RefToken _token; int _begin=text.length(); _ttype = SEMI; int _saveIndex; match(static_cast(';')); // .. some more ANTLR processing .. _returnToken = _token; _saveIndex=0; } ```

Likewise, for the `TParser::declaration`:

 ```void TParser::declaration() { try { // for error handling match(INT); match(ID); } catch (ANTLR_USE_NAMESPACE(antlr)RecognitionException& ex) { reportError(ex); consume(); consumeUntil(_tokenSet_0); } } ```

Run the above example, and look into the generated sources. In the `TLexer::mSEMI` method, the `match` routine is of particular significance. The code for `match` is predefined inside ANTLR sources; it's a part of the `CharScanner` class from which `TLexer` is derived in the generated code. Simple enough to understand, it just validates for the existence of a specific character in the input stream. Here's the code for `match` from the ANTLR sources:

 ```virtual void match(int c) // Part of CharScanner class { int la_1 = LA(1); if ( la_1 != c ) throw MismatchedCharException(la_1, c, false, this); consume(); } ```

(Note the exception that this code is throwing in case of a mismatch. This concept will be used later in discussing the error-handling strategies.)

The code for `TParser::declaration` is fairly self-defining: You match the two token types and, if the match is negative, throw an exception. Note that INT and ID, originally defined in the lexer, correspond to token types of type integer in the generated code. Here's the code for the `match` routine for the parser. Note that the `LA` (look-ahead) method in the parser looks for the token type and not any character or character stream:

 ```void Parser::match(int t) // t -> token type { if ( DEBUG_PARSER ) { traceIndent(); ANTLR_USE_NAMESPACE(std)cout << "enter match(" << t << ") with LA(1)=" << LA(1) << ANTLR_USE_NAMESPACE(std)endl; } if ( LA(1)!=t ) { if ( DEBUG_PARSER ) { traceIndent(); ANTLR_USE_NAMESPACE(std)cout << "token mismatch: " << LA(1) << "!=" << t << ANTLR_USE_NAMESPACE(std)endl; } throw MismatchedTokenException(getTokenNames(), getNumTokens(), LT(1), t, false, getFilename()); } else { // mark token as consumed -- fetch next token deferred until LA/LT consume(); } } ```

These code snippets also go a long way toward affirming what you already know: At the lexer level, the code deals with individual characters, while at the parser level, you're dealing with tokens and their types. It's not important for you to have a full understanding of what the above code is doing, but the overall idea helps you understand how error handlers in ANTLR work.

2 of 10

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=AIX and UNIX
ArticleID=293295
TutorialTitle=Building custom language parsers
publish-date=03112008