Building custom language parsers

Solving common parsing problems with ANTLR

Arpan Sen (arpan@syncad.com), Lead Engineer, Synapti Computer Aided Design Pvt Ltd
Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems. You can reach him at arpansen@gmail.com.
(An IBM developerWorks Contributing Author)

Summary:  There are certain things about ANTLR that, if understood, help in faster debugging and provide a fuller appreciation of how the tool works. Learn how to use ANTLR to create smarter parsing solutions.

Date:  11 Mar 2008
Level:  Intermediate


Processing include files

Learn how to work with include files in ANTLR.

Include file processing

It is extremely common for source files in computer programs to include other source or header files. Programming languages such as C and C++ have the #include directive for this purpose; even hardware description languages like Verilog have a similar `include directive.

Note, however, that language standards typically make no special provision for the include construct: the grammar for the language is defined on the assumption that the parser receives a single, continuous token stream from the lexer. This assumption makes life tough for the compiler developer. The usual answer is a separate preprocessing step, run before actual parsing, that splices the contents of each included file into the source so that the flattened result can be fed to the lexer. Included files can in turn be nested. For example, in C++, a .cpp file may include a header, which may itself include a bunch of other headers, all potentially from different folders.
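To make the preprocessing approach concrete, here is a minimal sketch of such a flattening pass. It is not part of the article's ANTLR solution; the function name flattenIncludes and the in-memory map standing in for the file system are illustrative inventions. A real preprocessor would open files on disk and handle include paths, conditionals, and cycle detection.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Recursively expand #include "name" directives into one flat text.
// The map plays the role of the file system: include name -> contents.
std::string flattenIncludes(const std::string& text,
                            const std::map<std::string, std::string>& files) {
    std::istringstream in(text);
    std::ostringstream out;
    std::string line;
    const std::string kw = "#include \"";
    while (std::getline(in, line)) {
        std::string::size_type close = line.rfind('"');
        if (line.compare(0, kw.size(), kw) == 0 && close >= kw.size()) {
            // Extract the name between the quotes and splice in its
            // contents; nested includes are handled by the recursion.
            std::string name = line.substr(kw.size(), close - kw.size());
            std::map<std::string, std::string>::const_iterator it =
                files.find(name);
            if (it != files.end())
                out << flattenIncludes(it->second, files);
        } else {
            out << line << '\n';
        }
    }
    return out.str();
}
```

The recursion mirrors the nesting of includes: each included file is fully flattened before the enclosing file continues.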

With ANTLR, there's no need for a separate preprocessing step. As an illustration, consider a grammar for a very small subset of C++ that allows only int and long declarations plus #include directives. Before presenting the grammar file, note that it differs from what you would write for a preprocessing-based toolchain such as flex/bison or lex/yacc: here, the grammar has a lexical token for the include directive itself, and actions are taken when that directive is encountered. Listing 1 shows the bare-bones grammar file.


Listing 1. Basic grammar file
                    
header { 
#include <string>
#include <iostream>
}

options {
  language = Cpp;
}

class TParser extends Parser;

start: ( declaration )+ ;
declaration: type a:ID
                   { ANTLR_USE_NAMESPACE(std)cout << "decl " 
                       << a->getText() << ANTLR_USE_NAMESPACE(std)endl; }
                   (COMMA b:ID
                   { ANTLR_USE_NAMESPACE(std)cout << "decl " 
                       << b->getText() << ANTLR_USE_NAMESPACE(std)endl; }
                   )* SEMI; 
type: INT | LONG; 

{
#include <fstream>
#include <iostream>
#include "TParser.hpp"
}

class TLexer extends Lexer; 
tokens {
  INT="int";
  LONG="long";
}

options {
  charVocabulary = '\3'..'\377';
  k=2;
}

SEMI: ';';
COMMA: ',';
ID: ('a'..'z')+; 
INCLUDE: "#include" (WS)? STRING;
STRING: '"'! ( ~'"' )* '"'! ;
WS: ( ' '
    | '\t'
    | '\f'
    | ( options {generateAmbigWarnings=false;}
      : "\r\n"  // DOS
      | '\r'    // Macintosh
      | '\n'    // UNIX
      )
      { newline(); }
    )+
    { $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP); } ;


Strategies for handling the INCLUDE directive

The easiest thing to do is to create a new copy of the lexer and parser classes, TLexer and TParser, on encountering the INCLUDE directive, feed the lexer with the included file, and then pass the lexer information to the parser. The logic here is very simple: The included file's syntax is supposed to be the same as that of the file currently being parsed (C headers and source files have identical Backus-Naur Form [BNF] notation). So, simply creating a new lexer-parser combination that parses the included file and returns control to the current parser should do the trick. This automatically handles nested include files, as well (that is, included files themselves containing #include directives).

The code in Listing 2 is self-explanatory. There is an issue with this code, however. Can you spot it?


Listing 2. Individual lexer and parser per input stream
                    
INCLUDE: "#include" (WS)? f:STRING
  {
  ANTLR_USING_NAMESPACE(std)
  // create lexer to handle include
  string name = f->getText();
  ifstream* input = new ifstream(name.c_str());
  if (!*input) {
    cerr << "cannot find file " << name << endl;
  }

  TLexer inclexer(*input);
  TParser incParser(inclexer);
  inclexer.setFilename(name);
  incParser.setFilename(name);

  incParser.startRule();
  $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
  };

The code in Listing 2 works, but for every included file it creates a fresh lexer-parser pair, and that's a poor use of the available memory. In the worst case, N #include directives result in N live copies of TParser and TLexer, with a correspondingly large memory footprint. Fortunately, ANTLR offers a smarter way to tackle this: the built-in TokenStreamSelector class. Consider the code in Listing 3.


Listing 3. Include file processing using the TokenStreamSelector class
                    
#include "main.h"
INCLUDE  :  "#include" (WS)? f:STRING
    {
    ANTLR_USING_NAMESPACE(std)
    // create lexer to handle include
    string name = f->getText();
    ifstream* input = new ifstream(name.c_str());
    if (!*input) {
        cerr << "cannot find file " << name << endl;
    }
    TLexer* sublexer = new TLexer(*input);
    // make sure errors are reported in right file
    sublexer->setFilename(name);
    parser->setFilename(name);

    // push the previous lexer stream and make sublexer current
    selector.push(sublexer);
    // ignore this token, re-look for token in the switched stream
    selector.retry(); // throws TokenStreamRetryException
    }
    ;

Contents of main.h: 
#ifndef _MAIN_HPP
#define _MAIN_HPP

#include "antlr/TokenStreamSelector.hpp"

class TParser;
class TLexer;

// Define a selector that can handle nested include files.
// These variables are public so the parser/lexer can see them.
extern ANTLR_USE_NAMESPACE(antlr)TokenStreamSelector selector;
extern TParser* parser;
extern TLexer* mainLexer;

#endif //_MAIN_HPP

So, what's happening here? Instead of attaching the lexer directly to the parser, as you did earlier, you've attached an object of type TokenStreamSelector to the parser. The TokenStreamSelector object, in turn, has been initialized with the lexer object.

How does all this add up? The parser simply asks for the next token; it doesn't matter which lexer provides it. Multiple lexer streams can be attached to the TokenStreamSelector, and you can switch to the desired input stream at any time using the selector's select method. For every include directive, you create a new TLexer object and switch the input stream to it; on hitting the end of file (EOF) of that stream, you switch back to the previous stream. Listing 4 shows the source code.


Listing 4. The uponEOF method in the lexer and the main.cpp source listing
                    
class TLexer extends Lexer;
…
options {
  charVocabulary = '\3'..'\377';
  k=2;
}

{
public:
    void uponEOF()  {
        if ( selector.getCurrentStream() != mainLexer ) {
            selector.pop(); // return to old lexer/stream
            selector.retry();
        }
        else {
             ANTLR_USE_NAMESPACE(std)cout << "Hit EOF of main file\n"  ;
        }
    }
}

main.cpp listing: 
#include <iostream>
#include "main.h"
#include "TLexer.hpp"
#include "TParser.hpp"

ANTLR_USING_NAMESPACE(std)
ANTLR_USING_NAMESPACE(antlr)
TokenStreamSelector selector;
TParser* parser;
TLexer* mainLexer;

int main(int argc,char** argv)
{
   try {
       std::ifstream inputstream("test.c", std::ifstream::in);
       mainLexer = new TLexer(inputstream);
       // notify selector about starting lexer; name for convenience
       selector.addInputStream(mainLexer, "main");
       selector.select("main"); // start with main lexer

       // Create parser attached to selector
       parser = new TParser(selector);
       parser->setFilename("test.c");
       parser->startRule();
  } 
  catch (exception& e) {
    cerr << "exception: " << e.what() << endl;
  }
  return 0;
} 

A bit of explanation is in order, but first, consider the impact of this code: even in the worst-case scenario of N nested #include directives, you have N copies of TLexer but only one copy of TParser. That's a significant memory saving.
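The push/pop mechanics behind this saving can be modeled with a plain stack of token sources. The sketch below is not ANTLR code; StreamSelector, push, and nextToken are illustrative names that mimic what TokenStreamSelector's push, pop, and retry combine to achieve in Listings 3 and 4: tokens are drawn from the most recently pushed stream, and EOF on that stream pops control back to the stream beneath it.

```cpp
#include <cstddef>
#include <stack>
#include <string>
#include <vector>

// Toy model of a token stream selector: a stack of token sources.
// nextToken() always draws from the top source; when that source is
// exhausted, control pops back to the source that pushed it, just as
// uponEOF() does in Listing 4.
class StreamSelector {
public:
    // Make the given token stream current; the previous one resumes
    // where it left off once this one is exhausted.
    void push(const std::vector<std::string>* stream) {
        stack_.push(Frame(stream));
    }
    // Returns "" once every pushed stream is exhausted.
    std::string nextToken() {
        while (!stack_.empty()) {
            Frame& top = stack_.top();
            if (top.pos < top.stream->size())
                return (*top.stream)[top.pos++];
            stack_.pop();  // EOF: fall back to the previous stream
        }
        return "";
    }
private:
    struct Frame {
        explicit Frame(const std::vector<std::string>* s)
            : stream(s), pos(0) {}
        const std::vector<std::string>* stream;
        std::size_t pos;
    };
    std::stack<Frame> stack_;
};
```

A single consumer (the parser) drains the selector, so only the lightweight per-stream state is duplicated for each nested include, never the parser itself.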
