Building custom language parsers

Solving common parsing problems with ANTLR

Arpan Sen (arpan@syncad.com), Lead Engineer, Synapti Computer Aided Design Pvt Ltd
Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems. You can reach him at arpansen@gmail.com.
(An IBM developerWorks Contributing Author)

Summary:  There are certain things about ANTLR that, if understood, help in faster debugging and provide a fuller appreciation of how the tool works. Learn how to use ANTLR to create smarter parsing solutions.

Date:  11 Mar 2008
Level:  Intermediate


Optimizing your include file processing

In this section, learn how to optimize the performance of include file processing.

Performance optimization

The TokenStreamSelector object scheme works well, but you're still creating a new lexer for each include directive. You can get the same result with a single lexer-parser combination, which avoids the per-include lexer and keeps your code lean.

Note: The technique described here is specific to ANTLR version 2.7.2. Future versions of ANTLR might change the internal data structures, in which case this method might not work as written. However, understanding this scheme gives you a better grasp of how ANTLR works internally, so you can adapt the method to later versions of the tool.

Every ANTLR lexer maintains internal fields for the file name, current line number, column number, and the input stream (derived from std::ifstream). To use the same lexer across files, you must save this data when you encounter an include directive, reset the internal lexer fields, and continue by processing the included file. On encountering EOF for the included file, you switch back to the previous file by restoring the saved data onto the lexer. You therefore need a structure that holds these four fields, plus a stack of such structures so that nested includes unwind in the right order. You also need a global input stream pointer to keep track of the current input stream. Here's the initial code:

#include <fstream>
#include <stack>
#include <string>

// Snapshot of the lexer fields that must survive a switch to an included file.
struct LexerState {
  int line, column;
  std::string filename;
  std::ifstream* input;
  LexerState() : line(0), column(0), input(0) { }
  LexerState(int lineNo, int colNo, const std::string& file, std::ifstream* in) :
    line(lineNo), column(colNo), filename(file), input(in) { }
};

std::stack<LexerState> LexerStateStack;   // one entry per suspended file
std::ifstream* gCurrentInputStream = 0;   // stream the lexer is reading now
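
The save-and-restore idea can be tried out without ANTLR. The following is only a minimal sketch: ToyLexer, savedStates, enterInclude, and leaveInclude are hypothetical names invented for this illustration, and ToyLexer merely stands in for the position fields that a real ANTLR lexer keeps internally. Starting each file at line 1, column 1 is likewise an assumption of the sketch.

```cpp
#include <cassert>
#include <fstream>
#include <stack>
#include <string>

// Same four fields as the LexerState structure in the article.
struct LexerState {
  int line, column;
  std::string filename;
  std::ifstream* input;
  LexerState(int l, int c, const std::string& f, std::ifstream* in)
    : line(l), column(c), filename(f), input(in) { }
};

// Hypothetical stand-in for the position fields a lexer keeps internally.
struct ToyLexer {
  int line, column;
  std::string filename;
  std::ifstream* input;
  ToyLexer() : line(1), column(1), input(0) { }
};

std::stack<LexerState> savedStates;

// Save the current position, then point the lexer at the start of a new file.
void enterInclude(ToyLexer& lexer, const std::string& name, std::ifstream* in) {
  assert(!name.empty());
  savedStates.push(LexerState(lexer.line, lexer.column, lexer.filename, lexer.input));
  lexer.line = 1;
  lexer.column = 1;
  lexer.filename = name;
  lexer.input = in;
}

// Restore the most recently saved position; returns false if nothing was saved.
bool leaveInclude(ToyLexer& lexer) {
  if (savedStates.empty()) return false;
  const LexerState& s = savedStates.top();
  lexer.line = s.line;
  lexer.column = s.column;
  lexer.filename = s.filename;
  lexer.input = s.input;
  savedStates.pop();
  return true;
}
```

Because the saved states live on a stack, an include inside an included file restores to the correct parent when each file ends.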

Now, on encountering the INCLUDE token in the lexer, perform the following steps:

  1. Populate a LexerState object from the current lexer.
  2. Push this LexerState object onto the LexerStateStack.
  3. Reset the internal fields of the lexer.
  4. Change the gCurrentInputStream to point to the switched stream.

Listing 5 shows this process.


Listing 5. Switching input streams
                    
INCLUDE :  "#include" (WS_)? f:STRING
  {
  ANTLR_USING_NAMESPACE(std)
  string name = f->getText();
  std::ifstream* input = new std::ifstream(name.c_str());
  if (!*input) {
    cerr << "cannot find file " << name << endl;
  }
  // store the current input state
  LexerState state(this->getLine(), this->getColumn(), this->getFilename(),
          gCurrentInputStream);
  LexerStateStack.push(state);
        
  // reset the input state
  ((antlr::LexerInputState*)(this->getInputState()))->reset();
  this->setFilename(name);
  this->getInputState()->initialize(*input, name.c_str());
  gCurrentInputStream = input;
  parser->setFilename(name);
  };

Is this sufficient? Unfortunately, no.


Proper parser functioning

You're still missing an important point: The parser is not aware of the stream switch. You must do two things for the parser to function properly, even after the stream is switched:

  • The INCLUDE token must not be passed to the parser. The parser has no grammar rule to handle an INCLUDE token. You need a way of skipping this token and moving on to the next token in the switched stream.
  • When you hit the end of an included file, switch back to the previous stream in a way that is transparent to the parser. The nextToken method of the lexer is the only thing the parser is aware of, so that is the method you must adjust. In the ANTLR-generated code (discussed earlier), the lexer class already has a nextToken method. After the INCLUDE token is encountered and the stream is switched, nextToken must be called again to fetch the next token. Do not modify the generated nextToken method directly, because the generated code is overwritten every time you run ANTLR on the grammar file. Instead, derive a class from the TLexer class and override the nextToken method there.

To solve problem 1, add the following piece of code inside the lexer rule for INCLUDE:

...
parser->setFilename(name); // last line of the INCLUDE action in Listing 5
$setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
                

This code makes sure that the lexer's nextToken method in the generated code does not return the INCLUDE token to the parser. Instead, the method looks in the input stream again for the next token. In effect, the INCLUDE token is skipped.
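
The effect of a skipped token can be sketched in isolation. The code below is not the ANTLR-generated nextToken method; it is a hypothetical toy (SkippingLexer, Token, and TokenType are invented names) that shows only the general pattern of looping past any token marked SKIP so that the caller never sees it.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

enum TokenType { SKIP, WORD, END_OF_INPUT };

struct Token {
  TokenType type;
  std::string text;
};

// Hypothetical lexer over a prepared token list, used only to show skipping.
class SkippingLexer {
public:
  explicit SkippingLexer(const std::vector<Token>& raw) : raw_(raw), pos_(0) { }

  // Return the next token that is not marked SKIP.
  Token nextToken() {
    for (;;) {
      assert(pos_ <= raw_.size());
      if (pos_ == raw_.size()) {
        Token end = { END_OF_INPUT, "" };
        return end;
      }
      const Token& t = raw_[pos_++];
      if (t.type == SKIP) continue;   // drop it and look for the next token
      return t;
    }
  }

private:
  std::vector<Token> raw_;
  std::size_t pos_;
};
```

A caller that pulls tokens from this lexer sees the words on either side of a skipped token as adjacent, which is how the parser can remain unaware of the include directive.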

To solve problem 2, derive a new class, MLexer, from the existing TLexer, then override the uponEOF and nextToken methods, as shown in Listing 6.


Listing 6. The MLexer class
                    
class MLexer : public TLexer
{
public:
  MLexer(std::ifstream& in) : TLexer(in) { }

  // Called by the generated lexer when the current stream is exhausted.
  void uponEOF() {
    if ( !LexerStateStack.empty() ) {
      // End of an included file: restore the state saved by the INCLUDE rule.
      LexerState state = LexerStateStack.top();
      LexerStateStack.pop();
      this->getInputState()->initialize(*state.input, state.filename.c_str());
      this->setLine(state.line);
      this->setColumn(state.column);
      gCurrentInputStream = state.input;
      // Unwind to MLexer::nextToken, which then asks for a token again.
      throw ANTLR_USE_NAMESPACE(antlr)TokenStreamRetryException();
    }
    else {
      ANTLR_USE_NAMESPACE(std)cout << "Hit EOF of main file" <<
        ANTLR_USE_NAMESPACE(std)endl;
    }
  }

  ANTLR_USE_NAMESPACE(antlr)RefToken nextToken() {
    // keep looking for a token until you don't get a retry exception
    for (;;) {
      try {
        return TLexer::nextToken();
      }
      catch (ANTLR_USE_NAMESPACE(antlr)TokenStreamRetryException& /*r*/) {
        // just retry "forever"
      }
    }
  }
};

Note that the uponEOF method uses an exception-handling mechanism to return control to the MLexer::nextToken method. The TLexer::nextToken method doesn't catch TokenStreamRetryException, because the method isn't expected to skip tokens.
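
The retry mechanism can also be exercised without ANTLR. What follows is a rough sketch under stated assumptions: RetryException, BaseLexer, IncludeAwareLexer, and pushStream are hypothetical names standing in for TokenStreamRetryException, TLexer, MLexer, and the INCLUDE rule's stream switch. The toy lexer returns whitespace-separated words and an empty string at the true end of input.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <stack>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for antlr::TokenStreamRetryException.
struct RetryException : std::runtime_error {
  RetryException() : std::runtime_error("retry") { }
};

// Hypothetical base lexer: yields whitespace-separated words, "" at EOF.
class BaseLexer {
public:
  explicit BaseLexer(std::istream& in) : in_(&in) { }
  virtual ~BaseLexer() { }
  virtual std::string nextToken() {
    std::string word;
    if (*in_ >> word) return word;
    uponEOF();        // a derived class may throw RetryException here
    return "";        // genuine end of input
  }
protected:
  virtual void uponEOF() { }
  std::istream* in_;
};

class IncludeAwareLexer : public BaseLexer {
public:
  explicit IncludeAwareLexer(std::istream& in) : BaseLexer(in) { }

  // Switch to an included stream, remembering the current one.
  void pushStream(std::istream& included) {
    assert(in_ != 0);
    saved_.push(in_);
    in_ = &included;
  }

  // Keep asking the base class for a token until no retry is requested.
  std::string nextToken() override {
    for (;;) {
      try {
        return BaseLexer::nextToken();
      }
      catch (const RetryException&) {
        // the previous stream has been restored; ask again
      }
    }
  }

protected:
  void uponEOF() override {
    if (saved_.empty()) return;   // EOF of the main stream
    in_ = saved_.top();           // restore the suspended stream
    saved_.pop();
    throw RetryException();
  }

private:
  std::stack<std::istream*> saved_;
};
```

The exception is used here purely as a control-flow signal: uponEOF cannot return a token itself, so it unwinds to the overriding nextToken, which simply asks again on the restored stream.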


Modify the main routine

You must also modify the main routine. Instead of creating a TLexer object, you now create an object of type MLexer. The rest of the sources and the grammar file remain unchanged. Listing 7 shows the main routine.


Listing 7. The modified main method with the MLexer class
                    
int main(int argc, char** argv)
{
  try {
    std::ifstream inputstream("test.c", std::ifstream::in);
    MLexer* mainLexer = new MLexer(inputstream);
    mainLexer->setFilename("test.c");

    // parser is the same global PParser pointer that the INCLUDE rule uses
    parser = new PParser(*mainLexer);
    parser->setFilename("test.c");
    gCurrentInputStream = &inputstream;
    parser->startRule();
  }
  catch (std::exception& e) {
    std::cerr << "exception: " << e.what() << std::endl;
  }
  return 0;
}

Run the code in Listing 7 inside a good debugger. Observe in particular how the nextToken method works.
