Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

Tips on designing a preprocessor for C++ using Antlr

Arpan Sen (arpansen@gmail.com), Independent author
Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems. You can reach him at arpansen@gmail.com.

Summary:  Learn how to use Antlr to create a C++ preprocessor. Using this approach to create the C++ compiler, you don't need a separate preprocessor engine. Instead, the preprocessor engine can be integrated as part of the lexer.

Date:  20 May 2008
Level:  Intermediate PDF:  A4 and Letter (47KB | 12 pages)Get Adobe® Reader®

Activity:  14380 views
Comments:  

One of the first steps in creating a custom compiler in C++ is developing the compiler’s preprocessor engine. Typically, C++ compilers have a separate preprocessing engine that accomplishes the following four basic tasks:

  • Defines and redefines macros (#define, #undef).
  • Supports header files (the #include directive).
  • Supports conditional compilation (#ifdef, #ifndef, #else, #endif).
  • Strips all comments from the input source listing.

Typically, the output of the preprocessor engine is then fed to the C++ lexer/parser combo. This article discusses the design of a C++ preprocessor using Antlr. You should have some familiarity with Antlr and the concept of top-down recursive descent parsers. All code discussed in this article is known to work on Antlr-2.7.2 and is compiled using gcc-3.4.4.

Create a single parser for the preprocessor

A salient benefit of using Antlr as your parser-generator tool of choice for creating the C++ compiler is that a separate preprocessor engine isn't necessary. The preprocessor engine may be integrated as part of the lexer. To understand this strategy, recall how lexers and parsers typically work. The lexer processes the raw data (in this case, the .h/.cpp files), creates tokens from this data, and passes the tokens to the parser. The parser in turn processes the tokens and validates against the grammar, does semantic checking, and ultimately generates assembly code.

The key to having a single compiler + preprocessor engine lies in proper modification of the lexer. The Antlr lexer is extended to preprocess the C++ sources and only pass the relevant tokens to the parser. The parser is never aware of the preprocessing part; it's only concerned with the tokens it receives from the lexer. Consider the following snippet:

#define USE_STREAMS
	  
#ifdef USE_STREAMS 
#include <iostream>
#else
# include <stdio.h>
#endif 
			

Assume that this snippet is declared as part of a C++ source file. The lexer needs to pass on to the parser the first token it finds by including the iostream header. The #define mapping and #ifdef conditional evaluation are the added responsibilities of the lexer.

Enhance the Antlr lexer by defining new tokens

Lexers that work on C++ language define tokens as per the C++ language standard. For example, you usually find tokens that represent variable names and language keywords defined in an average lexer file.

To combine the preprocessing engine with a lexer, you must define new token types corresponding to the preprocessor constructs. On encountering these constructs, the lexer takes the necessary action. The lexer must also strip down comments from the sources. In this case, on encountering the token for a comment, the lexer needs to ignore the token and continue looking for the next available token. It's important to note that these tokens aren't defined as part of the language standard; they're essentially implementation-defined tokens.

Define new tokens in the lexer

The lexer must contain the following tokens, along with language-mandated ones: COMMENT_TOKEN, INCLUDE_TOKEN, DEFINE_TOKEN, UNDEF_TOKEN, IFDEF_TOKEN, ELSE_TOKEN, and ENDIF_TOKEN. This article discusses a prototype lexer that supports the preprocessor macros mentioned earlier and variable declaration for integer and long data types. The barebones lexer/parser combo that processes these tokens is shown in Listing 1.


Listing 1. A bare-bones lexer/parser that supports preprocessor tokens and long/int declarations
                
header {
#include "Main.hpp"
#include <string>
#include <iostream>
}

options {
            language = Cpp;
}

class cppParser extends Parser;

start: ( declaration )+ ;
declaration: type a:IDENTIFIER   (COMMA b:IDENTIFIER)* SEMI;
type: INT | LONG; 

{
#include <fstream>
#include "cppParser.hpp"
}

class cppLexer extends Lexer;

options { 
            charVocabulary = '\3'..'\377';
            k=5;
}

tokens { 
  INT="int";
  LONG="long";
}
           
DEFINE_TOKEN: "#define"  WS macroText:IDENTIFIER WS  macroArgs:MACRO_TEXT;
UNDEF_TOKEN: "#undef" IDENTIFIER;  
IFDEF_TOKEN: ("ifdef" | "#ifndef") WS IDENTIFIER;
ELSE_TOKEN: ("else" | "elsif" WS IDENTIFIER);            
ENDIF_TOKEN: "endif";
INCLUDE_TOKEN:  "#include" (WS)? f:STRING;
    
COMMENT_TOKEN: 
  (
  "//" (~'\n')* '\n' { newline( ); } 
   |
  "/*" (
       {LA(2) != '/'}? '*' | '\n' { newline( ); } | ~('*'|'\n')
       )*
  "*/"
  );

IDENTIFIER: ('a'..'z'|'A'..'Z'|'_')('a'..'z'|'A'..'Z'|'_'|'0'..'9')* ;
STRING: '"'! ( ~'"' )* '"'!
SEMI: ';';
COMMA: ',';

WS :  ( ' ' | '\t' | '\f' | '\n' {newline();})+;
            
MACRO_TEXT:  ( ~'\n' )*   ;
		  

Define the lexer/parser combo to strip comments

Comments can be written in either the C++ // style or the C style /* */ multiline comment style. The lexer is extended to include a rule for the COMMENT_TOKEN. On encountering this token, the lexer needs to be explicitly told not to pass this token to the parser, but instead to skip it and continue looking for the next available token. You do this using Antlr's $setType function. You don't need to add support at the parser end, because it will never pass the comment token. See Listing 2.


Listing 2. Defining the lexer to strip comments from the C/C++ code
                
COMMENT_TOKEN: 
  (
  "//" (~'\n')* '\n' { newline( ); } 
   |
 "/*" (
           {LA(2) != '/'}? '*' | '\n' { newline( ); } | ~('*'|'\n')
         )*
  "*/"
  )
  { $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP); };
		    

The code is simple enough for C++-style comments. For the C-style comment, note the {LA(2) != '/'}? used before matching *. It means that if there's a / after the *, then this rule isn't to be matched, and the lexer ends up matching */. This is necessary because /* this is */ an invalid comment */ is an invalid C++ comment. This also implies that the minimum look-ahead needed by the lexer is 2. LA(2) is known as a semantic predicate -- a hint to the lexer to decide what character to look for next in the input stream.


Include file processing

As discussed earlier, the parser needs to see a continuous stream of tokens. The lexer therefore needs to switch streams if and when it encounters an included file and switch back to the previous stream on encountering the end of the included file. The parser isn't aware of this stream switch and essentially treats included files as a continuous token stream. You can achieve this behavior several ways using Antlr; for a detailed discussion, see the Resources section.

This article uses Antlr's TokenStreamSelector class. Instead of initializing the parser with the lexer as in Listing 1, the parser is initialized with a TokenStreamSelector object. Internally, the TokenStreamSelector class maintains a stack of lexers. When the lexer encounters a new input stream, it creates a new lexer for processing of the new stream and subsequently attaches the new lexer to the tokenstreamselector object so that all future tokens (until the end of the new stream is reached) are sourced using the new lexer class. This is followed by invoking the retry method of the TokenStreamSelector class, which now sources tokens from the new input stream. The only other thing that remains to be done is to switch back to the previous input stream on encountering the EOF in an included file. To allow for user-defined actions on encountering the end of a line, Antlr provides the predefined routine uponEOF. You must modify this routine to switch to the previous input stream by calling TokenStreamSelector::pop(). Listing 3 shows a code snippet that includes file processing.


Listing 3. Initializing the parser with a TokenStreamSelector object
                
#include <iostream>
#include <fstream>
…
TokenStreamSelector selector;
cppParser* parser;
cppLexer* mainLexer;

int main(int argc,char** argv)
{
   try {
       std::ifstream inputstream("test.c", std::ifstream::in);
       mainLexer = new cppLexer(inputstream);
       // notify selector about starting lexer; name for convenience
       selector.addInputStream(mainLexer, "main");
       selector.select("main"); // start with main lexer

       // Create parser attached to selector
       parser = new cppParser (selector);
       parser->setFilename("test.c");
       parser->startRule();
  } 
  catch (exception& e) {
    cerr << "exception: " << e.what() << endl;
  }
  return 0;
} 
		    

The lexer-related changes are described in Listing 4.


Listing 4. Include file processing in the lexer
                
class cppLexer extends Lexer;
…

{
public:
    void uponEOF()  {
        if ( selector.getCurrentStream() != mainLexer ) {
            selector.pop(); // return to old lexer/stream
            selector.retry();
        }
        else {
             ANTLR_USE_NAMESPACE(std)cout << "Hit EOF of main file\n"  ;
        }
    }
}

INCLUDE_TOKEN:  "#include" (WS)? f:STRING
    {
    ANTLR_USING_NAMESPACE(std)
    // create lexer to handle include
    string name = f->getText();
    ifstream* input = new ifstream(name.c_str());
    if (!*input) {
        cerr << "cannot find file " << name << endl;
    }
    cppLexer* sublexer = new cppLexer (*input);
    // make sure errors are reported in right file
    sublexer->setFilename(name);
    parser->setFilename(name);

    // push the previous lexer stream and make sublexer current
    selector.push(sublexer);
    // ignore this token, re-look for token in the switched stream
    selector.retry(); // throws TokenStreamRetryException
    }
    ;
		  

Note that the TokenStreamSelector object is an intentional global variable because it needs to be shared as part of the lexer uponEOF method. Also, in the uponEOF method post, you must call the pop method retry so the TokenStreamSelector again looks into the current stream for the next token. Note that the include file processing described here isn't complete, but depends on conditional macro support . For example, if a source snippet includes a file within a #ifdef A .. #endif block where A isn't defined, then the processing for INCLUDE_TOKEN is discarded. This is further explained in the Handle macros and Conditional macro support sections.


Handle macros

Macro handling is primarily accomplished with the help of a hash table. This article uses the standard STL hash container. In its simplest form, support of macros implies supporting constructs like #define A and #undef A. You can easily do this by pushing the string "A" into a hash table on encountering #define and removing it from the hash on encountering #undef. The hash table is typically defined as part of the lexer. Also, note that the tokens for #define and #undef must not reach the parser -- for this reason, you add $setType(ANTLR_USE_NAMESPACE(Antlr)Token::SKIP); as part of the corresponding token processing. Listing 5 shows this behavior.


Listing 5. Supporting #define and #undef
                
class cppLexer extends Lexer;
…

{
bool processingMacro;
std::map<std::string, std::string> macroDefns;
public:
    void uponEOF()  {
    … // Code for include processing
    }
}
…
DEFINE_TOKEN: "#define" {processingMacro=true;} 
    WS macroText:IDENTIFIER WS macroArgs:MACRO_TEXT
  {
  macroDefns[macroText->getText()] = string(macroArgs->getText());
 $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
  } ;
                           
UNDEF_TOKEN: "#undef" WS macroText:IDENTIFIER
  {
  macroDefns.erase(macroText->getText());
  $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
  };

MACRO_TEXT: {processingMacro}? ( ~'\n' )* {processingMacro=false;};
		    


Conditional macro support

Broadly speaking, supporting conditional macros means that you need to account for which tokens reach the parser, depending upon the conditional. For example, look at the following snippet:

#define A
#ifdef A
#ifdef B
int c;
#endif
int d;
#endif
		    

On encountering the token for the first #ifdef, you must decide whether subsequent tokens that are encountered should be passed on to the parser, depending on whether A has been previously defined. At this stage, you use the hash table defined earlier.

Next, you must consider support of nested #ifdef blocks. Because you need to check for the path condition on encountering each #ifdef, it's logical to maintain a stack for the path conditions. Each #ifdef/#elsif adds the path condition to the stack head; the matching #endif removes the path condition from the stack.

Before you see the code that explains the concept in greater detail, you need to understand another important method in the Antlr lexer class: nextToken, which the parser calls continuously to retrieve the next token from the input stream. You can't modify this routine directly, so the optimal approach is to derive a class from the Antlr cppLexer class and redefine the nextToken method based on the path condition. If the path condition is true, the routine returns the next available token to the parser; otherwise, it continues until the matching #endif is found in the token stream.

Listing 6 shows the source for the derived lexer class. There should be no direct instantiation of cppLexer in the code; the parser should be initialized with a copy of the derived lexer class object.


Listing 6. Defining a new lexer class
                
class cppAdvancedLexer : public cppLexer 
  {
  public: 
    cppAdvancedLexer(ANTLR_USE_NAMESPACE(std)istream& in) : cppLexer(in) { }
    RefToken nextToken()
      {
      // keep looking for a token until you don't
      // get a retry exception
      for (;;) {
        try {
          RefToken _next = cppLexer::nextToken();
          if (processToken.empty() || processToken.top()) // defined in cppLexer
            return _next;
          }
        catch (TokenStreamRetryException& /*r*/) {
          // just retry "forever"
          }
        }
      }
  };
		  

The lexer-related changes are shown in Listing 7.


Listing 7. Conditional macro support in the lexer
                
{
public:
    std::stack<bool> processToken;
    std::stack<bool> pathProcessed;
    std::map<std::string, std::list<std::string> > macroDefns;
    void uponEOF() {
    … // Code for include processing defined earlier
    }
}

IFDEF_TOKEN {bool negate = false; } :
  ("#ifdef" | "#ifndef" {negate = true;} ) WS macroText:IDENTIFIER
  {
  bool macroAlreadyDefined = false;
  if (macroDefns.find(macroText->getText()) != macroDefns.end())
    macroAlreadyDefined = true;
  processToken.push(negate? !macroAlreadyDefined : macroAlreadyDefined);
  pathProcessed(processToken.top());
  $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
  }
  ;
  
ELSE_TOKEN:  ("#else"
  {
   bool& pathCondition = processToken.top();
   pathCondition = !processToken.top(); // no other path is true
   } 
   | 
   "#elsif" WS macroText:IDENTIFIER
    {
     if (!processToken.top() && !pathProcessed.top())  {
       if (macroDefns.find(macroText->getText()) != macroDefns.end()) {
          processToken.push(true);
          pathProcessed.push(true);
       }
     }
     else {
        bool& condition = pathProcessed.top();
        condition = false;
     }
   )
   {
   $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
   }
   ;

ENDIF_TOKEN: "#endif"
   {
   processToken.pop();
   pathProcessed.pop();
   $setType(ANTLR_USE_NAMESPACE(antlr)Token::SKIP);
   };

Examining the code, you see that processToken is defined as a stack of booleans to store the path conditions. On encountering #ifdef, if the code is already in true path, it tests whether the path condition is valid.


Support expressions as part of macros

You can use a macro to represent a complex expression that involves numbers, identifiers, mathematical operators, and even function calls. To do so, you must use the macroArgs string to replace every occurrence of the macro definition in the post-processed C++ code; the context determines whether this replacement is syntactically valid. For example, consider this situation: #define A b, c. Somewhere down the line you have int A;, which is valid C++; but switch(A) is clearly invalid C++. You need to parse the stream associated with the string on a per-occurrence basis and let the grammar rules determine the syntactic validity of the context.

The code for this approach works as follows:

  1. Capture the token corresponding to the macroText identifier in the cppAdvancedLexer::nextToken.
  2. Retrieve the macroArgs text from the hash table.
  3. Create an istream for macroArgs and a new copy of the lexer that parses this stream.
  4. Attach this lexer to the TokenStreamSelector object, and call retry -- this fetches the token from the new stream that was just created.

The uponEOF method is provided in cppLexer, which takes care of switching context on encountering the end of the stream. Listing 8 illustrates this discussion.


Listing 8. Modifying the nextToken method the lexer class for macro replacement
                
RefToken  cppAdvancedLexer::nextToken()
  {
  // keep looking for a token until you don't
  // get a retry exception
  for (;;) {
    try {
       RefToken _next = cppLexer::nextToken();
       map<string, string>::iterator defI = macroDefns.find(_next->getText());
       if (defI != macroDefns.end() && defI->second != "")
         {
         std::stringstream macroStream;
         cppAdvancedLexer* macrolexer = new cppAdvancedLexer(macroStream);
         macrolexer->setFilename(this->getFilename());
         selector.push(macrolexer);
         selector.retry();
         }
       if (processToken.empty() || processToken.top())
         return _next;
   }
   catch (TokenStreamRetryException& /*r*/) {
          // just retry "forever"
   }
}
		  		  		  


Conclusion

This article discussed the basic methodology you can follow to create a C++ preprocessor using Antlr. Developing a full-fledged C++ preprocessor is beyond the scope of a single article. Although this article doesn't deal with potential errors and error-handling strategies, a real-world preprocessor must account for these factors; for example, the ordering of tokens in conditional macro processing is significant, and you must take steps to ensure that it's handled correctly.


Resources

Learn

Get products and technologies

Discuss

About the author

Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems. You can reach him at arpansen@gmail.com.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your developerWorks profile is displayed to the public, but you may edit the information at any time. Your first name, last name (unless you choose to hide them), and display name will accompany the content that you post.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

Help: Update or add to My dW interests

What's this?

This little timesaver lets you update your My developerWorks profile with just one click! The general subject of this content (AIX and UNIX, Information Management, Lotus, Rational, Tivoli, WebSphere, Java, Linux, Open source, SOA and Web services, Web development, or XML) will be added to the interests section of your profile, if it's not there already. You only need to be logged in to My developerWorks.

And what's the point of adding your interests to your profile? That's how you find other users with the same interests as yours, and see what they're reading and contributing to the community. Your interests also help us recommend relevant developerWorks content to you.

View your My developerWorks profile

Return from help

Help: Remove from My dW interests

What's this?

Removing this interest does not alter your profile, but rather removes this piece of content from a list of all content for which you've indicated interest. In a future enhancement to My developerWorks, you'll be able to see a record of that content.

View your My developerWorks profile

Return from help

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=AIX and UNIX
ArticleID=309447
ArticleTitle=Tips on designing a preprocessor for C++ using Antlr
publish-date=05202008
author1-email=arpansen@gmail.com
author1-email-cc=mmccrary@us.ibm.com

Tags

Help
Use the search field to find all types of content in My developerWorks with that tag.

Use the slider bar to see more or fewer tags.

For articles in technology zones (such as Java technology, Linux, Open source, XML), Popular tags shows the top tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), Popular tags shows the top tags for just that product zone.

For articles in technology zones (such as Java technology, Linux, Open source, XML), My tags shows your tags for all technology zones. For articles in product zones (such as Info Mgmt, Rational, WebSphere), My tags shows your tags for just that product zone.

Use the search field to find all types of content in My developerWorks with that tag. Popular tags shows the top tags for this particular content zone (for example, Java technology, Linux, WebSphere). My tags shows your tags for this particular content zone (for example, Java technology, Linux, WebSphere).

Try IBM PureSystems. No charge.

Special offers