Building a CDT-based editor, Part 3: Basic CDT parsing

Understanding the CDT parser and its abstract syntax trees

This article, third in a five-part "Building a CDT-based editor" series, introduces the parsing process used by the Eclipse C/C++ Development Tooling (CDT). Parsing is one of the CDT's most crucial functions, but because of its complexity, parsing is also one of its least-understood aspects. Many have asked if they can simply extract it for their own project, but here, we'll go further, explaining how the classes function and how this functionality fits in with the CDT as a whole.

Share:

Matthew Scarpino, Java Developer, Eclipse Engineering, LLC

Matthew Scarpino is a project manager and Java developer at Eclipse Engineering LLC. He is the lead author of SWT/JFace in Action and made a minor but important contribution to the Standard Widget Toolkit (SWT). He enjoys Irish folk music, marathon running, the poetry of William Blake, and the Graphical Editing Framework (GEF).



10 October 2006

Also available in Japanese Vietnamese

In Part 2 of this series, I explain how the CDT editor updates its text presentation by responding to each keystroke. But it does much more than simply display keywords in a specific color and font: It also analyzes the structure of the code and keeps track of every function, statement, and variable.

This analysis, called parsing, is a vast subject that continues to provide computer scientists with new avenues of research. I'll briefly explain some of the theory behind parsing, but I'll focus on the mechanics of the CDT's operation. My goal is to provide enough information so that, if you'd like to improve or modify the CDT parser, you know which classes and methods to alter, and how to alter them.

I've added these classes to the Bare Bones C/C++ Development Tool (BBCDT) so you can see how parsing is performed by example. Most of the new classes deal with the CDT parsing process, so you'll find them in the org.bbcdt.dworks.core plug-in (specifically, the org.bbcdt.dworks.core.parser package and its subpackages).

It's important to keep in mind that the CDT actually has two parsers -- one that uses a Persisted Document Object Model (PDOM) and one that doesn't. Both are important for the current (V3.1) CDT, but according to Doug Schaefer, the PDOM parser will gradually take over the responsibilities of the second. I describe the PDOM parser in Part 4, but because the second parser is easier to understand, I'll cover it here. Specifically, I'll discuss what happens in the CDT from the time the Document receives an update to the creation of a new Abstract Syntax Tree (AST). This process can be best explained in four parts:

The Reconciler and its ReconcilingStrategy
How the parser receives events from the Document
Building the parser
How the parser is created and initialized
The parsing process
How the parser analyzes the structure of WorkingCopy text
The AST
The parser's model of the source code and how you can access it

The Reconciler and its ReconcilingStrategy

Part 2 describes how the PresentationReconciler begins the process of styling text by responding to changes to the edited document. Similarly, CDT parsing starts with a Reconciler object, which listens for the same type of events. It's important not to get these two classes confused. The PresentationReconciler class updates the text's appearance with every keystroke, and the Reconciler class runs in a daemon thread, parsing the document without holding up the user interface (UI). This is why, when you use Eclipse, syntax coloring works so much faster than error detection.

The Reconciler's thread starts when its listener detects a new Document. When the user updates the Document, the thread tells its CReconcilingStrategy to start reconciling. The CDT Reconciler is a MonoReconciler, which means that it can have only one strategy. Also, the Reconciler is not incremental, which means that the strategy operates on the entire Document whenever a change occurs.

The CReconcilingStrategy accesses the WorkingCopy and tells it to organize its children within a WorkingCopyInfo object, which functions like the other info objects in the CDT Model. But the WorkingCopy doesn't have any children, yet -- it's just a buffer of unsaved, unstructured text. To create order out of this chaos, the WorkingCopy calls its parse() method. This call doesn't start the parsing just yet, but it begins the process of creating and initializing the parser.


Creating the parser

The WorkingCopy starts the parser creation by constructing a CModelBuilder object to manage the parsing process. This object determines the language being parsed from the project's nature and builds a character buffer from the WorkingCopy's text. It also creates a series of objects (IProblemRequestor, IScannerInfoProvider, CodeReader, SourceElementRequestor, and ParserLogService) that it uses to construct a Scanner2 object by calling the ParserFactory.createScanner() method. This method analyzes the characters in the buffer and provides tokens to the parser.

Using the parser outside the CDT

The process of creating the parser may seem involved, but there's an important point to notice. The only relationship between the parser and the CDT Model is the buffer of the WorkingCopy's characters. Therefore, you can access the CDT's parser without all the baggage in the CDT Model. But because the Document Object Model (DOM)-based parser interacts with the CDT Model, you can't just add the org.eclipse.cdt.parser plug-in to your project. Instead, extract the Parser and the classes it depends on, and then call the ParserFactory.createScanner() and ParserFactory.createParser() methods from within your code.

After the scanner is created, the CModelBuilder constructs the Parser by calling the ParserFactory.createParser() method. This method tells the ProblemRequestor to start updating the editor's annotations when errors occur. Then, it calls the Parser's parse() method.

This is where the fun starts. But before I go further, let me make a quick detour into the theory of parsing.

Intermission: A brief introduction to parsing

Consider the input sentence, "Matt likes pizza." You might know the individual words, but if you aren't familiar with the subject-verb-object sentence structure, you won't understand its meaning. The sentence makes sense only if you can match the elements of the input to those in the abstract structure ("Matt" = subject, "likes" = verb, "pizza" = object). Of course, there are many other sentence structures and many other languages. The body of rules that specify how proper sentences are constructed in a given language is called the language's grammar.

With a programming language, the grammar contains an abstract model that all well-formed units of code (that is, source files) must match. The first step in performing this match is called scanning, or lexical analysis. A scanner reads in the individual characters of a buffer and returns symbols (called tokens) that the parser will understand -- such as keywords, operators, and language-specific punctuation. For example, when a C scanner reads in the characters c, o, n, s, and t, it returns a single token object representing the keyword const. The parser doesn't care about whitespace or comments, so the scanner disregards them.

The second step of the match process is parsing. Here, the parser tests combinations of tokens to see how they compare to the abstract elements in its grammar. Generally, this is much more complex than the subject-verb-object example above. For example, if a statement starts with a token corresponding to "enum" (declaring an enumeration), the CDT parser will try to match the statement to its abstract model of an enum statement. That is, after the ENUM token, it will look for an optional name, then an enumerated_list between L_BRACE and R_BRACE tokens, and a final enumerated_list. Each italicized term refers to another abstract element in the grammar.

Many parsers, including the CDT parser, do more than simply check for well-formedness. After they've parsed the input, they store its structural information in a form suitable for analysis, indexing, and searching. This form is usually that of a tree, called an AST. For example, if you tokenized a developerWorks article and sent it through an English-language parser, it might produce a tree structure like that shown in Figure 1.

Figure 1. Example AST
Example AST

As you can see, the elements in the tree are general at the top (developerWorks Article) and become more particular with each downward node. If I had the space, I'd show how the nodes at the bottom contain sentences and finally specific words. For a programming language, these terminal nodes contain individual keywords, operators, and variables. It's much easier to find a function or variable name by searching an AST than it is to re-parse the entire document. Now, I show how the CDT parser generates the AST and how you can traverse it with your own code.


The parsing process

I haven't found any grammar files for the CDT, but we can determine its abstract model elements from the methods in the Parser class and from the comments that precede these methods. This class is located in the org.dworks.bbcdt.internal.core.parser package. As mentioned, the first parsing method is translationUnit() because the TranslationUnit element represents the entire source file, just as the Article element in my example represents an entire article.

A cryptic statement in the comment precedes the translationUnit() method:

translationUnit : (declaration)*

If you're familiar with the Extended Backus Naur Form (EBNF), you'll know that this rule means that a TranslationUnit element consists of any number of Declaration elements. These comments don't provide complete CDT grammar, but if you're trying to use or modify the Parser class, you'll them helpful.

Because the Declaration element is just beneath TranslationUnit, the translationUnit() method calls the declaration() method. The EBNF rule tells us that a declaration can take one of six forms:

Assembly statement
The asm token followed by an ASMDefinition element
Namespace statement
The namespace token followed by an NamespaceDefinition element
Using statement
The using token followed by an UsingDeclaration element
Template statement
The template or export token followed by an TemplateDeclaration element
Linkage statement
The extern token followed by an LinkageSpecification element
Simple statement
A SimpleDeclaration element

To determine the type of a given declaration, the declaration() method uses a switch statement whose execution depends on the type of the scanner's next token. Each token implements IToken and stores its type (an int between 1 and 141), the file being scanned, and its offset and length in the scanned Document.

If the declaration doesn't fall into one of the first five categories, the simpleDeclaration() method is called, along with a guess as to the nature of the declaration. The first guess is that the declaration is a constructor, but if the method throws a BackTrackException, the next guess is that it's a function declaration. If that fails, the method is called a third time, this time guessing it's a variable declaration. If this throws another exception, the method returns null.

The parser continues to call methods to match code with model elements, but the depth of analysis depends on its mode. The five parsing modes are:

QUICK_PARSE
Does not parse inside functions or included files
STRUCTURAL_PARSE
Does not parse inside functions but parses included files
COMPLETE_PARSE
Parses inside functions and included files
COMPLETION_PARSE
Parses inside functions and included files, stops at offsets, and optimizes symbol query lookups
SELECTION_PARSE
Parses inside functions and included files, stops at offsets, and provides semantic information about a selected range

The mode also determines what callback object the Parser uses to store information during the parse. In QUICK_PARSE mode, a QuickParseCallback keeps track of macros, functions, declarations, and any errors that occur. In any other mode, a StructuralParseCallback stores such additional information as the Parser's current scope, variables, namespace statements, enum declarations, and more. But because additional analysis requires extra time, QUICK_PARSE is the Parser's default mode.

No matter which callback object you use, the most important information it carries is an ASTCompilationUnit. I describe the significance of this next.


The AST

The first parsing method, translationUnit(), begins its operation by creating a ASTCompilationUnit object. This object is not only the highest node in the AST but also the container of all the nodes beneath it. When the Parser recognizes a declaration in the source, it adds a subclass of ASTDeclaration to the ASTCompilationUnit's List of declarations. If the first declaration in a source file is a typedef, the first element in the ASTCompilationUnit's List will be an ASTTypedefDeclaration.

Similarly, each ASTDeclaration object stores AST objects that form the declaration. As an example, a function declaration, represented by an ASTFunction, contains AST objects representing its return type, parameters, and any functions contained within it. When the Parser finishes its analysis, the ASTCompilationUnit will contain a complete AST representing the file's structure. At that point, finding a particular element in the file is as simple as a tree traversal algorithm.


Updating the BBCDT

If you open the BBCDT project in Eclipse, you'll see there's a great deal more code, particularly in the org.dworks.bbcdt.core plug-in project. Many of these new classes are necessary for setup and performing parsing. But the majority of them represent elements of a C/C++ source file or a C/C++ AST.

The org.dworks.bbcdt.ui plug-in also has new classes. Most of them deal with the reconciling process, but an ASTAction class in the org.dworks.bbcdt.ui.action package implements the IEditorActionDelegate interface. It appears when the BBCDT editor is visible. When you click its toolbar contribution, it accesses the IASTCompilationUnit and iterates through its declaration objects. If any of them declare functions or variables or begin with namespace, using, or extern, the ASTAction displays a window that lists the type of each declaration with specific information for that declaration type. Listing 1 shows the code that makes this possible.

Listing 1. Accessing declarations in the CDT AST
IASTCompilationUnit unit = CCorePlugin.getCompilationUnit();
Iterator iter = unit.getDeclarations();
    while(iter.hasNext()) {
        IASTDeclaration decl = (IASTDeclaration)iter.next();

        if (decl instanceof IASTFunction)
            output += "Function declaration: " + 
                ((IASTFunction)decl).getName();
					
        else if (decl instanceof IASTLinkageSpecification)
            output += "Linkage declaration: " + 
                ((IASTLinkageSpecification)decl).getLinkageString();
					
        else if (decl instanceof IASTNamespaceDefinition)
            output += "Namespace definition: " + 
                ((IASTNamespaceDefinition)decl).getName();
					
        else if (decl instanceof IASTUsingDeclaration)
            output += "Using declaration: " + 
                ((IASTUsingDeclaration)decl).getName();
					
        else if (decl instanceof IASTVariable)
            output += "Variable declaration: " + 
                ((IASTVariable)decl).getName();

You won't find this code anywhere in the CDT; I cobbled it together to make it easy to access and traverse the IASTCompilationUnit. By altering the ASTAction, you can search through the nodes and create the entire tree. Figure 2 shows the BBCDT's output after you click the ASTAction button.

Figure 2. Output of the BBCDT AST analysis
Output of the BBCDT AST analysis

Conclusion

This article explains how the CDT Parser responds to changes in the editor's text, then creates an AST with the C/C++ elements. It also summarizes the general theory of parsing. But there's much, much more to parsing; see Resources for an excellent online parsing textbook.

The Parser I've described serves its purpose, but it has drawbacks, as well. It's slow, and its analysis is limited to one source file at a time. As shown in Part 4, the CDT already incorporates another parser that is more powerful and more complex.


Downloads

DescriptionNameSize
Sample plug-in filesos-ecl-cdt3-BBCDT_Plugins.zip955KB
Sample plug-in filesos-ecl-cdt3-BBCDT_Projects.zip2MB

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Open source on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Open source
ArticleID=165242
ArticleTitle=Building a CDT-based editor, Part 3: Basic CDT parsing
publish-date=10102006