Why learn about another parser?
External linkage is one of the many features of C/C++ that make it difficult to analyze code. No matter where you declare a variable or function, the extern keyword allows other files to use it without knowing where it was defined. When it comes to displaying this interdependence, Eclipse's Outline view, shown on the left of Figure 1, isn't helpful. It only shows elements of single source files. On the right, however, the Index view presents each element of every file in every CDT project. If you're trying to make sense of complex, interconnected projects, this second approach is vital.
Figure 1. Outline and Index views
You could build an Index view using the parser described in Part 3, but it would have to parse every source file in the CDT workspace whenever any code is modified. That isn't practical.
To make capabilities like the Index view possible, CDT developers created a new parsing process that stores its results in memory-mapped files. This persistence enables the parser to access stored elements in a project while only analyzing files that have recently changed. In addition to providing the Index view, this improved parsing also makes possible features such as code completion and content assist.
The new parsing process doesn't make the discussion in Part 3 irrelevant. The new parsers function essentially the same as the old Parser. They contain similar methods and create similar abstract syntax trees (ASTs). However, much of the new process has changed. I'll summarize its operation in four steps:
- Starting the indexer job
- Parsing a TranslationUnit
- Updating the PDOM
- PDOM Structure and Storage
So you can see how these concepts are implemented in code, I updated the Bare Bones C/C++ Development Tooling (BBCDT) to create an index of the project's elements and display specific node types. To activate it, I updated the previous ASTAction to visit elements in the AST and the PDOM's storage. (See Download for the code.)
Step 1: Starting the indexer job
When the core plug-in starts, it creates a PDOMManager to oversee the new parsing and persistence. The manager does little on its own, but it responds when new CDT elements -- particularly CProjects and TranslationUnits -- appear in the workspace. When a new CProject is created, the PDOMManager creates an indexer to find and store the elements in the project's source files.
The indexer implements IPDOMIndexer, but its class depends on the user's CDT preferences. Figure 2 shows the three indexing options with No Indexer selected by default. When you select No Indexer, the PDOMManager creates a NullPDOMIndexer that does nothing. The PDOMManager creates a FastPDOMIndexer if you select Fast C/C++ Indexer and a FullPDOMIndexer if you select Full C/C++ Indexer. After the indexer is created, the manager tells it to start building the index.
Figure 2. Indexer preferences
The goal of indexing is to record all the named elements in the project's source code and store them with their related data. This is a complicated process and can run in the background, so it's performed as an Eclipse Job. Specifically, the indexer creates an indexing Job with a priority of Job.LONG and adds it to the platform's queue.
If the project has just been created, the Job doesn't accomplish much because there's no code to index. Nevertheless, it does build a very important object for the project called a PDOM. This provides the routines that store the indexer's data within the PDOM.
Unlike the Parser in Part 3, the indexer doesn't begin immediately or automatically. Instead, the PDOMManager waits for changes to the CProject, such as a new, modified, or deleted source file. When it detects a change, it tells the indexer to start analyzing, and this time, the indexer Job works in earnest. In particular, it finds the altered TranslationUnit and the ILanguage that represents the language of the unit's source code. Then it starts the parsing process by calling ILanguage.getASTTranslationUnit().
Step 2: Parsing the TranslationUnit
The Parser from Part 3 makes no distinction between programming languages. It looks for C and C++ elements and uses them to create a generic AST. The new process works differently. Each TranslationUnit has an ILanguage object that specifies, among other things, how the unit should be parsed. If you'd like to add a custom language to the CDT, creating your own ILanguage is the place to start. Thankfully, the CDT provides two by default: GCCLanguage and GPPLanguage.
Similarly, the CDT provides two default parsers that implement ISourceCodeParser: GNUCSourceParser and GNUCPPSourceParser. The parsing process itself is the same as that described in Part 3. The code reader creates a character buffer from the source file, the scanner combines characters into ITokens, and the parser combines ITokens into IASTNodes. Even the method names (translationUnit(), declaration(), statement(), and others) are the same. The following are the main differences between the new parsing process and the old:
- The new parsers are language specific -- nodes extend
CASTNodeorCPPASTNode. - The new parsers always run in
COMPLETE_PARSEmode, so they always parse inside functions. - The new parsers store additional parsing information (problems, node locations) in an
ILocationResolverinstead of a callback object. - The new parsers only determine an element's scope when it is requested.
- The top node in a new AST is an
IASTTranslationUnit, whose structure can be traversed byASTVisitors.
Remember: The old Parser returned an IASTCompilationUnit. The PDOM parsers return IASTTranslationUnits, or, more specifically, CASTTranslationUnits and CPPASTTranslationUnits.
The fourth difference listed above is important. The parser can't determine how an element is used unless it knows the element's scope. The old Parser creates a top-level scope when it starts the parse with translationUnit(). Then it passes the scope to further methods, such as declaration(), that create and pass along subscopes as necessary. When the old Parser finishes, it has the scope of every element in its AST. The new parser creates a single scope and doesn't change it during the course of its operation. This is because the indexer is only interested in a specific set of nodes. To explain this and other interesting features, I need to describe what the indexer does with the parser's AST and how it interacts with the PDOM.
After the indexer receives the IASTTranslationUnit, it gets a write lock on the PDOM and begins persisting the TranslationUnit's data. First, it calls IASTTranslationUnit.getIncludeDirectives() and IASTTranslationUnit.getMacroDefinitions(). Then, it stores a reference to the unit's file in the PDOM, along with references to the included directives and macro definitions. These references take the form of PDOMFiles that can be stored with PDOM.addFile().
To search through, or visit, the AST, the indexer creates a new ASTVisitor and calls IASTTranslationUnit.accept(ASTVisitor). Each of the IASTNodes in the AST has an accept() method, and each works in much the same way:
- It checks one of the
ASTVisitor's Boolean fields to determine if the visitor is interested in visiting this node. - If the visitor is interested, the node calls the visitor's
visit()method, which returns one of three values: PROCESS_SKIP-- The visitor isn't interested in the node's children, soaccept()returns true.PROCESS_ABORT-- The visitor isn't interested in visiting at all, soaccept()returns false.PROCESS_CONTINUE-- The visitor wants to visit the node's children, soaccept()continues to the next step.- If the Boolean field is false or
visit()returnsPROCESS_CONTINUE,accept()calls theaccept()method of each of its children and passes the visitor as its argument.
For example, if a C function declaration has a declaration specifier, such as typedef or static, its corresponding CASTFunctionDefinition.accept() method checks the visitor's shouldVisitDeclarations field. If the field is true, it calls the visitor's visit() method. If the field is false or visit() returns PROCESS_CONTINUE, the function node accesses its child and calls CASTSimpleDeclSpecifier.accept(). When the visitor reaches one of the final, or terminal, nodes, accept() simply returns true.
Table 1 shows the Boolean fields that control which types of nodes are allowed to call the visitor's visit() method. By default, all of them are set to false, so visit() is never called.
Table 1. The ASTVisitor fields that control node visitation
| shouldVisitNames | shouldVisitDeclarations | shouldVisitInitializers |
| shouldVisitDeclarators | shouldVisitDeclSpecifiers | shouldVisitExpressions |
| shouldVisitTypeIds | shouldVisitEnumerators | shouldVisitTranslationUnit |
| shouldVisitParameterDeclarations | shouldVisitStatements | shouldVisitProblems |
The best way to understand how ASTVisitors work is to look at an example. Listing 1 shows the visitor used by the PDOMFullIndexerJob to find IASTName nodes. To visit these nodes, it sets shouldVisitNames and shouldVisitDeclarations to true.
Listing 1. The PDOMFullIndexerJob's ASTVisitor
ast.accept(new ASTVisitor() {
// Set the Boolean fields so that name and declaration nodes call visit()
{
shouldVisitNames = true;
shouldVisitDeclarations = true;
}
// Perform the visit
public int visit(IASTName name) {
try {
IASTFileLocation fileloc = name.getFileLocation();
if (fileloc != null) {
PDOMFile file = pdom.addFile(fileloc.getFileName());
linkage.addName(name, file);
}
return PROCESS_CONTINUE;
} catch (Throwable e) {
CCorePlugin.log(e);
return ++errorCount > MAX_ERRORS ? PROCESS_ABORT : PROCESS_CONTINUE;
}
};
});
|
As shown, the indexer's visitor is only interested in IASTNames, which are the only nodes stored in the PDOM. These represent specific identifiers, such as the names of functions, variables, and enumerated types. When the visitor reaches one of these nodes, it accesses a PDOMFile object representing the name's source file. Then it uses the PDOM's linkage object (PDOMLinkage) to add the IASTName to the index.
Before the IASTName can be stored, the linkage needs to know its type, or binding. For instance, whether it's the name of a variable, function, etc. To find out, the linkage finds the node's parent and its scope. With this scope, the linkage figures outs how the name is used and encapsulates that information in an IBinding. This is why the new parsing process makes scope determinations after the parse -- it's only interested in the scopes of IASTNames.
To clarify names and bindings, I debugged the CDT's parsing of the Hello World program and provided the AST breakdown shown in Figure 3. The AST contains a single IASTDeclarationUnit that represents the main() function. The CASTFunctionDeclarator holds a CASTName whose IBinding is a CFunction.
Figure 3. Hello World AST
The linkage adapts the IBinding into a persistible PDOMBinding and the IASTName into a PDOMName. The constructors for these new objects access the PDOM to store their information. To show how this works, I need to back up and explain how the PDOM operates.
If you've created any CDT projects, I recommend you open your workspace directory and the .metadata/plugins/org.eclipse.cdt.core directory underneath it. There, you'll see a *.pdom file for each of your projects. When the PDOM is created for a project, it constructs a Database object to manage its low-level storage. The Database builds a new RandomAccessFile and gives it the project name and the time in milliseconds -- these are the *.pdom files you see.
This Database has more in common with regular databases than just the name. It divides memory into variable-length records composed of 16-byte blocks. Each record represents a different PDOMNode, such as a PDOMLinkage or PDOMBinding. To enable memory access, it provides put and set methods for ints, chars, and bytes.
To make reading and writing to the file as efficient as possible, the Database uses an array of Chunks that manage MappedByteBuffers on the *.pdom file. When the Database creates this file, it sets its size to 16 KB and allocates the first Chunk in the array to buffer. The Database creates further 16 KB Chunks as the index requires more memory.
The first object added to the database is the PDOMLinkage. New linkages are added as new languages are used in the project. When its constructor is called, the PDOMLinkage:
- Sets its initial record size at 24 bytes.
- Stores a 0 to represent the node's type (linkage).
- Stores a 0 to show that it has no parent node.
- Stores the size of memory needed to store the linkage's name.
- Allocates the memory and stores the linkage's name.
- Stores the size of memory needed to store the linkage's ID.
- Allocates the memory and stores the linkage's ID.
- Stores its record size at a specific address, if this is the first project linkage. If it isn't the first project linkage, it finds the next available address and stores the size there.
The memory allocation is performed by Database.malloc(int size), which starts by determining the minimum number of blocks needed to provide the requested memory. If there isn't enough memory available, it creates a new Chunk. Otherwise, it accesses the Chunk that can provide the memory, clears the required number of blocks, and returns the location of the first allocated block.
The storage process is similar for other objects, such as when the linkage creates a new PDOMBinding as part of storing a named node. If the binding isn't contained within the scope of another binding, the linkage sets itself as the binding's parent. Then, the PDOMBinding constructor sets its initial record size at 24 bytes and performs most of the same steps as those listed above.
The PDOMName works differently. There are three ways a name can be associated with its binding: as a definition, declaration, or reference. When the PDOMName determines which role it plays, it updates the PDOMBinding's record. Then it stores the location of the PDOMFile whose file contains it, along with its location and length. Finally, it populates the file's index by adding its block location to the PDOMFile.
Because of the parent/child relationship of PDOMNodes, the index forms a tree similar to the parser's AST. The first time PDOMLinkage.getIndex() is called, it creates a BTree using the PDOM's Database and the root block location. You can search through this tree with an IPDOMVisitor. This functions like the ASTVisitor with three important differences:
- It has no Boolean fields -- all nodes are visited.
- It uses only a single
visit(IPDOMNode node)method, which checks if the visited node is of a certain type. - It contains a
leave(IPDOMNode node)to stop search when a particular node type is reached.
You can also search through the PDOM using the static methods provided in the DOMSearchUtil class.
I've added the PDOM classes, and considering the size of the application, Bare Bones just doesn't seem appropriate anymore. I've set the PDOMManager to create only PDOMFullIndexers for project indexing, so there are no preferences or descriptors involved. Also, I changed the ASTAction to perform two tasks: search through the IASTTranslationUnit with a BBASTVisitor and search through the PDOM with a BBPDOMVisitor. These classes are in the org.dworks.bbcdt.ui.action package. The result of clicking ASTAction is shown in Figure 4.
Figure 4. Contents of the BBCDT index
The BBASTVisitor searches through the AST finding IASTNames and IASTForStatements. When it discovers a loop, it displays three expressions: the initializer, condition, and iteration. The BBPDOMVisitor finds functions in the BBPDOMVisitor and displays the number present.
Regarding parsing, Java NIO, and pseudo-database access, there's no question that indexing is a complicated process. However, this complexity is needed to provide efficient, all-embracing code analysis and storage.
Now that I've explained the PDOM, it's time to see how the CDT applies it. In Part 5, I discuss code completion and content assist, including how they work and how they access the PDOM index.
| Description | Name | Size | Download method |
|---|---|---|---|
| BBCDT code for Part 4 | os-ecl-cdt4.zip | 512KB | HTTP |
Information about download methods
Learn
-
Check out Eclipse C/C++ Development Tooling (CDT) at Eclipse.org.
-
Read Parsing Techniques -- A Practical Guide, by Dick Grune and Ceriel J.H. Jacobs online, courtesy of The University of Amsterdam.
-
Read Doug Schaefer's blog, the leader of the CDT development team.
-
Learn more about the Eclipse Foundation and its many projects.
-
For an excellent introduction to the Eclipse platform, read "Getting started with the Eclipse Platform."
-
Expand your Eclipse skills by checking out IBM developerWorks' Eclipse project resources.
-
Browse all of the Eclipse content on developerWorks.
-
Visit the developerWorks Open source zone for extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM's products.
-
Stay current with developerWorks technical events and webcasts.
-
Visit Safari Books Online for a wealth of resources for open source technologies.
-
developerWorks offers interesting podcast interviews and discussions for software developers.
Get products and technologies
-
Download the Eclipse CDT.
-
Check out the latest Eclipse technology downloads at IBM alphaWorks.
-
Innovate your next open source development project with
IBM trial software, available for download or on DVD.
Discuss
-
The Eclipse newsgroups offer many resources for people interested in using and extending Eclipse.
-
Get involved in the developerWorks community by participating in developerWorks blogs.
Matthew Scarpino is a project manager and Java developer at Eclipse Engineering LLC. He is the lead author of SWT/JFace in Action and made a minor but important contribution to the Standard Widget Toolkit (SWT). He enjoys Irish folk music, marathon running, the poetry of William Blake, and the Graphical Editing Framework (GEF).




