Building a CDT-based editor, Part 4: Advanced CDT parsing and the Persisted Document Object Model

Understanding CDT indexing and the PDOM

This article, the fourth in a five-part "Building a CDT-based editor" series, introduces the second and more sophisticated of the parsers used by Eclipse's C/C++ Development Tooling (CDT). This new process structures its information in a Persisted Document Object Model (PDOM) and enables indexing, code completion, and content assist. If you intend to improve or extend the CDT for your own custom tool, understanding the PDOM and the new parsing is essential.

Share:

Matthew Scarpino, Java Developer, Eclipse Engineering, LLC

Matthew Scarpino is a project manager and Java developer at Eclipse Engineering LLC. He is the lead author of SWT/JFace in Action and made a minor but important contribution to the Standard Widget Toolkit (SWT). He enjoys Irish folk music, marathon running, the poetry of William Blake, and the Graphical Editing Framework (GEF).


developerWorks Contributing author
        level

24 October 2006

Also available in Japanese Vietnamese

Why learn about another parser?

External linkage is one of the many features of C/C++ that make it difficult to analyze code. No matter where you declare a variable or function, the extern keyword allows other files to use it without knowing where it was defined. When it comes to displaying this interdependence, Eclipse's Outline view, shown on the left of Figure 1, isn't helpful. It only shows elements of single source files. On the right, however, the Index view presents each element of every file in every CDT project. If you're trying to make sense of complex, interconnected projects, this second approach is vital.

Figure 1. Outline and Index views
Outline and Index views

You could build an Index view using the parser described in Part 3, but it would have to parse every source file in the CDT workspace whenever any code is modified. That isn't practical.

To make capabilities like the Index view possible, CDT developers created a new parsing process that stores its results in memory-mapped files. This persistence enables the parser to access stored elements in a project while only analyzing files that have recently changed. In addition to providing the Index view, this improved parsing also makes possible features such as code completion and content assist.

How the PDOM parser works

The new parsing process doesn't make the discussion in Part 3 irrelevant. The new parsers function essentially the same as the old Parser. They contain similar methods and create similar abstract syntax trees (ASTs). However, much of the new process has changed. I'll summarize its operation in four steps:

  1. Starting the indexer job
  2. Parsing a TranslationUnit
  3. Updating the PDOM
  4. PDOM Structure and Storage

So you can see how these concepts are implemented in code, I updated the Bare Bones C/C++ Development Tooling (BBCDT) to create an index of the project's elements and display specific node types. To activate it, I updated the previous ASTAction to visit elements in the AST and the PDOM's storage. (See Download for the code.)

Step 1: Starting the indexer job

When the core plug-in starts, it creates a PDOMManager to oversee the new parsing and persistence. The manager does little on its own, but it responds when new CDT elements -- particularly CProjects and TranslationUnits -- appear in the workspace. When a new CProject is created, the PDOMManager creates an indexer to find and store the elements in the project's source files.

The indexer implements IPDOMIndexer, but its class depends on the user's CDT preferences. Figure 2 shows the three indexing options with No Indexer selected by default. When you select No Indexer, the PDOMManager creates a NullPDOMIndexer that does nothing. The PDOMManager creates a FastPDOMIndexer if you select Fast C/C++ Indexer and a FullPDOMIndexer if you select Full C/C++ Indexer. After the indexer is created, the manager tells it to start building the index.

Figure 2. Indexer preferences
Indexer preferences

The goal of indexing is to record all the named elements in the project's source code and store them with their related data. This is a complicated process and can run in the background, so it's performed as an Eclipse Job. Specifically, the indexer creates an indexing Job with a priority of Job.LONG and adds it to the platform's queue.

If the project has just been created, the Job doesn't accomplish much because there's no code to index. Nevertheless, it does build a very important object for the project called a PDOM. This provides the routines that store the indexer's data within the PDOM.

The Protean PDOM

As I write this in October 2006, the PDOM and its parsing are not finalized. In fact, many bewildered respondents on the CDT newsgroup will tell you it's not even stable. This is the future of CDT code analysis, however, and it's important to understand the concepts even if the details change. It plays an important role in enabling the CDT's features (content assist, code completion, indexing, etc.) and this role will only expand as development continues.

Unlike the Parser in Part 3, the indexer doesn't begin immediately or automatically. Instead, the PDOMManager waits for changes to the CProject, such as a new, modified, or deleted source file. When it detects a change, it tells the indexer to start analyzing, and this time, the indexer Job works in earnest. In particular, it finds the altered TranslationUnit and the ILanguage that represents the language of the unit's source code. Then it starts the parsing process by calling ILanguage.getASTTranslationUnit().

Step 2: Parsing the TranslationUnit

The Parser from Part 3 makes no distinction between programming languages. It looks for C and C++ elements and uses them to create a generic AST. The new process works differently. Each TranslationUnit has an ILanguage object that specifies, among other things, how the unit should be parsed. If you'd like to add a custom language to the CDT, creating your own ILanguage is the place to start. Thankfully, the CDT provides two by default: GCCLanguage and GPPLanguage.

Similarly, the CDT provides two default parsers that implement ISourceCodeParser: GNUCSourceParser and GNUCPPSourceParser. The parsing process itself is the same as that described in Part 3. The code reader creates a character buffer from the source file, the scanner combines characters into ITokens, and the parser combines ITokens into IASTNodes. Even the method names (translationUnit(), declaration(), statement(), and others) are the same. The following are the main differences between the new parsing process and the old:

  • The new parsers are language specific -- nodes extend CASTNode or CPPASTNode.
  • The new parsers always run in COMPLETE_PARSE mode, so they always parse inside functions.
  • The new parsers store additional parsing information (problems, node locations) in an ILocationResolver instead of a callback object.
  • The new parsers only determine an element's scope when it is requested.
  • The top node in a new AST is an IASTTranslationUnit, whose structure can be traversed by ASTVisitors.

Remember: The old Parser returned an IASTCompilationUnit. The PDOM parsers return IASTTranslationUnits, or, more specifically, CASTTranslationUnits and CPPASTTranslationUnits.

The fourth difference listed above is important. The parser can't determine how an element is used unless it knows the element's scope. The old Parser creates a top-level scope when it starts the parse with translationUnit(). Then it passes the scope to further methods, such as declaration(), that create and pass along subscopes as necessary. When the old Parser finishes, it has the scope of every element in its AST. The new parser creates a single scope and doesn't change it during the course of its operation. This is because the indexer is only interested in a specific set of nodes. To explain this and other interesting features, I need to describe what the indexer does with the parser's AST and how it interacts with the PDOM.

Step 3: Updating the PDOM

After the indexer receives the IASTTranslationUnit, it gets a write lock on the PDOM and begins persisting the TranslationUnit's data. First, it calls IASTTranslationUnit.getIncludeDirectives() and IASTTranslationUnit.getMacroDefinitions(). Then, it stores a reference to the unit's file in the PDOM, along with references to the included directives and macro definitions. These references take the form of PDOMFiles that can be stored with PDOM.addFile().

To search through, or visit, the AST, the indexer creates a new ASTVisitor and calls IASTTranslationUnit.accept(ASTVisitor). Each of the IASTNodes in the AST has an accept() method, and each works in much the same way:

  1. It checks one of the ASTVisitor's Boolean fields to determine if the visitor is interested in visiting this node.
  2. If the visitor is interested, the node calls the visitor's visit() method, which returns one of three values:
    • PROCESS_SKIP -- The visitor isn't interested in the node's children, so accept() returns true.
    • PROCESS_ABORT -- The visitor isn't interested in visiting at all, so accept() returns false.
    • PROCESS_CONTINUE -- The visitor wants to visit the node's children, so accept() continues to the next step.
  3. If the Boolean field is false or visit() returns PROCESS_CONTINUE, accept() calls the accept() method of each of its children and passes the visitor as its argument.

For example, if a C function declaration has a declaration specifier, such as typedef or static, its corresponding CASTFunctionDefinition.accept() method checks the visitor's shouldVisitDeclarations field. If the field is true, it calls the visitor's visit() method. If the field is false or visit() returns PROCESS_CONTINUE, the function node accesses its child and calls CASTSimpleDeclSpecifier.accept(). When the visitor reaches one of the final, or terminal, nodes, accept() simply returns true.

Table 1 shows the Boolean fields that control which types of nodes are allowed to call the visitor's visit() method. By default, all of them are set to false, so visit() is never called.

Table 1. The ASTVisitor fields that control node visitation
shouldVisitNamesshouldVisitDeclarationsshouldVisitInitializers
shouldVisitDeclaratorsshouldVisitDeclSpecifiersshouldVisitExpressions
shouldVisitTypeIdsshouldVisitEnumeratorsshouldVisitTranslationUnit
shouldVisitParameterDeclarationsshouldVisitStatementsshouldVisitProblems

The best way to understand how ASTVisitors work is to look at an example. Listing 1 shows the visitor used by the PDOMFullIndexerJob to find IASTName nodes. To visit these nodes, it sets shouldVisitNames and shouldVisitDeclarations to true.

Listing 1. The PDOMFullIndexerJob's ASTVisitor
ast.accept(new ASTVisitor() {
   // Set the Boolean fields so that name and declaration nodes call visit()
   {
      shouldVisitNames = true;
      shouldVisitDeclarations = true;
   }

   // Perform the visit
   public int visit(IASTName name) {
      try {
         IASTFileLocation fileloc = name.getFileLocation();
         if (fileloc != null) {
            PDOMFile file = pdom.addFile(fileloc.getFileName());
            linkage.addName(name, file);
         }
         return PROCESS_CONTINUE;
      } catch (Throwable e) {
         CCorePlugin.log(e);
         return ++errorCount > MAX_ERRORS ? PROCESS_ABORT : PROCESS_CONTINUE;
      }
   };
});

As shown, the indexer's visitor is only interested in IASTNames, which are the only nodes stored in the PDOM. These represent specific identifiers, such as the names of functions, variables, and enumerated types. When the visitor reaches one of these nodes, it accesses a PDOMFile object representing the name's source file. Then it uses the PDOM's linkage object (PDOMLinkage) to add the IASTName to the index.

Before the IASTName can be stored, the linkage needs to know its type, or binding. For instance, whether it's the name of a variable, function, etc. To find out, the linkage finds the node's parent and its scope. With this scope, the linkage figures outs how the name is used and encapsulates that information in an IBinding. This is why the new parsing process makes scope determinations after the parse -- it's only interested in the scopes of IASTNames.

To clarify names and bindings, I debugged the CDT's parsing of the Hello World program and provided the AST breakdown shown in Figure 3. The AST contains a single IASTDeclarationUnit that represents the main() function. The CASTFunctionDeclarator holds a CASTName whose IBinding is a CFunction.

Figure 3. Hello World AST
Hello World AST

The linkage adapts the IBinding into a persistible PDOMBinding and the IASTName into a PDOMName. The constructors for these new objects access the PDOM to store their information. To show how this works, I need to back up and explain how the PDOM operates.

Step 4: The PDOM structure

If you've created any CDT projects, I recommend you open your workspace directory and the .metadata/plugins/org.eclipse.cdt.core directory underneath it. There, you'll see a *.pdom file for each of your projects. When the PDOM is created for a project, it constructs a Database object to manage its low-level storage. The Database builds a new RandomAccessFile and gives it the project name and the time in milliseconds -- these are the *.pdom files you see.

Java I/O and Java NIO

Most of the Java™ input/output routines I've seen still use the Java I/O application program interface (API). This moves data in byte streams, providing InputStreams, OutputStreams, Writers, and Readers. This is fine when performance isn't an issue. The classes in the Java New I/O (NIO) API move data using the same efficient, bulk-transfer methods as your operating system. This isn't important for small files, but it makes a big difference when file sizes approach the size of operating system pages (for example, 2 KB, 4 KB, 8 KB). Not coincidentally, the Database initializes its *.pdom files at 16 KB.

In the CDT, the most important Java NIO Buffer is the MappedByteBuffer, which is created on top of an existing RandomAccessFile. Normal file access requires buffering, heap allocation, and file I/O routines. MappedByteBuffers, on the other hand, are immediately updated by the virtual memory controller, and the memory doesn't affect the Java heap. Finally, since MappedByteBuffers lie outside process memory, multiple processes can efficiently access them at the same time. However, because it operates so closely to the operating system, its behavior is more operating system-dependent than Java I/O routines.

This Database has more in common with regular databases than just the name. It divides memory into variable-length records composed of 16-byte blocks. Each record represents a different PDOMNode, such as a PDOMLinkage or PDOMBinding. To enable memory access, it provides put and set methods for ints, chars, and bytes.

To make reading and writing to the file as efficient as possible, the Database uses an array of Chunks that manage MappedByteBuffers on the *.pdom file. When the Database creates this file, it sets its size to 16 KB and allocates the first Chunk in the array to buffer. The Database creates further 16 KB Chunks as the index requires more memory.

The first object added to the database is the PDOMLinkage. New linkages are added as new languages are used in the project. When its constructor is called, the PDOMLinkage:

  • Sets its initial record size at 24 bytes.
  • Stores a 0 to represent the node's type (linkage).
  • Stores a 0 to show that it has no parent node.
  • Stores the size of memory needed to store the linkage's name.
  • Allocates the memory and stores the linkage's name.
  • Stores the size of memory needed to store the linkage's ID.
  • Allocates the memory and stores the linkage's ID.
  • Stores its record size at a specific address, if this is the first project linkage. If it isn't the first project linkage, it finds the next available address and stores the size there.

The memory allocation is performed by Database.malloc(int size), which starts by determining the minimum number of blocks needed to provide the requested memory. If there isn't enough memory available, it creates a new Chunk. Otherwise, it accesses the Chunk that can provide the memory, clears the required number of blocks, and returns the location of the first allocated block.

The storage process is similar for other objects, such as when the linkage creates a new PDOMBinding as part of storing a named node. If the binding isn't contained within the scope of another binding, the linkage sets itself as the binding's parent. Then, the PDOMBinding constructor sets its initial record size at 24 bytes and performs most of the same steps as those listed above.

The PDOMName works differently. There are three ways a name can be associated with its binding: as a definition, declaration, or reference. When the PDOMName determines which role it plays, it updates the PDOMBinding's record. Then it stores the location of the PDOMFile whose file contains it, along with its location and length. Finally, it populates the file's index by adding its block location to the PDOMFile.

Because of the parent/child relationship of PDOMNodes, the index forms a tree similar to the parser's AST. The first time PDOMLinkage.getIndex() is called, it creates a BTree using the PDOM's Database and the root block location. You can search through this tree with an IPDOMVisitor. This functions like the ASTVisitor with three important differences:

  • It has no Boolean fields -- all nodes are visited.
  • It uses only a single visit(IPDOMNode node) method, which checks if the visited node is of a certain type.
  • It contains a leave(IPDOMNode node) to stop search when a particular node type is reached.

You can also search through the PDOM using the static methods provided in the DOMSearchUtil class.


Updating the BBCDT

I've added the PDOM classes, and considering the size of the application, Bare Bones just doesn't seem appropriate anymore. I've set the PDOMManager to create only PDOMFullIndexers for project indexing, so there are no preferences or descriptors involved. Also, I changed the ASTAction to perform two tasks: search through the IASTTranslationUnit with a BBASTVisitor and search through the PDOM with a BBPDOMVisitor. These classes are in the org.dworks.bbcdt.ui.action package. The result of clicking ASTAction is shown in Figure 4.

Figure 4. Contents of the BBCDT index
Contents of the BBCDT Index

The BBASTVisitor searches through the AST finding IASTNames and IASTForStatements. When it discovers a loop, it displays three expressions: the initializer, condition, and iteration. The BBPDOMVisitor finds functions in the BBPDOMVisitor and displays the number present.


Conclusion

Regarding parsing, Java NIO, and pseudo-database access, there's no question that indexing is a complicated process. However, this complexity is needed to provide efficient, all-embracing code analysis and storage.

Now that I've explained the PDOM, it's time to see how the CDT applies it. In Part 5, I discuss code completion and content assist, including how they work and how they access the PDOM index.


Download

DescriptionNameSize
BBCDT code for Part 4os-ecl-cdt4.zip512KB

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Open source on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Open source
ArticleID=169638
ArticleTitle=Building a CDT-based editor, Part 4: Advanced CDT parsing and the Persisted Document Object Model
publish-date=10242006