
Distributed compilation

A programmer’s delight

Arpan Sen (arpansen@gmail.com), SMTS, Mentor Graphics
Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems. You can reach him at arpan@syncad.com.

Summary:  Learn about open source tool options that can help speed up your build process by distributing the process across multiple machines in a local area network.

Date:  11 Nov 2008
Level:  Intermediate

Reducing the build time for C/C++-based systems is one of the major technical challenges any release or build engineer faces. This article looks into some of the open source tool options available that help speed up the build process by parallelizing the activity: distributing the build process across multiple machines in a local area network. The discussion in this article primarily focuses on GNU make, due to its wide availability.

The -j option in GNU make

By default, make is a sequential utility: it invokes the underlying compiler serially to compile C/C++ sources. Typically, though, C/C++ source files (usually with a .cpp/.cxx extension) can be built without depending on each other, so they can be compiled in parallel. You do so by invoking make with the -j option. Listing 1 shows a typical usage.


Listing 1. Typical GNU make invocation

make -j10 -f makefile.x86_linux

The argument to -j (10 in this example) is the maximum number of simultaneous compilations that make starts once the build begins. If no argument is provided to -j, make places no limit on the number of jobs, so every compilation whose prerequisites are ready is started at once. Using the -j option makes particular sense when you're running the build on a multicore system. To make the -j option work for you, you must address several key issues; these are discussed in the next section.

Issues and potential solutions when using the -j option

First, you should check your system configuration. On a low-memory (<512MB RAM) system, too many simultaneous compilations can slow the system due to paging, and the compile time increases instead of decreasing. You need to experiment to figure out the optimal value of -j for your system. Another option is to use the -l (--load-average) option of GNU make along with -j, which starts new jobs only while the system load average is below the given level.
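
As a starting point for that experiment, the job count can be derived from the number of CPUs. The following is a minimal sketch, assuming that getconf _NPROCESSORS_ONLN is available on the build machine (it is common but not universal) and that one job per CPU is a sensible first guess; neither assumption comes from the article. The script only prints the make command, so it is safe to try anywhere.

```shell
#!/bin/sh
# Sketch: derive a starting value for -j and -l from the number of online CPUs.
# Assumption: getconf _NPROCESSORS_ONLN works here; fall back to 1 if it does not.
cpus=$(getconf _NPROCESSORS_ONLN 2>/dev/null) || cpus=1
case "$cpus" in
    ''|*[!0-9]*) cpus=1 ;;   # guard against empty or non-numeric output
esac

# Assumption: one job per CPU is only a first guess; tune it by measuring.
jobs=$cpus

# Build the command line; -l keeps make from starting new jobs under high load.
cmd="make -j$jobs -l$cpus -f makefile.x86_linux"
echo "$cmd"
```

Run the printed command with a few different -j values and compare the wall-clock build times before settling on one.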

Another issue arises when independent compilations use the same temporary file. Consider the make snippet shown in Listing 2.


Listing 2. Makefile with the same temporary file y.tab.c

my_parser : main.o parser1.o parser2.o
       g++ -o $@ $^


parser1.o : parser1.y 
       yacc parser1.y
       g++ -o $@ -c y.tab.c

parser2.o : parser2.y 
       yacc parser2.y
       g++ -o $@ -c y.tab.c

Assume that the grammar files parser1.y and parser2.y are located in the same directory. During sequential compilation, yacc generates y.tab.c (the default output filename) first for parser1 and then for parser2, and each copy is compiled before the next one overwrites it. In parallel mode, both rules may write and read y.tab.c at the same time, which results in a conflict. You can resolve this in a couple of ways: keep the two yacc files in separate folders, or use the -b option to give each generated C file its own name, as shown in Listing 3.


Listing 3. Use the -b option of yacc to generate unique filenames

parser1.o : parser1.y 
       yacc -b parser1 parser1.y
       g++ -o $@ -c parser1.tab.c

You must look closely at the makefile to find such situations, where rules that work fine in serial mode break when they run in parallel.
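
One way to start that review is to search the makefile for yacc invocations that rely on the default output name. The sketch below is an illustration only: count_default_yacc is a hypothetical helper name, and it assumes that recipe lines are indented and call yacc directly. It is demonstrated on a throwaway file modeled on the two yacc rules of Listing 2.

```shell
#!/bin/sh
# Sketch: count recipe lines that run yacc without -b and therefore write y.tab.c.
# count_default_yacc is a hypothetical helper, not part of make or yacc.
count_default_yacc() {
    # Match indented recipe lines that invoke yacc, skip those that pass -b,
    # and print how many remain (0 if none).
    awk '/^[ \t]+yacc([ \t]|$)/ && !/[ \t]-b([ \t]|$)/ { n++ }
         END { print n + 0 }' "$1"
}

# Demonstration on a temporary makefile modeled on the yacc rules of Listing 2.
demo=$(mktemp)
printf 'parser1.o : parser1.y\n\tyacc parser1.y\n\tg++ -o $@ -c y.tab.c\n' > "$demo"
printf 'parser2.o : parser2.y\n\tyacc parser2.y\n\tg++ -o $@ -c y.tab.c\n' >> "$demo"

clashes=$(count_default_yacc "$demo")
echo "yacc invocations using the default output name: $clashes"
rm -f "$demo"
```

A count greater than one in the same directory is a hint that two rules may race for y.tab.c under make -j.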

Some makefile rules have implicit dependencies. Consider the situation shown in Listing 4, where a shell script generates a header that is included by other sources.


Listing 4. Makefile with implicit dependencies

my_exe: info.h test1.o test2.o 
       g++ -o $@ $^ 

test1.o: test1.cxx 
       g++ -c $<

test2.o: test2.cxx 
       g++ -c $<

info.h: 
       make_header #shell script that generates the header file 

The info.h header is included by test1.cxx and test2.cxx. In serial build mode, make works through the prerequisites from left to right, so the file info.h is generated first. In parallel build mode, however, make is free to process all the prerequisites of my_exe at once. Compilations can then fail intermittently, because info.h may not exist yet when the compilation of test1.cxx or test2.cxx starts. To fix this problem, remove info.h from the dependency list of my_exe and put it in the dependency lists of test1.o and test2.o, which states the real dependency. It's also advisable to guard the generator script so that info.h is generated only once. Listing 5 shows the modified version of the make_header script, and Listing 6 shows the makefile.


Listing 5. Modified version of make_header script to prevent multiple writes

#!/usr/bin/bash

# Do nothing if the header has already been generated.
if [ -f info.h ]
then
  exit
fi

echo "#ifndef __INFO_H" > info.h
echo "#define __INFO_H" >> info.h

echo "#include <iostream>" >> info.h
echo "using namespace std;" >> info.h
echo "int f1(int);" >> info.h
echo "int f2(int);" >> info.h

echo "#endif" >> info.h


Listing 6. Modified version of the makefile from Listing 4

my_exe: test1.o test2.o 
    g++ -o $@ $^ 

test1.o: test1.cxx info.h
    g++ -c $<

test2.o: test2.cxx info.h
    g++ -c $<

info.h: 
    make_header #shell script that generates the header file 
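
The existence check in Listing 5 can still see a half-written info.h if a second process looks at the file while the first is appending to it. One possible hardening, which is an assumption on my part and not part of the original script, is to write the header to a temporary file and rename it into place. The sketch below uses a hypothetical function name, generate_header, and runs in a scratch directory.

```shell
#!/bin/sh
# Sketch: generate the header atomically. generate_header is a hypothetical name.
generate_header() {
    out=$1

    # Nothing to do if a complete header is already in place.
    if [ -f "$out" ]; then
        return 0
    fi

    # Write to a temporary file next to the target, so that mv is a plain rename
    # and readers never see a partially written header.
    tmp="$out.tmp.$$"
    {
        echo "#ifndef __INFO_H"
        echo "#define __INFO_H"
        echo "#include <iostream>"
        echo "using namespace std;"
        echo "int f1(int);"
        echo "int f2(int);"
        echo "#endif"
    } > "$tmp" || return 1
    mv "$tmp" "$out"
}

# Demonstration in a scratch directory; the second call finds the file and returns.
workdir=$(mktemp -d)
generate_header "$workdir/info.h"
generate_header "$workdir/info.h"
header_lines=$(grep -c '' "$workdir/info.h")
echo "info.h has $header_lines lines"
```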

In general, make -j can extract sufficient parallelism if you create the makefile properly. Try to avoid unnecessary dependencies in the makefile wherever possible.

Note that GNU make can only extract parallelism for a single machine. The next section introduces distcc, a tool that lets you share the build process on multiple machines.

Introducing distcc

The distcc tool can distribute the builds of C/C++ code across multiple machines. Each of these machines must have distcc installed. Here's a quick installation and configuration reference:

  1. Download distcc (see the Resources section).
  2. Build the distcc sources on all machines by executing ./configure && make && make install.
  3. The build process starts from one machine (the client) and is then distributed to all the other machines (the servers). On all the servers, start the distccd daemon (you must have root privileges to do this). The distccd startup script resides in the /etc/init.d folder. The syntax to start it from a root shell is
    tcsh-arpan# /etc/init.d/distccd start

    And the syntax to start it from a regular user account through sudo is
    tcsh-arpan$ sudo /etc/init.d/distccd start

    You can also run the distccd daemon without root privileges by running distccd --daemon -j N, where N is the number of jobs you want to run on a given machine.
  4. The local machine needs to know which servers the build processes should be distributed to. Depending on your shell, issue a modified version of this command:
    export DISTCC_HOSTS='localhost tintin asterix pogo'
    

    tintin, asterix, and pogo are other hosts in the network that can host build processes; localhost refers to the local machine.
  5. Instead of using the export directive, you can also create a file named hosts and put the names of the servers in that file, separated by spaces. Copy this file to the $HOME/.distcc folder.
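
The hosts file from the last step can be created with a few lines of shell. This is a minimal sketch: write_hosts is a hypothetical helper name, and the demonstration writes into a scratch directory so that it doesn't touch an existing configuration. For real use, pass "$HOME/.distcc" as the directory.

```shell
#!/bin/sh
# Sketch: write a space-separated distcc hosts file into a configuration folder.
# write_hosts is a hypothetical helper name, not a distcc command.
write_hosts() {
    dir=$1
    shift
    # Create the folder if needed, then store all remaining arguments on one line.
    mkdir -p "$dir" && echo "$*" > "$dir/hosts"
}

# Demonstration in a scratch directory; use "$HOME/.distcc" for a real setup.
conf_dir=$(mktemp -d)
write_hosts "$conf_dir" localhost tintin asterix pogo
cat "$conf_dir/hosts"
```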

How does distcc work?

distcc works by sending the preprocessed code to other designated machines in the network. The distccd daemon process on each remote machine runs the compilation there. distcc is designed to be used with the parallel build (-j) option of GNU make. distcc isn't a compiler in itself; it just serves as a front end to g++. Almost all options of g++ can be passed as is to distcc.

Now that distcc has been installed, the only thing remaining to be done is to fire the build. Here's the invocation:

make -j4 CC=distcc -f makefile.x86_linux

Key things to keep in mind while working with distcc

For distcc to work to your advantage, you must keep several things in mind:

  • The machines must have identical configurations. This means the same version of the g++ compiler must be installed on all the machines, along with related build tools like ar, ranlib, libtool, and so on. The type and version of the operating system should also be the same.
  • From the client machine, distcc sends the preprocessed code to the server machines. You need to verify whether the distccd daemon process is running on the server machine.
  • By default, the number of jobs that distcc schedules on a single machine is (no. of CPUs) + 2. For a single-core machine, this number is 3. Keep this in mind while you're firing the processes: with a command line like make -j10 CC=distcc and only three single-core hosts, nine compile jobs are fired initially.
  • Verify that the underlying machines can access the requisite file systems on which source files are stored. On Network File System (NFS) based systems, some source areas may not be mounted, which results in compilation failures. You must also carefully monitor network congestion.
  • distcc is used to compile the sources over the network. The linking step(s) cannot be distributed.
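
The arithmetic behind the nine-job figure can be written down explicitly. The sketch below assumes that every host uses the default limit of (no. of CPUs) + 2 jobs and that all three hosts are single-core machines; the variable names are illustrative.

```shell
#!/bin/sh
# Sketch: estimate how many jobs the build hosts accept in total, assuming each
# host uses the default limit of (CPUs + 2) jobs.
hosts="tintin asterix pogo"   # the three hosts of the example
cpus_per_host=1               # assumption: single-core machines

set -- $hosts                 # split the list into positional parameters
host_count=$#
capacity=$(( host_count * (cpus_per_host + 2) ))

echo "hosts=$host_count capacity=$capacity"
```

A -j value above this capacity does not add parallelism; the extra jobs simply wait for a free slot.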

What about those parts of the build process that must be run sequentially?

Some steps in the build process cannot be parallelized: running scripts to generate certain headers, linking, and so on must be performed on a single machine. To better handle this situation, it's a good idea to split the original makefile into a couple of makefiles, clearly demarcating those that can and can't be parallelized, and run them as follows:
tcsh-arpan$ make -f make.init;
make CC=distcc -j4 -f make.compile_x86;
make -f make.link
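
When the three phases run from a script, it helps to stop at the first phase that fails instead of continuing to the link step. The following sketch chains the same three commands with &&. The run and build_all function names and the DRY_RUN variable are hypothetical; with DRY_RUN set, the script only prints the commands, which makes it possible to check the plan on a machine without the makefiles.

```shell
#!/bin/sh
# Sketch: run the three build phases in order and stop at the first failure.
# run, build_all, and DRY_RUN are hypothetical names used for illustration.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"     # dry run: show the command instead of executing it
    else
        "$@"
    fi
}

build_all() {
    run make -f make.init &&
    run make CC=distcc -j4 -f make.compile_x86 &&
    run make -f make.link
}

# Demonstration: print the plan without invoking make.
DRY_RUN=1
plan=$(build_all)
echo "$plan"
```

Unset DRY_RUN and call build_all to run the real build.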

Monitoring the distcc compilation process

The distcc installation includes a console-based monitoring tool called distccmon-text. Before starting the build process, it's worthwhile to open a separate terminal window and issue distccmon-text 5. This terminal then refreshes the compile status of the nodes in the network every five seconds. Listing 7 shows a sample of the monitoring window.


Listing 7. Output from distccmon-text

2167  Compile     memory.c          tintin[0]
2164  Compile     main.cxx          tintin[1]
2192  Compile     ui_tcl.cxx        asterix[0]
2187  Compile     traverse.c        asterix[1]
2177  Compile     reports.cxx       pogo[0]
2184  Compile     messghandler.c    pogo[1]
2181  Compile     trace.cpp         localhost[0]
2189  Compile     remote.c          localhost[1]
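
Output in this form is easy to summarize per host, which shows at a glance whether the load is spread evenly. The sketch below assumes the four-column layout of Listing 7 (process ID, state, file, host[slot]); jobs_per_host is a hypothetical helper name, and the sample input is copied from the listing.

```shell
#!/bin/sh
# Sketch: count active jobs per host from distccmon-text style output.
# Assumes four columns as in Listing 7: PID, state, file, host[slot].
# jobs_per_host is a hypothetical helper name.
jobs_per_host() {
    awk 'NF >= 4 {
             host = $4
             sub(/\[[0-9]+\]$/, "", host)   # strip the [slot] suffix
             n[host]++
         }
         END { for (h in n) print h, n[h] }' | sort
}

# Sample input copied from Listing 7.
sample='2167  Compile     memory.c          tintin[0]
2164  Compile     main.cxx          tintin[1]
2192  Compile     ui_tcl.cxx        asterix[0]
2187  Compile     traverse.c        asterix[1]
2177  Compile     reports.cxx       pogo[0]
2184  Compile     messghandler.c    pogo[1]
2181  Compile     trace.cpp         localhost[0]
2189  Compile     remote.c          localhost[1]'

summary=$(printf '%s\n' "$sample" | jobs_per_host)
echo "$summary"
```

During a real build, the same function can read from the monitor directly, for example distccmon-text | jobs_per_host.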

Use ccache to further speed up compilation

Usually, when header files are modified in a C/C++ development framework, an average make-based system ends up recompiling all source files. Typically, however, header-file changes affect only a subset of the source files, so a time-consuming clean build isn't really needed. This is where ccache helps: it's a tool that drastically reduces the time it takes to clean-build a system, often by a factor of 5 to 10.

ccache acts as a cache to the compiler. It works by creating a hash from the preprocessed sources and the compiler options used to compile the sources. While recompiling, if ccache detects no changes in the preprocessed source and compiler options, it retrieves its cached copy of the previously compiled output. This helps speed up the compilation process.
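
The idea of keying the cache on the preprocessed source plus the compiler options can be illustrated with a toy model. This is not how ccache is implemented: cksum stands in for its real hash function, and cache_key is a hypothetical name. The sketch only shows that identical input yields the same key and that a change in the options yields a different one.

```shell
#!/bin/sh
# Toy model of a compiler cache key: a checksum over the compiler options and
# the preprocessed source text. cksum is a stand-in for ccache's real hash,
# and cache_key is a hypothetical helper name.
cache_key() {
    opts=$1
    src=$2
    # cksum prints a CRC and a byte count; both together form the toy key.
    { printf '%s\n' "$opts"; printf '%s\n' "$src"; } | cksum
}

src='int f1(int x) { return x + 1; }'

key_a=$(cache_key "-O2 -c" "$src")      # first compilation
key_b=$(cache_key "-O2 -c" "$src")      # same source, same options
key_c=$(cache_key "-O0 -g -c" "$src")   # same source, different options

if [ "$key_a" = "$key_b" ]; then
    echo "same source and options: cache hit"
fi
if [ "$key_a" != "$key_c" ]; then
    echo "different options: cache miss"
fi
```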

Install ccache

To download ccache (version 2.4 at the time of writing), see the Resources section. Once in the ccache directory, issue the command ./configure --prefix=/usr followed by make && make install, which places the ccache binary in /usr/bin. If ccache isn't installed in /usr/bin, verify that the ccache location is defined as part of the PATH environment variable.

ccache environment variables

The following are some of the environment variables you can use to customize the ccache setup:

  • CCACHE_DIR -- Specifies the folder where ccache stores the cached compiler outputs. If you don't define this variable, then by default the cached output is stored in $HOME/.ccache.
  • CCACHE_TEMPDIR -- Specifies the folder where ccache puts temporary files that it generates. If you don't define this variable, then by default $HOME/.ccache is used. It's a good idea to define both this variable and CCACHE_DIR -- most organizations have a user quota for specific file-system areas, and if $HOME belongs to such an area the quota will quickly be exhausted. Explicitly setting the cache area avoids this problem.
  • CCACHE_DISABLE -- If set, tells ccache to invoke the compiler proper, bypassing the cache. Used for diagnostic purposes.
  • CCACHE_RECACHE -- If set, tells ccache to ignore the existing entries in the cache and call the compiler, but still cache the new results. Used for diagnostic purposes.
  • CCACHE_LOGFILE -- If set, tells ccache to record the hit and miss statistics from the cache in this file. Very useful for diagnostics.
  • CCACHE_PREFIX -- Adds a prefix to the command line that ccache uses to invoke the compiler proper. This is used in particular to interface ccache with distcc, as described in detail in the next section.
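
A typical setup moves the cache and the temporary files out of $HOME by exporting the variables before the build starts. The sketch below is illustrative: CACHE_ROOT is a hypothetical variable that names the area to use, and when it is unset the script falls back to a fresh scratch directory so that it can be tried without side effects.

```shell
#!/bin/sh
# Sketch: keep the ccache cache, temporary files, and log outside $HOME.
# CACHE_ROOT is a hypothetical variable; when unset, a scratch directory is used.
cache_root=${CACHE_ROOT:-$(mktemp -d)}

CCACHE_DIR=$cache_root/cache
CCACHE_TEMPDIR=$cache_root/tmp
CCACHE_LOGFILE=$cache_root/ccache.log
export CCACHE_DIR CCACHE_TEMPDIR CCACHE_LOGFILE

# Create the folders up front so that the first compilation can use them.
mkdir -p "$CCACHE_DIR" "$CCACHE_TEMPDIR"

echo "cache: $CCACHE_DIR"
echo "temp:  $CCACHE_TEMPDIR"
echo "log:   $CCACHE_LOGFILE"
```

Source the script (rather than executing it) in the shell that later runs make, so that the exported variables are visible to the build.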

Use ccache

You can use ccache with or without distcc. It doesn't depend on the -j option of make. The simplest usage of ccache is as follows: ccache g++ -o <executable name> <source file(s)>. When used with a makefile, overriding the CC variable suffices; see Listing 8.


Listing 8. Sample makefile using the CC variable

CC := g++
app1: placer1.o route1.o floorplan1.o
    $(CC) -o $@ $^ 
placer1.o: placer1.cxx
    $(CC) -o $@ -c $<
… 

With the makefile in Listing 8, the syntax to issue make is make "CC=ccache g++".

To use distcc with ccache, you need to set the CCACHE_PREFIX environment variable to distcc, as follows: export CCACHE_PREFIX=distcc. (This syntax is valid for the bash shell. If you use another shell, modify the syntax accordingly.)

Here's a sample make invocation with ccache and distcc:

export CCACHE_PREFIX=distcc; make "CC=ccache g++" -j4 -f makefile.x86

During the build, make issues commands such as ccache g++ -o placer1.o -c placer1.cxx; when ccache needs to run the compiler, it prefixes the compiler command with distcc. Note that ccache only needs to be installed on the local machine. ccache makes the first check to decide whether the copy in the local cache suffices; otherwise, it hands the baton to distcc for distributed compilation.

Conclusion

This article has delved into the GNU make, distcc, and ccache tools, which can help you parallelize the build process. These tools come with several other features that can further customize this effort. For example, ccache has a -M option that restricts the size of the cache, and the distcc installation has a GUI-based monitor called distccmon-gnome that tracks network build activity (it's built if distcc is configured with the --with-gtk or --with-gnome option). The links in the Resources section provide further information.


Resources

Learn

Get products and technologies

Discuss
