Distributed compilation

A programmer’s delight

Learn about open source tool options that can help speed up your build process by distributing the process across multiple machines in a local area network.

Arpan Sen (arpansen@gmail.com), Independent author

Arpan Sen is a lead engineer working on the development of software in the electronic design automation industry. He has worked on several flavors of UNIX, including Solaris, SunOS, HP-UX, and IRIX as well as Linux and Microsoft Windows for several years. He takes a keen interest in software performance-optimization techniques, graph theory, and parallel computing. Arpan holds a post-graduate degree in software systems. You can reach him at arpansen@gmail.com.



11 November 2008


Reducing the build time for C/C++-based systems is one of the major technical challenges any release or build engineer faces. This article looks into some of the open source tool options available that help speed up the build process by parallelizing the activity: distributing the build process across multiple machines in a local area network. The discussion in this article primarily focuses on GNU make, due to its wide availability.

The -j option in GNU make

By default, make is a sequential utility: it invokes the underlying compiler serially, one source file at a time. Typically, however, C/C++ source files (usually with a .cpp/.cxx extension) can be compiled independently of each other, so they can be built in parallel. You do so by invoking make with the -j option. Listing 1 shows a typical usage.

Listing 1. Typical GNU make invocation
make -j10 -f makefile.x86_linux

The argument to -j (10 in this case) is the maximum number of simultaneous compilations that can run once the build process starts. If no argument is provided to -j, make places no limit on the number of jobs, so all source files are queued up in the system for simultaneous compilation. Using the -j option makes particular sense when you're running the build on a multicore system. To make the -j option work for you, you must address several key issues; these are discussed in the next section.

Issues and potential solutions when using the -j option

First, check your system configuration. On a low-memory (<512MB RAM) system, too many simultaneous compilations can slow the system down because of paging, and the compile time then increases rather than decreases. You need to experiment to find the optimal value of -j for your system. Another option is to combine -j with the -l (--load-average) option of GNU make, which starts new jobs only while the system load average is below the limit you specify.
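
As a starting point for that experimentation, the following shell sketch derives a job count from the number of online processors and uses the same value as the load-average limit. The function name suggest_make_flags is hypothetical, and the use of getconf _NPROCESSORS_ONLN is an assumption (it works on Linux and many UNIX systems but isn't guaranteed everywhere). Treat the result as a first guess to tune from, not as an optimal setting.

```shell
#!/bin/sh
# Sketch: print a first-guess "-jN -lN" pair for GNU make.
# suggest_make_flags is a hypothetical helper, not part of make.
suggest_make_flags() {
    # Use the CPU count passed as $1, or ask the system for it.
    cpus=${1:-$(getconf _NPROCESSORS_ONLN 2>/dev/null)}
    # Fall back to a single job if the count is missing or not a number.
    case $cpus in
        ''|*[!0-9]*) cpus=1 ;;
    esac
    [ "$cpus" -ge 1 ] || cpus=1
    # -jN limits simultaneous jobs; -lN holds back new jobs above load average N.
    printf '%s\n' "-j$cpus -l$cpus"
}
```

A possible invocation is make $(suggest_make_flags) -f makefile.x86_linux.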

Another issue arises when independent compilations use the same temporary file. Consider the make snippet shown in Listing 2.

Listing 2. Makefile with the same temporary file y.tab.c
my_parser : main.o parser1.o parser2.o
       g++ -o $@ $^


parser1.o : parser1.y 
       yacc parser1.y
       g++ -o $@ -c y.tab.c

parser2.o : parser2.y 
       yacc parser2.y
       g++ -o $@ -c y.tab.c

Assume that the grammar files parser1.y and parser2.y are located in the same directory. During sequential compilation, the file y.tab.c is generated by yacc (where y.tab.c is the default filename) and compiled for parser1, and then the same happens for parser2; but in parallel mode, both rules write y.tab.c at the same time, which results in a conflict. You can solve this situation a couple of ways: keep the two yacc files in separate folders; or use the -b option to generate two different C outputs, as shown in Listing 3.

Listing 3. Use the -b option of yacc to generate unique filenames
parser1.o : parser1.y 
       yacc -b parser1 parser1.y
       g++ -o $@ -c parser1.tab.c

Examine the makefile closely for situations like this, where rules that work fine in a serial build interfere with each other when they run in parallel.

Some makefile rules also have implicit dependencies. Consider the situation shown in Listing 4, where a script generates a header that is included by other sources.

Listing 4. Makefile with implicit dependencies
my_exe: info.h test1.o test2.o 
       g++ -o $@ test1.o test2.o

test1.o: test1.cxx 
       g++ -c $<

test2.o: test2.cxx 
       g++ -c $<

info.h: 
       make_header #shell script that generates the header file

The info.h header is included by test1.cxx and test2.cxx. In serial build mode, make works from left to right, and the file info.h is generated first. However, in parallel build mode, make is free to process all dependencies in parallel -- this can cause some compilations to fail intermittently, because info.h may not have been generated before the compilation of test1.cxx and/or test2.cxx starts. To fix this problem, remove info.h from the dependency list of my_exe and add it to the dependency lists of test1.o and test2.o, the targets that actually need it. It's also advisable to guard the script so that info.h is generated only once. Listing 5 shows the modified version of the make_header script, and Listing 6 shows the makefile.

Listing 5. Modified version of make_header script to prevent multiple writes
#!/usr/bin/bash

# Do nothing if the header has already been generated.
if [ -f info.h ]
then
  exit 0
fi

echo "#ifndef INFO_H" > info.h
echo "#define INFO_H" >> info.h

echo "#include <iostream>" >> info.h
echo "using namespace std;" >> info.h
echo "int f1(int);" >> info.h
echo "int f2(int);" >> info.h

echo "#endif" >> info.h

Listing 6. Modified version of the makefile from Listing 4
my_exe: test1.o test2.o 
    g++ -o $@ $^ 

test1.o: test1.cxx info.h
    g++ -c $<

test2.o: test2.cxx info.h
    g++ -c $<

info.h: 
    make_header #shell script that generates the header file
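
The existence check in Listing 5 still leaves a window in which a compiler could read a half-written info.h. One way to narrow that window, sketched here under the assumption that a rename within one directory is atomic on the file system in use, is to write the header to a temporary file and then rename it into place. The function name generate_header and the temporary-file naming are illustrative and not part of the original script, and the header body is abbreviated.

```shell
#!/bin/sh
# Sketch: generate a header atomically (write to a temp file, then rename).
# generate_header is a hypothetical replacement for the body of make_header.
generate_header() {
    target=${1:-info.h}
    # Nothing to do if the header already exists.
    [ -f "$target" ] && return 0
    # Keep the temp file in the same directory so that mv is a rename, not a copy.
    tmp="$target.tmp.$$"
    {
        echo "#ifndef INFO_H"
        echo "#define INFO_H"
        echo "int f1(int);"
        echo "int f2(int);"
        echo "#endif"
    } > "$tmp" || return 1
    # Readers see either no header or the complete one, never a partial file.
    mv "$tmp" "$target"
}
```

Calling generate_header with no argument writes info.h in the current directory, which matches what the makefile rule expects.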

In general, make -j can extract sufficient parallelism if the makefile declares its dependencies correctly. Try to avoid unnecessary dependencies in the makefile wherever possible.

Note that GNU make can only extract parallelism for a single machine. The next section introduces distcc, a tool that lets you share the build process on multiple machines.

Introducing distcc

The distcc tool can distribute the builds of C/C++ code across multiple machines. Each of these machines must have distcc installed. Here's a quick installation and configuration reference:

  1. Download distcc (see the Resources section).
  2. Build and install the distcc sources on all machines by executing ./configure && make && make install.
  3. The build process starts from one machine (the client) and is then distributed to all the other machines (the servers). On all the servers, start the distccd daemon (you must have root privileges to do this). The distccd startup script resides in the /etc/init.d folder. The syntax to start it as root is
    tcsh-arpan# /etc/init.d/distccd start

    And the syntax to start it from a regular account with sudo privileges is
    tcsh-arpan$ sudo /etc/init.d/distccd start

    You can also run the distccd daemon as an ordinary user by running distccd --daemon -j N, where N is the maximum number of jobs you want to run on a given machine.
  4. The local machine needs to know which servers the build processes should be distributed to. Depending on your shell, issue a modified version of this command:
    export DISTCC_HOSTS='localhost tintin asterix pogo'

    tintin, asterix, and pogo are other hosts in the network that can host build processes; localhost refers to the local machine.
  5. Instead of using the export directive, you can also create a file named hosts and put the names of the servers in that file, separated by spaces. Copy this file to the $HOME/.distcc folder.
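
The hosts file from step 5 can be produced with a small helper like the one sketched here. The function name write_distcc_hosts is hypothetical; distcc itself only reads the resulting file. Note also that in csh or tcsh, the export line from step 4 becomes setenv DISTCC_HOSTS 'localhost tintin asterix pogo'.

```shell
#!/bin/sh
# Sketch: write a distcc hosts file from a list of host names.
# write_distcc_hosts is a hypothetical helper, not a distcc command.
write_distcc_hosts() {
    dir=$1
    shift
    # Create the configuration folder if it doesn't exist yet.
    mkdir -p "$dir" || return 1
    # "$*" joins the remaining arguments with single spaces.
    printf '%s\n' "$*" > "$dir/hosts"
}
```

For the hosts used in this article, the call would be write_distcc_hosts "$HOME/.distcc" localhost tintin asterix pogo.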

How does distcc work?

distcc works by sending the preprocessed code to other designated machines in the network. The distccd daemon process on each remote machine carries out the compilation there. distcc is designed to be used with the parallel build (-j) option of GNU make. distcc isn't a compiler in itself; it just serves as a front end to g++. Almost all options of g++ can be passed as is to distcc.

Now that distcc has been installed, the only thing remaining to be done is to fire the build. Here's the invocation:

make -j4 CC=distcc -f makefile.x86_linux

Key things to keep in mind while working with distcc

For distcc to work to your advantage, you must keep several things in mind:

  • The machines must have identical configurations. This means the same version of the g++ compiler must be installed on all the machines, along with related build tools like ar, ranlib, libtool, and so on. The type and version of the operating system should also be the same.
  • From the client machine, distcc sends the preprocessed code to the server machines. Verify that the distccd daemon process is running on each server machine.
  • By default, the number of jobs that distcc schedules on a single machine is (no. of CPUs) + 2. For a single-core machine, this number is 3. Keep this in mind when you start the build: with a command line like make -j10 CC=distcc and only three single-core hosts, nine compile jobs are started initially, not ten.
  • Verify that the underlying machines can access the requisite file systems on which source files are stored. On Network File System (NFS) based systems, some source areas may not be mounted, which results in compilation failures. You must also carefully monitor network congestion.
  • distcc is used to compile the sources over the network. The linking steps are not distributed; they run on the local machine.
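
The job-count arithmetic in the list above can be sketched as follows. The function distcc_slots is hypothetical: it takes the CPU count of each host as an argument, applies the (CPUs + 2) default described above, and prints the sum, which you can then use as an upper bound for -j.

```shell
#!/bin/sh
# Sketch: total compile slots across hosts, assuming each host accepts
# (CPUs + 2) jobs as described above. distcc_slots is a hypothetical helper.
distcc_slots() {
    total=0
    # Each argument is the number of CPUs on one host.
    for cpus in "$@"; do
        total=$((total + cpus + 2))
    done
    printf '%s\n' "$total"
}
```

For three single-core hosts, distcc_slots 1 1 1 prints 9, so make -j9 CC=distcc already saturates them.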

What about those parts of the build process that must be run sequentially?

Some steps in the build process can't be parallelized -- using scripts to generate certain headers, linking, and so on must be performed on a single machine. To better handle this situation, it's a good idea to split the original makefile into a couple of makefiles, clearly demarcating those that can and can't be parallelized, and run them as follows (using && so that a failed stage stops the sequence):

tcsh-arpan$ make -f make.init && make CC=distcc -j4 -f make.compile_x86 && make -f make.link

Monitoring the distcc compilation process

The distcc installation includes a console-based monitoring tool called distccmon-text. Before starting the build process, it's worthwhile to open a separate terminal window and issue distccmon-text 5. This terminal then displays the compile status at the nodes in the network, refreshed every five seconds. Listing 7 shows a sample of the monitoring window.

Listing 7. Output from distccmon-text
2167  Compile     memory.c          tintin[0]
2164  Compile     main.cxx          tintin[1]
2192  Compile     ui_tcl.cxx        asterix[0]
2187  Compile     traverse.c        asterix[1]
2177  Compile     reports.cxx       pogo[0]
2184  Compile     messghandler.c    pogo[1]
2181  Compile     trace.cpp         localhost[0]
2189  Compile     remote.c          localhost[1]

Use ccache to further speed up compilation

Usually, when header files are modified in a C/C++ development framework, an average make-based system ends up recompiling all source files. Typically, however, header-file changes affect only a subset of the source files, so much of that time-consuming rebuild reproduces output that was already built before. ccache is a tool that takes advantage of this: it drastically reduces the time it takes to clean-build a system, often by a factor of 5 to 10.

ccache acts as a cache to the compiler. It works by creating a hash from the preprocessed sources and the compiler options used to compile the sources. While recompiling, if ccache detects no changes in the preprocessed source and compiler options, it retrieves its cached copy of the previously compiled output. This helps speed up the compilation process.
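
The idea of keying the cache on the preprocessed source plus the compiler options can be illustrated with a few lines of shell. This is only a sketch of the concept: the function name cache_key is hypothetical, and the POSIX cksum utility stands in for the hash that ccache actually uses, which is different and much stronger.

```shell
#!/bin/sh
# Sketch: derive a cache key from compiler options plus preprocessed source.
# cache_key is hypothetical, and cksum is only a stand-in for ccache's real hash.
cache_key() {
    flags=$1   # compiler options, for example "-O2 -c"
    src=$2     # path to an already preprocessed source file
    [ -r "$src" ] || return 1
    # Hash the options and the file contents together; print the checksum only.
    { printf '%s\n' "$flags"; cat "$src"; } | cksum | cut -d ' ' -f 1
}
```

Preprocessed input for such a function can be produced with g++ -E placer1.cxx > placer1.i. Identical options and identical preprocessed text give the same key, which is what lets a cache return the earlier object file instead of compiling again.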

Install ccache

To download version 2.4 of ccache (the latest at the time of writing), see the Resources section. Once in the ccache source directory, issue the command ./configure --prefix=/usr followed by make && make install; this installs the ccache binary in /usr/bin. If ccache isn't installed in /usr/bin, verify that the ccache location is defined as part of the PATH environment variable.

ccache environment variables

The following are some of the environment variables you can use to customize the ccache setup:

  • CCACHE_DIR -- Specifies the folder where ccache stores the cached compiler outputs. If you don't define this variable, then by default the cached output is stored in $HOME/.ccache.
  • CCACHE_TEMPDIR -- Specifies the folder where ccache puts temporary files that it generates. If you don't define this variable, then by default $HOME/.ccache is used. It's a good idea to define both this variable and CCACHE_DIR -- most organizations have a user quota for specific file-system areas, and if $HOME belongs to such an area the quota will quickly be exhausted. Explicitly setting the cache area avoids this problem.
  • CCACHE_DISABLE -- If set, tells ccache to invoke the compiler proper, bypassing the cache. Used for diagnostic purposes.
  • CCACHE_RECACHE -- If set, tells ccache to ignore the existing entries in the cache and call the compiler, but still cache the new results. Used for diagnostic purposes.
  • CCACHE_LOGFILE -- If set, tells ccache to record the hit and miss statistics from the cache in this file. Very useful for diagnostics.
  • CCACHE_PREFIX -- Adds a prefix to the command line that ccache uses to invoke the compiler proper. This is used in particular to interface ccache with distcc, as described in detail in the next section.
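
The variables above might be set up as in the following sketch, which keeps the cache, the temporary files, and the log file under one base folder outside $HOME. The function name setup_ccache_env and the cache, tmp, and ccache.log names are hypothetical choices for illustration; placing the cache and temporary folders under the same base keeps them on the same file system.

```shell
#!/bin/sh
# Sketch: point ccache at a scratch area instead of $HOME.
# setup_ccache_env and the folder layout are hypothetical.
setup_ccache_env() {
    base=$1
    # Create the cache and temporary folders if they don't exist yet.
    mkdir -p "$base/cache" "$base/tmp" || return 1
    export CCACHE_DIR="$base/cache"
    export CCACHE_TEMPDIR="$base/tmp"
    export CCACHE_LOGFILE="$base/ccache.log"
}
```

Call the function in the shell that starts the build, for example setup_ccache_env /scratch/$USER/ccache, where the path is a placeholder for an area with enough quota. This syntax is for sh-compatible shells; csh and tcsh use setenv instead of export.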

Use ccache

You can use ccache with or without distcc, and it doesn't depend on the -j option of make. The simplest usage of ccache is as follows: ccache g++ -o <executable name> <source file(s)>. When used with a makefile, overriding the CC variable suffices; see Listing 8.

Listing 8. Sample makefile using the CC variable
CC := g++
app1: placer1.o route1.o floorplan1.o
    $(CC) -o $@ $^ 
placer1.o: placer1.cxx
    $(CC) -o $@ -c $<
…

With the makefile in Listing 8, the syntax to issue make is make "CC=ccache g++".

To use distcc with ccache, you need to set the CCACHE_PREFIX environment variable to distcc, as follows: export CCACHE_PREFIX=distcc. (This syntax is valid for the bash shell. If you use another shell, modify the syntax accordingly.)

Here's a sample make invocation with ccache and distcc:

export CCACHE_PREFIX=distcc; make "CC=ccache g++" -j4 -f makefile.x86

The command that make issues for each source file during the build looks like this: ccache g++ -o placer1.o -c placer1.cxx, and so on. On a cache miss, ccache runs the real compiler with distcc prepended (distcc g++ ...). Note that ccache only needs to be installed on the local machine. ccache first checks whether the copy in the local cache suffices; if not, it hands the baton to distcc for distributed compilation.

Conclusion

This article has delved into the GNU make, distcc, and ccache tools, which can help you parallelize the build process. These tools come with several other features that can further customize this effort -- for example, ccache has a -M option that restricts the size of the cache; and the distcc installation has a GUI-based monitor called distccmon-gnome that tracks network build activity (it's created if distcc is configured with the --with-gtk option). The links in the Resources section provide further information.

Resources

Learn

Get products and technologies

Discuss
