Reduce compile time with distcc

A fast, free distributed method for C/C++ compilation

Level: Intermediate

Laurence Bonney (bonneyl@uk.ibm.com), WebSphere MQ JMS Test Team Lead, IBM

22 Jun 2004

Some people prefer the convenience of pre-compiled binaries in the form of RPMs or other such installer methods. But this can be a false economy, especially with programs that are used frequently: precompiled binaries will never run as quickly as those compiled with the right optimizations for your own machine. If you use a distributed compiler, you get the best of both worlds: fast compile and faster apps. All you need is distcc.

Given the nature of open source software, many Linux™ applications are distributed in a "tarball" containing source code that you must build before you can run the application. Larger applications can take several hours to build. This article shows how you can use the distributed C compiler, distcc, to speed up the compilation of these sources so you can start using them sooner.

Tarball advantages

Some Linux applications are available as RPM (Red Hat Package Manager) files. These files generally help end users quickly get up and running with the application. However, particularly with open source software, there is often a .tar.gz option (a "tarball"), usually containing source code that the end user would need to build. Although these are a little trickier to set up than their RPM cousins, there are several advantages to tarball methods:

  • The applications generally end up installed in /usr/local, meaning you can easily zap the install files on your machine and still keep your applications.
  • You can tweak the code to better suit it to your own needs.
  • Many optimizations are available.

This last point is my favorite. Being a bit of a power user, I like to be able to tell programs to make maximum use of my Athlon XP processor; I am able to do this at compile time. There is one caveat: turning on optimizations increases the build time. The compiler attempts to do clever things, such as unrolling loops and pulling constants out of them. The end result is extremely quick code at the expense of build time.

Let's take a look at a typical application: OpenSSH. I've just downloaded the openssh-3.7p1.tar.gz tarball from the Web site (see Resources for a link), and I'm going to use this as a test application.

First, I extract the tarball:

me@mymachine:~> tar xvzf openssh-3.7p1.tar.gz

Here, x extracts the archive, v gives verbose output, z tells the tar command to gunzip (uncompress) the file, and f says that the next argument is the name of the tar file I wish to extract. In the case of a .tar.bz2 file, I would replace the z with a j to indicate that tar should bunzip2 the file rather than gunzip it. To find out more about tar and its options, refer to the man pages by typing man tar at the command line.
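As a small illustration of those flags, the sketch below builds a throwaway tarball in a temporary directory, lists it, and extracts it again. The demo-1.0 name and the one-line source file are made up for the example and have nothing to do with OpenSSH.

```shell
# Create a scratch area and a tiny, made-up source tree (demo-1.0 is hypothetical).
workdir=$(mktemp -d)
mkdir "$workdir/demo-1.0"
echo 'int main(void) { return 0; }' > "$workdir/demo-1.0/main.c"

# c = create, z = gzip, f = the next argument is the archive file name.
tar -C "$workdir" -czf "$workdir/demo-1.0.tar.gz" demo-1.0

# t = list the archive contents without extracting anything.
listing=$(tar -tzf "$workdir/demo-1.0.tar.gz")
echo "$listing"

# x = extract; -C chooses the directory to extract into.
mkdir "$workdir/out"
tar -C "$workdir/out" -xzf "$workdir/demo-1.0.tar.gz"
```

Listing an unfamiliar tarball with t before extracting it is a cheap way to check that it unpacks into its own directory.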

Then I'll switch to the newly created openssh directory:

me@mymachine:~> cd openssh-3.7p1

I'm going to specify some compiler options for gcc, since this is the tool I'm going to use to build the source files, and I want to take advantage of my machine's features. Since I'm using bash, I'm going to use the export command (on tcsh or similar, use the setenv command):

me@mymachine:~/openssh-3.7p1> export CFLAGS="-O3 -march=athlon-xp \
-funroll-loops -fexpensive-optimizations"
me@mymachine:~/openssh-3.7p1> export CXXFLAGS="$CFLAGS"

Note the -march flag. Because my workstation has an AMD Athlon XP processor in it, I can use this handy -march=athlon-xp switch in gcc 3.x to automatically turn on the processor-specific optimizations such as SSE. I can also use -march=pentium4, or -march=pentiumpro, or leave it out altogether. Check the man pages for gcc for a complete list and description of available optimizations.

That'll do for compiler options. Once you're happy with a set of options, you'll be pleased to know you can place the export lines inside your ~/.bashrc file so that every new bash session defaults to them.
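A ~/.bashrc fragment along these lines would do it. This is only a sketch that reuses the flags from this article; the -march value in particular is specific to the author's Athlon XP and should be changed to match your own processor.

```shell
# Fragment for ~/.bashrc: the same flags as above, set for every new shell.
# -march=athlon-xp suits the article's CPU; pick the value for your own processor.
export CFLAGS="-O3 -march=athlon-xp -funroll-loops -fexpensive-optimizations"
export CXXFLAGS="$CFLAGS"   # quoted so the whole flag list is copied as one value

# Quick check after opening a new shell:
echo "CFLAGS=$CFLAGS"
```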

Next I need to configure the build for my machine with the options I want to include in my SSH build. I can see these options by typing:

me@mymachine:~/openssh-3.7p1> ./configure --help

I could include some or all of these options if I wanted, but I'm happy with the default ones for now, so I'll just run configure on its own:

me@mymachine:~/openssh-3.7p1> ./configure

Glossary of terms

make
A free software build utility that uses a makefile to compile and link source code to create an executable binary. See also man make.

configure
A script that generates makefiles for make, forming the first part of the tarballer's mantra: ./configure && make && make install.

makefile
A configuration file used by make; it defines the location of the source files that contain the source code, and how they will be compiled and linked.

automake
Generates Makefile.in templates from Makefile.am files; configure later turns these templates into makefiles. Works "behind-the-scenes." See also man automake.

make clean
Removes all the old binary object files and executables, allowing you to perform a "clean" build of a program that has been previously compiled.

build
The process of preprocessing, compiling, and linking an application.

preprocess
The process of expanding "include" files and macros before compilation (some languages don't require this).

compile
The process of converting human-readable, textual source code into machine-readable, binary object code.

linking
The process of combining separate files containing binary object code into a single executable, or linking in outside libraries.

tarball
A single file that contains many files, used for archival and backup purposes, as well as for distributing source files. Often compressed with gzip or bzip2, tar files (or tarballs) are packed and expanded using the tar command. See also man tar.

Now all I need to do is build the source code, easily achieved by using the make command:

me@mymachine:~/openssh-3.7p1> make

This is the point where I grab a coffee, since this usually takes some time. Once this is done, I have all the parts of OpenSSH I requested, and the resulting OpenSSH binary has all the optimizations I gave to my compiler. I timed the build, and this took 2 minutes and 25 seconds, easily enough time to get that coffee.
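The article does not say how the build was timed. One common approach is simply to prefix the command with the shell's time keyword (time make). The sketch below does the same job with a small helper, here called timed, which is a made-up name for this example.

```shell
# timed: run any command and report elapsed wall-clock seconds on stderr.
# (Hypothetical helper; resolution is one second because it relies on date +%s.)
timed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed: $((end - start))s" >&2
    return "$status"
}

# Stand-in command for demonstration; for a real build you would run: timed make
timed sleep 1
```

The helper passes the command's exit status through, so a failed build is still reported as a failure.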

However, I'm unhappy with this time. My computer has been busy for 2 minutes and 25 seconds when it could have been doing something else. Two minutes doesn't seem like a long time, but OpenSSH is a very small application. In the case of a much larger program, or when you are developing and compiling your code dozens of times a day, builds can eat up as many as several hours out of your day. I'm a busy guy, and can't afford that kind of downtime. So, armed with impatience, I'm going to get distcc.





Compiling with distcc

distcc is a little application that hooks onto the gcc compiler and allows the compilation to occur on other machines where distcc is installed. The first step is to get distcc onto your workstation, so download the latest version from the Web site (see Resources for a link).

If you're running on SUSE Linux, then you can get packages from SUSE or off the installation media; for Gentoo Linux, you can run emerge distcc; Debian lets you apt-get install distcc; and there's a FreeBSD port for it if you're so inclined.

For anyone else (or just those who like tarballs, like me) get the .tar.gz file and:

me@mymachine:~/distcc-2.12.1> ./configure --with-gtk
me@mymachine:~/distcc-2.12.1> make

then become the superuser and install:

me@mymachine:~/distcc-2.12.1> sudo make install

This should set up all the required files, and the distcc daemon (distccd) should be living in /etc/init.d/distccd for a nice automatic start on boot. If it's not appropriately linked into the rc.d directories, you can do that yourself.

Rather than reboot (we are trying to save ourselves time, after all), we'll just start the daemon manually for now:

me@mymachine:~/distcc-2.12.1> sudo /etc/init.d/distccd start

It's worth noting that it is possible to run the distcc daemon even if you don't have root access, which is nice. The distcc daemon will just run under your username on whichever machine it's started.

Now, just having distcc on one machine is pointless; this won't really give us any benefit. I'm going to find three friends on my LAN who are running Linux and see if they're interested, since everyone who installs distcc can benefit from the "pool."

It is also worth noting that apart from the version of gcc you are running, there doesn't need to be anything else common about the machines: they needn't share a filesystem, header files, or libraries, or even be running the same Linux kernel or distribution.

After this is done, I need to tell distcc which machines are available for it to use. Let's call them "flim," "flam," and "jabberwocky." I do this with another export, this time setting the environment variable DISTCC_HOSTS (this can also be placed in ~/.bashrc for more permanent use):

me@mymachine:~> export DISTCC_HOSTS="mymachine flim flam jabberwocky"

However, my machine isn't quite as fast as flim and jabberwocky are, so I'll move them up the list. distcc seems to work on a first-come, first-served basis, handing work to hosts nearer the front of the list first:

me@mymachine:~> export DISTCC_HOSTS="flim jabberwocky mymachine flam"
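Host order is not the only knob. As far as I know, distcc's host specification also accepts a /LIMIT suffix that caps how many jobs are sent to a host at once; check man distcc for your version before relying on it. The limits below are illustrative numbers, not measurements from this article.

```shell
# Hosts in preference order, fastest first (host names are the article's).
# The /N suffix caps concurrent jobs per host; these limits are made-up examples.
hosts="flim/4 jabberwocky/4 mymachine/2 flam/2"
export DISTCC_HOSTS="$hosts"

# Print one host spec per line to double-check the order before building.
for h in $DISTCC_HOSTS; do
    echo "$h"
done
```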

We should be all set now. Let's revisit our OpenSSH build and see how it fares when performed on four machines instead of just one:

me@mymachine:~> cd openssh-3.7p1

Because you've already exported CFLAGS, CXXFLAGS, and DISTCC_HOSTS, the settings are still in effect in this shell session and you can simply carry on. If you placed them in ~/.bashrc instead, every new shell picks them up automatically.

Now clean up the previous make results, to get a blank canvas:

me@mymachine:~/openssh-3.7p1> make clean

One more thing before you start. The distcc program comes with a monitor so you can see which source files are compiling on which machine. Since you used the --with-gtk option when you built distcc, you should have two choices: distccmon-text and distccmon-gnome. Let's stick with the console version for now. Start a new terminal session and run:

me@mymachine:~/openssh-3.7p1> distccmon-text 2

to update every two seconds. Alternatively (my preferred method):

me@mymachine:~/openssh-3.7p1> watch distccmon-text

Both of these achieve the same thing: they present you with a snapshot of the distributed compilation every two seconds. With the monitor running, you can run configure. You need to pass configure an option so it knows not to use regular gcc, which my Linux system will default to in the absence of any other instructions:

me@mymachine:~/openssh-3.7p1> CC=distcc ./configure

The distcc monitor might "blip" with a little activity as one or two parts of the configuration are done on other machines. After the configure has completed, you're ready to do the actual compilation:

me@mymachine:~/openssh-3.7p1> make -j 12

I've passed the -j option to make. This isn't a distcc-specific thing; rather, the -j flag tells make how many jobs (here, compiles) to run at once. It's perfectly possible to run make with -j on machines not running distcc, and setting -j to 2 on a single CPU can sometimes speed things up (but not significantly). However, we've specified 12, indicating that make should build up to twelve source files at once if possible.

Let's look at our distcc monitor, and see what it's doing:


Listing 1. The distcc command-line monitor
  5366  Preprocess  serve.c                       flim[0]
  5338  Compile     minilzo.c                     flim[1]
  5363  Preprocess  prefork.c                     flim[2]
  5360  Compile     ncpus.c                jabberwocky[0]
  5352  Compile     dparent.c              jabberwocky[1]
  5356  Compile     dsignal.c              jabberwocky[2]
  5349  Compile     dopt.c                   mymachine[0]
  5279  Compile     trace.c                  mymachine[1]
  5375  Preprocess  srvnet.c                 mymachine[2]
  5342  Compile     access.c                      flam[0]
  5346  Compile     daemon.c                      flam[1]
  5371  Preprocess  setuid.c                      flam[2]

Using the distcc monitor, you can see which files are compiling on which nodes. The numbers after the node names on the right-hand side indicate that it's the n-th concurrent compile. Here, since we have four nodes and specified -j as 12, we have three files compiling on each machine. This makes a lot of sense, as there is some network overhead in shuffling the required files around, and if there were only one compile per node (in other words, -j 4) then the CPUs would spend quite some time idling.
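The arithmetic behind -j 12 (four nodes, three compiles each) can be sketched as a small shell function. The name jobs_for_hosts and the default of three jobs per host are assumptions made for this example; they mirror the figures in this article and are not something distcc provides.

```shell
# jobs_for_hosts: suggest a make -j value as (number of hosts) x (jobs per host).
# Hypothetical helper; the default of 3 per host mirrors the article's -j 12 on four nodes.
jobs_for_hosts() {
    host_list=$1
    per_host=${2:-3}
    count=0
    # Unquoted on purpose: split the host list on whitespace and count the entries.
    for h in $host_list; do
        count=$((count + 1))
    done
    echo $((count * per_host))
}

DISTCC_HOSTS="flim jabberwocky mymachine flam"
echo "make -j $(jobs_for_hosts "$DISTCC_HOSTS")"
```

With the four hosts above, this prints make -j 12.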

Timing the build across those four machines shows that it takes just short of 9 seconds to compile, down from 2 minutes and 25 seconds, which is around a sixteen-fold speed increase. Compilation with distcc allows you to take advantage of nodes that are significantly quicker than your own, while still giving you applications built and optimized for your personal workstation.
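The speedup figure follows directly from the two timings. As a quick check, using 145 seconds for the single-machine build and 8.9 seconds (the -j 12 figure from Listing 2) for the distributed one:

```shell
# Speedup = single-machine time / distributed time, using the article's figures.
single=145        # 2 minutes 25 seconds, expressed in seconds
distributed=8.9   # "just short of 9 seconds" (the -j 12 row of Listing 2)

# LC_ALL=C keeps the decimal point a "." regardless of the user's locale.
speedup=$(LC_ALL=C awk -v a="$single" -v b="$distributed" 'BEGIN { printf "%.1f", a / b }')
echo "speedup: ${speedup}x"
```

This prints speedup: 16.3x, which matches the "around sixteen" quoted above.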

Just to see the effect different values of -j have, let's try varying them and rebuilding:


Listing 2. How the number of simultaneous compiles affects build time
     -j value                            build time (seconds)

     4                                           19.5
     8                                           10.5
     12                                          8.9
     16                                          8.5
     20                                          8.6

You can see that altering the -j value has benefits, but different configurations yield different results, so it's probably worth experimenting. Another point to note is that if you have more than a handful of machines in your distcc cluster, it's worth removing your local machine from the list, since it will be too busy delegating the various source files to machines and receiving built object files to burden itself with compilation. Indeed, leaving your machine in may slow down the build process.
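If you want to repeat that experiment on your own cluster, a loop like the following is one way to do it. It is only a sketch: sweep_jobs is a made-up helper name, it assumes you run it from the top of an already-configured source tree, and it measures whole seconds only, so it will not reproduce the tenths shown in Listing 2.

```shell
# sweep_jobs: rebuild once per -j value and print "value seconds" for each run.
# Hypothetical helper; run it from the top of an already-configured source tree.
sweep_jobs() {
    for j in "$@"; do
        make clean >/dev/null 2>&1        # start every run from a clean tree
        start=$(date +%s)
        make -j "$j" >/dev/null 2>&1      # build output is discarded; only time is reported
        end=$(date +%s)
        printf '%-8s %s\n' "$j" "$((end - start))"
    done
}

# Usage (from the configured openssh-3.7p1 directory):
#   sweep_jobs 4 8 12 16 20
```

Because build output is discarded, a run that fails outright will look suspiciously fast; check a normal make by hand if the numbers seem too good.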

There is one final point to note with respect to versioning. The distcc program works best if you keep to the same minor version of gcc across all the nodes in the distcc cluster; having different minor versions can cause unstable builds or even fail the build process completely, as parts of gcc have changed enough to cause this. For example, if mymachine, flim, and jabberwocky from the above example were running gcc 3.3.1, and flam was running gcc 3.2.2, then the build of OpenSSH might complete successfully, or it might fail, depending on which parts are built on which machines. Be warned that even a successful build may not function as expected in this instance.

Sticking to the same minor versions (for example, gcc 3.3.4 and gcc 3.3.1 are both gcc 3.3.x, and therefore will be fine with each other) is the best policy as all the builds will be nice and stable, and if they fail, it's probably not a distcc-related issue.
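The "same minor version" rule can be expressed as a comparison of the first two fields of the version string. The helpers below (major_minor and same_series) are made-up names for this sketch; the version numbers are the ones used in the example above.

```shell
# major_minor: reduce a version such as 3.3.4 to its first two fields (3.3).
major_minor() {
    echo "$1" | cut -d. -f1,2
}

# same_series: succeed only if both versions share the same major.minor.
same_series() {
    [ "$(major_minor "$1")" = "$(major_minor "$2")" ]
}

same_series 3.3.4 3.3.1 && echo "3.3.4 and 3.3.1: same series"
same_series 3.3.1 3.2.2 || echo "3.3.1 and 3.2.2: mismatch"
```

On a real cluster you could feed it the output of gcc -dumpversion from each node, for instance collected over ssh if you have ssh access to the machines (an assumption; the article does not describe how the nodes are administered).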





Summary

  1. Install distcc on all the machines that you want to use for compilation.
  2. Start the distcc daemon on each of these machines.
  3. Export the DISTCC_HOSTS environment variable with their names.
  4. Start the distcc monitor (so you can see what's going on!).
  5. Instead of configuring with: ./configure, use CC=distcc ./configure.
  6. Instead of making with make or make -j 2, use make -j n, where n is two or three times the number of machines in DISTCC_HOSTS.

If you have programs that would benefit from optimization, then "rolling your own" binaries from the source code is the way to go. This can be comparatively expensive in terms of wall-clock time for larger builds, so using distcc allows you (and everyone else) to mop up those idle CPU cycles on the network and get up and running as quickly as possible.



Resources

  • You can download tarballed sources of distcc from Samba, and of OpenSSH from the OpenSSH site.

  • distcc works with the GNU C compiler (gcc).

  • Laurence prefers to unpack and build sources from tar files instead of using RPMs. Learn more about compiling programs from sources with A beginner's guide to compiling programs under Linux (Linux Users of Victoria), Kim Oldfield's hands-on guide for people who've never compiled a program under Linux before. Compiling Programs on Linux (Linux Gazette, 1999) by JC Pollman includes all of the above and some troubleshooting hints as well. (Don't forget to also read the man pages!)

  • The IBM developerWorks tutorial, Compiling and installing software from sources (developerWorks, 2000) by Daniel Robbins will also get you started.

  • Compiling with optimized flags makes binaries run much faster. To learn more, see Programming Optimization (a zillion monkeys, 2002) by Paul Hsieh as well as Safe flags to use for gentoo-1.4 and Experimental flags to use for gentoo-1.4 (freehackers.org, 2002).

  • The article distcc optimizations by Benjamin Meyer describes using distcc to make a small compiler farm. Benjamin recommends using distcc in conjunction with ccache and unsermake.

  • Find more resources for Linux developers in the developerWorks Linux zone.

  • Browse for books on these and other technical topics.

  • Develop and test your Linux applications using the latest IBM tools and middleware with a developerWorks Subscription: you get IBM software from WebSphere, DB2, Lotus, Rational, and Tivoli, and a license to use the software for 12 months, all for less money than you might think.

  • Download no-charge trial versions of selected developerWorks Subscription products that run on Linux, including WebSphere Studio Site Developer, WebSphere SDK for Web services, WebSphere Application Server, DB2 Universal Database Personal Developers Edition, Tivoli Access Manager, and Lotus Domino Server, from the Speed-start your Linux app section of developerWorks. For an even speedier start, help yourself to a product-by-product collection of how-to articles and tech support.


About the author

Laurence Bonney is a software engineer at IBM Hursley Labs in the United Kingdom. He works as the Technical Team Leader of the test team working on the IBM WebSphere MQ JMS product. In his spare time he plays guitar (badly), goes surfing as much as his vacation will allow, and plays video games. You can reach Laurence at bonneyl@uk.ibm.com.



