High performance Linux clustering, Part 2

Build a working cluster

Writing parallel programs and configuring your system

Content series:

This content is part # of # in the series: High performance Linux clustering, Part 2

Stay tuned for additional content in this series.

This content is part of the series:High performance Linux clustering, Part 2

Stay tuned for additional content in this series.

Part 1 of this series, Clustering fundamentals, discusses the types of clusters, uses of clusters, HPC fundamentals, and reasons for the growth of clustering technology in High Performance Computing.

This article covers parallel algorithms, and shows you how to write parallel programs, set up clusters, and benchmark clusters. We look at parallel programming using MPI and the basics of setting up a Linux cluster. In this article, meet OSCAR, an open source project that helps you set up up robust clusters. Also, get an overview of cluster management and benchmarking concepts, complete with detailed steps to run the standard LINPACK tests on a cluster.

If you have installed Linux, you will be able to install and troubleshoot a Linux cluster after you read this article. And the helpful links in Related topics below will help you get learn more about clustering.

Clusters and parallel programming platforms

As you saw in Part 1, HPC is mostly about parallel programming. Parallel programming is a fairly well-established field, and several programming platforms and standards have evolved around it over the past two decades.

The two common hardware platforms used in HPC are shared memory systems and distributed memory systems. For details, refer to Part 1.

On shared memory systems, High Performance FORTRAN is a language suited for parallel programming. It makes effective use of data parallelism and can act on entire arrays at once by executing instructions on different indexes of an array in different processors. Consequently this provides automatic parallelization with minimal effort on your part. (The Jamaica project is an example where a standard Java program is re-factored using a special compiler to generate multithreaded code. The resulting code then automatically takes advantage of SMP architecture and executes in parallel.)

On distributed memory systems, the situation is radically different because the memory is distributed; you must write code that is aware of the underlying distributed nature of the hardware and use explicit message passing to exchange messages between different nodes. Parallel Virtual Machines (PVM) were once a popular parallel programming platform for this, but lately MPI has become the de facto standard for writing parallel programs for clusters.

High-quality implementations for MPI are freely available for FORTRAN, C, and C++ for Linux. Two popular MPI implementations are


Later in this article, we'll set up an MPICH-based Linux cluster and see an example of an MPI-based program.

Creating a simple Linux cluster

One of the most interesting things about clustering is that you can build Linux-based clusters with minimal effort if you have basic Linux installation and troubleshooting skills. Let's see how this is done.

For our cluster we'll use MPICH and a set of regular Linux workstations. For simplicity, and to emphasize fundamentals, we'll build just the bare minimum system that you can use to run a parallel program in a clustered environment.

The seven steps in this section show how to build our bare-bones system. Building robust clusters and managing them involve more effort and will be covered later in this article.

Step 1

You need at least two Linux machines if you want a real cluster. Two VMware images will also do just fine. (With VMware, obviously you do not expect any performance benefits. In fact, there will definitely be a performance hit because the CPU will be shared.) Make sure these machines can ping each other by name. If not, add appropriate entries in /etc/hosts.

Step 2

Install the GNU C compiler and GNU FORTRAN compiler.

Step 3a

Configure SSH for all your nodes to allow it to execute commands without being asked for a password. The aim is to get something like ssh -n host whoami to work without being asked for a password. SSH will be used as the way to communicate between different machines. (You can use rsh also for this purpose.)

Step 3b

ssh-keygen -f /tmp/key -t dsa will give you a private key in a file called key and a public key in a file called

Step 3c

If you are building your cluster as root and will be running your programs as root (obviously do this only during experimentation), then copy the private key into the file /root/.ssh/identity and the public key into the file /root/.ssh/authorized_keys in all nodes of your cluster.

To confirm that everything works fine, execute the following command: ssh -n hostname 'date' and see if the command executes without errors. You should test this for all nodes to be sure.

Note: You may have to configure your firewall to allow the nodes to communicate with each other.

Step 4a

Now we'll install MPICH. Download the UNIX version of MPICH from the Web site (see Related topics for a link. The following is a quick overview.

Step 4b

Let's say you downloaded mpich.tar.gz in /tmp:

cd /tmp
tar -xvf mpich.tar.gz (Let's say you get a directory /tmp/mpich-1.2.6 as a result)
cd /tmp/mpich-1.2.6

Step 4c

./configure -rsh=ssh -- this tells MPICH to use ssh as its communication mechanism.

Step 4d

make -- at the end of this step you should have your MPICH installation ready for use.

Step 5

To make MPICH aware of all your nodes, edit the file /tmp/mpich-1.2.6/util/machines/machines.LINUX and add hostnames of all nodes to this file so that your MPICH installation becomes aware of all your nodes. If you add nodes at a later stage, edit this file.

Step 6

Copy the directory /tmp/mpich-1.2.6 to all nodes in your cluster.

Step 7

Run a few test programs from the examples directory:

  • cd /tmp/mpich-1.2.6/utils/examples
  • make cpi
  • /tmp/mpich-1.2.6/bin/mpirun -np 4 cpi -- instructs MPICH to run it on four processors; if your setup has less than four, don't worry; MPICH will create processes to compensate for the lack of physical hardware.

Your cluster is ready! As you can see, all the heavy lifting is being done by the MPI implementation. As noted earlier, this is a bare-bones cluster, and much of it is based on doing manual work by making sure machines can communicate with each other (ssh is configured, MPICH is manually copied, etc.).

Open source cluster application resources

Clearly, it will be hard to maintain the cluster above. It is not convenient to copy files to every node, set up SSH and MPI on every node that gets added, make appropriate changes when a node is removed, and so on.

Fortunately, excellent open source resources can help you set up and manage robust production clusters. OSCAR and Rocks are two examples. Most of the things we did to create our cluster are handled by these programs in an automated manner.

Figure 1 is a schematic of a simple cluster.

Figure 1. A simple sample cluster
A simple sample cluster
A simple sample cluster

OSCAR also supports automated Linux installations using PXE (Portable Execution Environment). OSCAR also helps with these functions:

  • Automated Linux installation on compute nodes.
  • DHCP and TFTP configuration (for Linux installation using PXE). Most new computers have a BIOS that allows you to boot the machine using a DHCP server. The BIOS has a built-in DHCP client that requests an operation system image that is transferred to it from the DHCP server using TFTP. This Linux image is created by OSCAR, and the DHCP and TFTP installation and configuration are also handled by OSCAR.
  • SSH configuration.
  • Automatic NFS setup.
  • MPI installation and configuration (both MPICH and LAM/MPI).
  • PVM installation and configuration (if you want to use PVM instead of MPI).
  • Private subnet configuration between head node and compute nodes.
  • Scheduler (Open PBS and Maui) installation for automatic management of multiple users submitting jobs to the cluster.
  • Ganglia installation for performance monitoring.
  • Provision to add/remove nodes.

Currently OSCAR supports several versions of Red Hat Linux; for other supported distributions, visit the OSCAR Web site (see Related topics for a link). You may have to tweak a few of the scripts based on the errors you get during setup.

Creating a Linux cluster using OSCAR

Let's start with OSCAR sources and build a fully functional cluster. Say you have two or more regular workstations at your disposal with network connectivity. Make one of these the head node and the rest the compute nodes.

As we did when building the Linux cluster, we'll go through the steps to be performed on the head node. OSCAR will configure all other nodes automatically, including the OS installation. (See the OSCAR installation guide; the following is a conceptual overview of the installation process.)

Step 1

Install Linux on the head node. Make sure you install an X server also.

Step 2

mkdir /tftpboot, mkdir /tftpboot/rpm. Copy all RPMs from the installation CD to this directory. These RPMs are used to build the client image. Not all RPMs will be eventually required, but it's good to have them to automatically build the image.

Step 3

It helps to ensure that MySQL is installed and correctly configured and that you can access MySQL from Perl since OSCAR stores all its configuration information in MySQL and uses Perl to access it. This step is usually optional and OSCAR does this for you, but sometimes this step fails, especially if you are installing OSCAR on an unsupported distribution.

Step 4

Download OSCAR sources and compile:

make install

Step 5

Start the OSCAR wizard. Assuming you want the cluster to use eth1 for connectivity between the cluster nodes, use /opt/oscar/install_cluster eth1.

Step 6

At this stage, go through the OSCAR screens step by step. Be sure to execute the steps in correct order:

  1. Select packages to customize your OSCAR install. If you are not familiar with these packages, ignore this for the moment.
  2. Install OSCAR packages.
  3. Build Client image. This what the compute nodes will use.
  4. Define OSCAR clients. This defines the compute nodes. You need to specify the number of nodes you want to use and the subnet they will use. If you're not sure now how many nodes you'll have, you can modify this at a later stage.
  5. Map MAC address of different nodes to IP addresses. For this step, each node must be booted using the PXE network boot option in the BIOS.

Step 7

Finally, run the tests. If all goes well, each test should succeed. Sometimes the tests fail on the first try even when nothing is wrong. You can always run the tests by manually executing the test scripts from /opt/oscar.

If you want to add new nodes now, start the OSCAR wizard again and add nodes. The Linux OS will be installed on these nodes automatically by OSCAR using PXE.

Now that the cluster is ready, you can run parallel programs, add or remove new nodes based on need, and monitor the status with Ganglia.

Managing the cluster

When it comes to managing a cluster in a production environment with a large user base, job scheduling and monitoring are crucial.

Job scheduling

MPI will start and stop processes on the various nodes, but this is limited to one program. On a typical cluster, multiple users will want to run their programs, and you must use scheduling software to ensure optimal use of the cluster.

A popular scheduling system is OpenPBS, which is installed automatically by OSCAR. Using it you can create queues and submit jobs on them. Within OpenPBS, you can also create sophisticated job-scheduling policies.

OpenPBS also lets you view executing jobs, submit jobs, and cancel jobs. It also allows control over the maximum amount of CPU time available to a particular job, which is quite useful for an administrator.


An important aspect of managing clusters is monitoring, especially if your cluster has a large number of nodes. Several options are available, such as Ganglia (which comes with OSCAR) and Clumon.

Ganglia has a Web-based front end and provides real-time monitoring for CPU and memory usage; you can easily extend it to monitor just about anything. For example, with simple scripts you can make Ganglia report on CPU temperatures, fan speeds, etc. In the following sections, we will write some parallel programs and run them on this cluster.

A parallel algorithm

Parallel programming has its own set of parallel algorithms that take advantage of the underlying hardware. Let's take a look at one such algorithm. Let's say one node has a set of N integers to add. The time to do this in the regular way scales as O(N) (if it takes 1ms to add 100 integers, it takes roughly 2ms to add 200 integers, and so on).

It may seem hard to improve upon the linear scaling of this problem, but quite remarkably, there is a way. Let's say a program is executing in a cluster with four nodes, each node has an integer in its memory, and the aim is to add these four numbers.

Consider these steps:

  1. Node 2 sends its integer to node 1, and node 4 sends its integer to node 3. Now nodes 1 and 3 have two integers each.
  2. The integers are added on these 2 nodes.
  3. This partial sum is sent by node 3 to node 1. Now node 1 has two partial sums.
  4. Node 1 adds these partial sums to get the final sum.

As you can see, if you originally had 2N numbers, this approach could add them in ~N steps. Therefore the algorithm scales as O(log2N), which is a big improvement over O(N) for the sequential version. (If it takes 1ms to add 128 numbers, it takes (8/7) ~ 1.2ms to add 256 numbers using this approach. In the sequential approach, it would have taken 2ms.)

This approach also frees up half the compute nodes at each step. This algorithm is commonly referred to as recursive halving and doubling and is the underlying mechanism behind the class of reduce function calls in MPI, which we discuss next.

Parallel algorithms exist for a lot of practical problems in parallel programming.

Parallel programming using MPI to multiply a matrix by a vector

Now that you know the fundamentals of parallel programming platforms and have a cluster ready, let's write a program that makes good use of the cluster. Instead of a traditional "hello world", let's jump straight to the real thing and write an MPI-based program to multiply two matrices.

We'll use the algorithm described in the section on parallel algorithms to solve this problem elegantly. Let's say we have a 4X4 matrix that we want to multiply with a vector (a 4X1 matrix). We'll make a slight change to the standard technique for multiplying matrices so that the previous algorithm can be applied here. See Figure 2 for an illustration of this technique.

Figure 2. Multiplying a matrix by a vector
Multiplying a matrix by a vector
Multiplying a matrix by a vector

Instead of multiplying the first row by the first column and so on, we'll multiply all elements in the first column by the first element of the vector, second column elements by the second element of the vector, and so on. This way, you get a new 4X4 matrix as a result. After this, you add all four elements in a row for each row and get the 4X1 matrix, which is our result.

The MPI API has several functions that you can apply directly to solve this problem, as shown in Listing 1.

Listing 1. Matrix multiplication using MPI

To compile: 'mpicc -g -o matrix matrix.c'
To run: 'mpirun -np 4 matrix'

"-np 4" specifies the number of processors.


#include <stdio.h>
#include <mpi.h>
#define SIZE 4

int main(int argc, char **argv) {
   int j;
   int rank, size, root;
   float X[SIZE];
   float X1[SIZE];
   float Y1[SIZE];
   float Y[SIZE][SIZE];
   float Z[SIZE];
   float z;
   root = 0;
   /* Initialize MPI. */
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);

/* Initialize X and Y on root node. Note the row/col alignment. This is specific  to C */
   if (rank == root) {
      Y[0][0] = 1; Y[1][0] = 2; Y[2][0] = 3; Y[3][0] = 4;
      Y[0][1] = 5; Y[1][1] = 6; Y[2][1] = 7; Y[3][1] = 8;
      Y[0][2] = 9; Y[1][2] = 10;Y[2][2] = 11;Y[3][2] = 12;
      Y[0][3] = 13;Y[1][3] = 14;Y[2][3] = 15;Y[3][3] = 16;
      Z[0] = 1;
      Z[1] = 2;
      Z[2] = 3;
      Z[3] = 4;
   /*  root scatters matrix Y in 'SIZE' parts to all nodes as the matrix Y1 */
   /* Z is also scattered to all other nodes from root. Since one element is sent to
      all nodes, a scalar variable z is used. */
   MPI_Scatter(Z,1,MPI_FLOAT,&z,1,MPI_FLOAT, root,MPI_COMM_WORLD);
   /* This step is carried out on all nodes in parallel.*/
      X1[j] = z*Y1[j];
   /* Now rows are added, using MPI_SUM (using recursive halving and doubling algorithm,
      internal to the MPI implementation) */
   if (rank == 0) {
       return 0;

Measuring performance

Clusters are built to perform, and you need to know how fast they are. It's common to think that the processor frequency determines performance. While this is true to a certain extent, it is of little value in comparing processors from different vendors or even different processor families from the same vendor because different processors do different amounts of work in a given number of clock cycles. This was especially obvious when we compared vector processors with scalar processors in Part 1.

A more natural way to compare performance is to run some standard tests. Over the years a test known as the LINPACK benchmark has become a gold standard when comparing performance. It was written by Jack Dongarra more than a decade ago and is still used by (see Related topics for a link).

This test involves solving a dense system of N linear equations, where the number of floating-point operations is known (of the order of N^3). This test is well suited to speed test computers meant to run scientific applications and simulations because they tend to solve linear equations at some stage or another.

The standard unit of measurement is the number of floating-point operations or flops per second (in this case, a flop is either an addition or a multiplication of a 64-bit number). The test measures the following:

  • Rpeak, theoretical peak flops. In a June 2005 report, IBM Blue Gene/L clocks at 183.5 tflops (trillion flops).
  • Nmax, the matrix size N that gives the highest flops. For Blue Gene this number is 1277951.
  • Rmax, the flops attained for Nmax. For Blue Gene, this number is 136.8 tflops.

To appreciate these numbers, consider this: What IBM BlueGene/L can do in one second, your home computer may take up to five days.

Now let's discuss how you can benchmark your own Linux cluster. Other tests besides LINPACK are the HPC Challenge Benchmark and the NAS benchmarks.

Benchmarking your Linux cluster

To run the LINPACK benchmark on your Linux cluster, you need to get the parallel version of LINPACK and configure it for the cluster. We'll go through this process step by step.

Warning: The following uses generic linear algebra libraries; use it only as a guideline. For a true test, get libraries that have been optimized for your environment.

Step 1

Download hpl.tgz, the parallel (MPI) version of the LINPACK benchmark, from (see Related topics for a link).

Step 2

Download blas_linux.tgz, pre-compiled BLAS (Basic Linear Algebra Subprograms), also from For simplicity, you can use a pre-compiled reference implementation of BLAS for Linux, but for better results, you should use BLAS provided by your hardware vendor or auto-tune it using the open source ATLAS project .

Step 3

mkdir /home/linpack; cd /home/linpack (we'll install everything in /home).

Step 4

Uncompress and expand blas_linux.tgz. You should get an archive file called blas_linux.a. If you can see the file, ignore any errors you may get.

Step 5

Uncompress and expand hpl.tgz. You'll get a directory called hpl.

Step 6

Copy any of the configuration files, such as the Make.Linux_PII_FBLAS file, from hpl/setup to the hpl directory and rename the copy in the hpl directory Make.LinuxGeneric.

Step 7

Edit the file Make.LinuxGeneric as follows, changing the values to suit your environment:

TOPdir = /home/linpack/hpl
MPdir = /tmp/mpich-1.2.6
LAlib = /home/linpack/blas_linux.a
CC = /tmp/mpich-1.2.6/bin/mpicc
LINKER = /tmp/mpich-1.2.6/bin/mpif77

These five places specify the top-level directory of LINPACK from step 1, the top-level directory of your MPICH installation, and the location of the reference BLAS archive from step 2.

Step 8

Compile HPL now:

make arch=LinuxGeneric

If there are no errors, you will get two files, xhpl and HPL.dat, in /home/linpack/hpl/bin/LinuxGeneric.

Step 9

Before you run the LINPACK benchmark on your cluster, copy the entire /home/linpack to all the machines in your cluster. (If you created the cluster using OSCAR and have NFS shares configured, skip this step.)

Now cd to the directory with the executable you created in step 8 and run some tests (such as /tmp/mpich-1.2.6/bin/mpicc -np 4 xhpl). You should see some tests being executed with the results presented as GFLOPs.

Note: The above will run some standard tests based on default settings for matrix sizes. Use the file HPL.dat to tune your tests. Detailed information on tuning is in the file /home/linpack/hpl/TUNING.

The IBM Blue Gene/L

Now that you've built our own cluster, let's take a quick look at Blue Gene/L, the state-of-the-art in cluster-based supercomputers. Blue Gene/L is an engineering feat of massive proportions, and a discussion that does justice to it would be clearly outside the scope of this article. We'll just touch the surface here.

Three years ago when the Earth Simulator, a vector processor, became the fastest supercomputer, many people predicted the demise of the cluster as a supercomputing platform and the resurrection of the vector processor; this turned out to be premature. Blue Gene/L beat the Earth Simulator decisively on the standard LINPACK benchmark.

Blue Gene/L is not built from commodity workstations, but it does use the standard clustering concepts discussed in this series. It uses an MPI implementation derived from MPICH that we used in our cluster. It also runs standard 32-bit PowerPC® CPUs (although these are based on system-on-a-chip or SoC technology) at 700MHz, which leads to dramatically lower cooling and power requirements than any machine in its class.

Blue Gene/L is a big machine and has 65,536 compute nodes (each of which runs a custom operating system) and 1,024 dedicated I/O nodes (running the Linux kernel). With such a large number of nodes, networking takes special importance and Blue Gene/L uses five different types of networks, each optimized for a particular purpose.

A wealth of information on Blue Gene/L is available in a recent IBM Systems Journal devoted entirely to it. In Blue Gene/L, we see a realization of a highly scalable MPP and clear proof that the cluster-based computing paradigm is here to stay.

Downloadable resources

Related topics

  • Read the first article in this two-part series, "Clustering fundamentals" (developerWorks, September 2005).
  • Parallel Virtual Machine (PVM)
    • PVM is a software package that permits a heterogeneous collection of UNIX® and/or Windows® computers hooked together by a network to be used as a single large parallel computer.
    • This downloadable PostScript article explains PVM's advantages over MPI.
  • Message Passing Interface (MPI)
    • The MPI Forum is an open group with representatives from many organizations that define and maintain the MPI standard.
    • MPICH is a freely available, portable implementation of MPI.
    • LAM/MPI is a high-quality open-source implementation of the MPI specification intended for production as well as research use.
    • This MPI installation guide can get your started.
  • OSCAR (Open Source Cluster Application Resources) is a snapshot of the best known methods for building, programming, and using HPC clusters.
  • Rocks Cluster is an open source Linux cluster implementation.
  • Clumon is a cluster-monitoring system developed at the National Center for Supercomputing Applications to keep track of its Linux super-clusters.
  • Benchmarking
    • And here is an excellent paper by Jack Dongarra on the HPC Challenge benchmark and a thorough introduction to it.
    • The NAS Parallel Benchmarks (NPB) are a small set of programs designed to help evaluate the performance of parallel supercomputers.
    • HPL is a portable implementation of the high-performance LINPACK benchmark for distributed-memory computers.
    • BLAS, the Basic Linear Algebra Subprograms, are routines that provide standard building blocks for performing basic vector and matrix operations.
    • The ATLAS (Automatically Tuned Linear Algebra Software) project is an ongoing research effort focusing on applying empirical techniques in order to provide portable performance. At present, it provides C and Fortran77 interfaces to a portably efficient BLAS implementation.
  • This IBM Journal of Research and Development highlights BlueGene/L.
  • Build your own
    • Building a Linux HPC Cluster with xCAT (Redbook, September 2002) guides system architects and engineers toward a basic understanding of cluster technology, terminology, and the installation of a Linux High-Performance Computing (HPC) cluster.
  • Download benchmarks
  • Build your next development project on Linux with IBM trial software, available for download directly from developerWorks.
ArticleTitle=High performance Linux clustering, Part 2: Build a working cluster