Linux clustering is popular in many industries these days. With the advent of clustering technology and the growing acceptance of open source software, supercomputers can now be created for a fraction of the cost of traditional high-performance machines.
This two-part article introduces the concepts of High Performance Computing (HPC) with Linux cluster technology and shows you how to set up clusters and write parallel programs. This part introduces the different types of clusters, uses of clusters, some fundamentals of HPC, the role of Linux, and the reasons for the growth of clustering technology. Part 2 covers parallel algorithms, how to write parallel programs, how to set up clusters, and cluster benchmarking.
Most HPC systems use the concept of parallelism. Many software platforms are oriented toward HPC, but first let's look at the hardware aspects.
HPC hardware falls into three categories:
- Symmetric multiprocessors (SMP)
- Vector processors
- Clusters
SMP is a type of HPC architecture in which multiple processors share the same memory. (In clusters, also known as massively parallel processors (MPPs), the processors do not share memory; we will look at this in more detail shortly.) SMPs are generally more expensive and less scalable than MPPs.
In vector processors, the CPU is optimized to perform well with arrays or vectors; hence the name. Vector processor systems deliver high performance and were the dominant HPC architecture in the 1980s and early 1990s, but clusters have become far more popular in recent years.
Clusters are the predominant type of HPC hardware these days; a cluster is a set of MPPs. A processor in a cluster is commonly referred to as a node; it has its own CPU, memory, operating system, and I/O subsystem, and it can communicate with other nodes. These days it is common to use a commodity workstation running Linux and other open source software as a node in a cluster.
You'll see how these types of HPC hardware differ, but let's start with clustering.
The term "cluster" can take different meanings in different contexts. This article focuses on three types of clusters:
- Fail-over clusters
- Load-balancing clusters
- High-performance clusters
The simplest fail-over cluster has two nodes: one stays active and the other stays on stand-by but constantly monitors the active one. In case the active node goes down, the stand-by node takes over, allowing a mission-critical system to continue functioning.
Load-balancing clusters are commonly used for busy Web sites where several nodes host the same site, and each new request for a Web page is dynamically routed to a node with a lower load.
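The "route each request to a node with a lower load" idea can be sketched in a few lines. This is a hypothetical, minimal dispatcher that tracks active request counts per node and always picks the least-loaded one; the node names and class are illustrative, not part of any real load balancer's API.

```python
# Minimal sketch of least-load routing, assuming each node reports an
# active request count back to the dispatcher. Names are illustrative.

class Dispatcher:
    def __init__(self, node_names):
        # active request count per node
        self.load = {name: 0 for name in node_names}

    def route(self):
        # pick the node currently handling the fewest requests
        node = min(self.load, key=self.load.get)
        self.load[node] += 1
        return node

    def finish(self, node):
        # called when a node completes a request
        self.load[node] -= 1

d = Dispatcher(["node1", "node2", "node3"])
first = d.route()   # all nodes idle: ties break toward node1
second = d.route()  # node1 now busy, so node2 is chosen
```

Real load balancers use richer load metrics (CPU, connections, response time), but the dispatch decision has this same shape.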
High-performance clusters are used to run parallel programs for time-intensive computations and are of special interest to the scientific community. They commonly run simulations and other CPU-intensive programs that would take an inordinate amount of time on ordinary hardware.
Figure 1 illustrates a basic cluster. Part 2 of this series shows you how to create such a cluster and write programs for it.
Figure 1. The basic cluster
Grid computing is a broad term that typically refers to a service-oriented architecture (SOA) with collaboration among loosely coupled systems. Cluster-based HPC is a special case of grid computing in which the nodes are tightly coupled. A successful, well-known project in grid computing is SETI@home, the Search for Extraterrestrial Intelligence program, which used the idle CPU cycles of a million home PCs via screen savers to analyze radio telescope data. A similar successful project is the Folding@Home project for protein-folding calculations.
Almost every industry needs fast processing power. With the increasing availability of cheaper and faster computers, more companies are interested in reaping the technological benefits. There is no upper bound on the need for processing power; even with the rapid increase in available power, demand continues to outstrip supply.
Protein molecules are long flexible chains that can take on a virtually infinite number of 3D shapes. In nature, when put in a solvent, they quickly "fold" to their native states. Incorrect folding is believed to lead to a variety of diseases like Alzheimer's; therefore, the study of protein folding is of fundamental importance.
One way scientists try to understand protein folding is by simulating it on computers. In nature, folding occurs quickly (in about one millionth of a second), but it is so complex that its simulation on a regular computer could take decades. This focus area is a small one in an industry with many more such areas, but it needs serious computational power.
Other focus areas from this industry include pharmaceutical modeling, virtual surgery training, condition and diagnosis visualizations, total-record medical databases, and the Human Genome Project.
Seismograms convey detailed information about the characteristics of the interior of the Earth and seafloors, and an analysis of the data helps in detecting oil and other resources. Terabytes of data are needed to reconstruct even small areas; this analysis obviously needs a lot of computational power. The demand for computational power in this field is so great that supercomputing resources are often leased for this work.
Other geological efforts require similar computing power, such as designing systems to predict earthquakes and designing multispectral satellite imaging systems for security work.
Manipulating high-resolution interactive graphics in engineering, such as in aircraft engine design, has always been a challenge in terms of performance and scalability because of the sheer volume of data involved. Cluster-based techniques have been helpful in this area where the task to paint the display screen is split among various nodes of the cluster, which use their graphics hardware to do the rendering for their part of the screen and transmit the pixel information to a master node that assembles the consolidated image for the screen.
These industry examples are only the tip of the iceberg; many more applications including astrophysical simulation, weather simulation, engineering design, financial modeling, portfolio simulation, and special effects in movies demand extensive computational resources. The demand for increasingly more power is here to stay.
Before cluster-based computing, the typical supercomputer was a vector processor that could cost over a million dollars because of its specialized hardware and software.
With Linux and other freely available open source software components for clustering and improvements in commodity hardware, the situation now is quite different. You can build powerful clusters with a very small budget and keep adding extra nodes based on need.
The GNU/Linux operating system (Linux) has spurred the adoption of clusters on a large scale. Linux runs on a wide variety of hardware, and high-quality compilers and other software like parallel filesystems and MPI implementations are freely available for Linux. Also with Linux, users have the ability to customize the kernel for their workload. Linux is a recognized favorite platform for building HPC clusters.
To understand HPC hardware, it is useful to contrast the vector computing paradigm with the cluster paradigm. These are competing technologies (Earth Simulator, a vector supercomputer, is still among the top 10 fastest supercomputers).
Fundamentally, both vector and scalar processors execute instructions based on a clock; what sets them apart is the ability of vector processors to parallelize computations involving vectors (such as matrix multiplication), which are commonly used in High Performance Computing. To illustrate this, suppose you have two double arrays a and b and want to create a third array x such that x[i]=a[i]+b[i].
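The elementwise sum from the text can be written as a plain scalar loop; this is exactly the pattern a vector processor pipelines in hardware, one floating-point addition per array element:

```python
# The elementwise sum x[i] = a[i] + b[i] as a scalar loop. A scalar
# processor runs each addition to completion before starting the next;
# a vector processor overlaps the internal stages of successive adds.

a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.5, 0.5, 0.5]

x = [0.0] * len(a)
for i in range(len(a)):      # one floating-point add per element
    x[i] = a[i] + b[i]
```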
Any floating-point operation like addition or multiplication is achieved in a few discrete steps:
- Exponents are adjusted
- Significands are added
- The result is checked for rounding errors and so on
A vector processor parallelizes these internal steps by using a pipelining technique. Suppose there are six steps (as in IEEE arithmetic hardware) in a floating-point addition as shown in Figure 2.
Figure 2. A six-step pipeline with IEEE arithmetic hardware
A vector processor does these six steps in parallel -- if the ith array elements being added are in their 4th stage, the vector processor will execute the 3rd stage for the (i+1)th element, 2nd stage for the (i+2)th, and so on. As you can see, for a six-step floating-point addition, the speedup factor will be very close to six times (at start and end, not all six stages are active) as all stages are active at any given instant (shown in red in Figure 2). A big benefit is that parallelization is happening behind the scenes, and you need not explicitly code this in your programs.
For the most part, all six steps can be executed in parallel leading to an almost six-fold improvement in performance. The arrows indicate the operations on the ith array index.
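The pipeline argument above can be made quantitative with a back-of-the-envelope timing model (a sketch, not a real performance model): without pipelining, n additions take n × s stage-times; with pipelining, the first result appears after s stage-times and one more completes every stage-time thereafter, for a total of s + (n - 1).

```python
# Toy timing model for an s-stage floating-point pipeline.

def pipeline_speedup(n, s=6):
    unpipelined = n * s        # each add runs all s stages serially
    pipelined = s + (n - 1)    # fill the pipe once, then 1 result/cycle
    return unpipelined / pipelined

# Speedup is modest for short arrays and approaches the stage count s
# for long ones -- the "very close to six times" figure in the text:
print(round(pipeline_speedup(10), 2))      # 4.0
print(round(pipeline_speedup(100000), 2))  # 6.0
```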
Compared to vector processing, cluster-based computing takes a fundamentally different approach. Instead of using specially optimized vector hardware, it uses standard scalar processors, but in large numbers to perform several computations in parallel.
Some features of clusters are as follows:
- Clusters are built using commodity hardware and cost a fraction of the vector processors. In many cases, the price is lower by more than an order of magnitude.
- Clusters use a message-passing paradigm for communication, and programs have to be explicitly coded to make use of distributed hardware.
- With clusters, you can add more nodes to the cluster based on need.
- Open source software components and Linux lead to lower software costs.
- Clusters have a much lower maintenance cost (they take up less space, take less power, and need less cooling).
Software and hardware go hand in hand when it comes to achieving high performance on a cluster. Programs must be written to explicitly take advantage of the underlying hardware, and existing non-parallel programs must be re-written if they are to perform well on a cluster.
A parallel program does many things at once. Just how many depends on the problem at hand. Suppose 1/N of the total time taken by a program is spent in a part that cannot be parallelized, and the rest (1-1/N) is spent in the parallelizable part (see Figure 3).
Figure 3. Illustrating Amdahl's Law
In theory you could apply an infinite amount of hardware to do the parallel part in zero time, but the sequential part will see no improvements. As a result, the best you can achieve is to execute the program in 1/N of the original time, but no faster. In parallel programming, this fact is commonly referred to as Amdahl's Law.
Amdahl's Law governs the speedup of using parallel processors on a problem versus using only one serial processor. Speedup is defined as the time it takes a program to execute in serial (with one processor) divided by the time it takes to execute in parallel (with many processors):
S = T(1) / T(j)
Where T(j) is the time it takes to execute the program when using j processors.
In Figure 3, with enough nodes for parallel processing, T'par can be made close to zero, but Tseq does not change. At best, the speedup of the parallel version over the original cannot exceed 1 + Tpar/Tseq.
The real hard work in writing a parallel program is to make N as large as possible. There is an interesting twist to this: you normally attempt bigger problems on more powerful computers, and the proportion of time spent in the sequential parts of the code usually decreases with increasing problem size (because you tend to modify the program and enlarge the parallelizable portion to make use of the available resources). Therefore, the value of N becomes large automatically. (See the re-evaluation of Amdahl's Law in the Resources section later in this article.)
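Amdahl's Law is easy to compute directly. This sketch uses the article's convention that the sequential fraction is 1/N and plugs it into the speedup formula S(j) = 1 / (1/N + (1 - 1/N)/j):

```python
# Amdahl's Law: speedup with j processors when 1/N of the runtime is
# sequential. T(j) = T(1) * (seq + par/j), so S = T(1)/T(j) is:

def amdahl_speedup(n, j):
    seq = 1.0 / n            # non-parallelizable fraction of the time
    par = 1.0 - seq          # parallelizable fraction
    return 1.0 / (seq + par / j)

# With a 10% sequential part (N = 10), ten processors give only about
# a 5.3x speedup, and even a million processors cannot beat 10x:
print(round(amdahl_speedup(10, 10), 2))         # 5.26
print(round(amdahl_speedup(10, 1_000_000), 2))  # 10.0
```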
Let's look at two parallel programming approaches: the distributed memory approach and the shared memory approach.
It is useful to think of a master-slave model here:
- The master node divides the work between several slave nodes.
- Slave nodes work on their respective tasks.
- Slave nodes intercommunicate among themselves if they have to.
- Slave nodes return their results to the master.
- The master node assembles the results, further distributes work, and so on.
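The steps above can be sketched with Python's multiprocessing module standing in for a real cluster (a toy model, not an MPI program): the master splits an array-summing job into chunks, worker processes play the slave nodes and sum their chunks, and the master assembles the partial results.

```python
# Master-slave sketch: separate processes simulate the slave nodes.
# The chunking, worker function, and problem are all illustrative.

from multiprocessing import Pool

def slave_task(chunk):
    # each slave works on its own piece of the data
    return sum(chunk)

def master(data, n_slaves=4):
    # master divides the work into one chunk per slave
    size = len(data) // n_slaves
    chunks = [data[i * size:(i + 1) * size] for i in range(n_slaves)]
    with Pool(n_slaves) as pool:
        partials = pool.map(slave_task, chunks)  # distribute the work
    return sum(partials)                         # assemble the results

if __name__ == "__main__":
    print(master(list(range(1000))))  # 499500
```

In a real cluster the chunks would travel over the network via a message-passing library such as MPI rather than between local processes, but the division of labor is the same.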
Obvious practical problems in this approach stem from the distributed-memory organization. Because each node has access to only its own memory, data structures must be duplicated and sent over the network if other nodes want to access them, leading to network overhead. Keep these shortcomings and the master-slave model in mind in order to write effective distributed-memory programs.
In the shared-memory approach, memory is common to all processors (such as SMP). This approach does not suffer from the problems mentioned in the distributed-memory approach. Also, programming for such systems is easier since all the data is available to all processors and is not much different from sequential programming. The big issue with these systems is scalability: it is not easy to add extra processors.
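The contrast with the distributed-memory approach can be seen in a small sketch using Python threads over one shared array (illustrative only: real shared-memory HPC codes use OpenMP or similar): no data is copied or sent anywhere, every worker reads the same array directly, and only updates to shared state need coordination.

```python
# Shared-memory sketch: all workers see the same array, so there is
# no duplication or network transfer; a lock guards the shared total.

import threading

data = list(range(100))   # shared by all threads -- nothing is copied
total = 0
lock = threading.Lock()

def worker(lo, hi):
    global total
    partial = sum(data[lo:hi])   # reads the shared array in place
    with lock:                   # serialize updates to shared state
        total += partial

threads = [threading.Thread(target=worker, args=(i * 25, (i + 1) * 25))
           for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(total)  # 4950
```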
Parallel programming (like all programming) is as much art as science, always leaving room for major design improvements and performance enhancements. Parallel programming has its own special place in computing; Part 2 of this series examines parallel programming platforms and examples.
Some applications frequently need to read and write large amounts of data to the disk, which is often the slowest step in a computation. Faster hard drives help, but there are times when they are not enough.
The problem becomes especially pronounced if a physical disk partition is shared between all nodes (using NFS, for example) as is popular in Linux clusters. This is where parallel filesystems come in handy.
Parallel filesystems spread the data in a file over several disks attached to multiple nodes of the cluster, known as I/O nodes. When a program tries to read a file, small portions of that file are read from several disks in parallel. This reduces the load on any given disk controller and allows it to handle more requests. (PVFS is a good example of an open source parallel filesystem; disk performance of better than 1 GB/sec has been achieved on Linux clusters using standard IDE hard disks.)
PVFS is available as a Linux kernel module and can also be built into the Linux kernel. The underlying concept is simple (see Figure 4):
- A meta-data server stores information on where different parts of the file are located.
- Multiple I/O nodes store pieces of the file (any regular underlying filesystem like ext3 can be used by PVFS).
Figure 4. How PVFS works
When a compute node in a cluster wants to access a file in this parallel filesystem, it goes through the following steps:
- It requests a file as usual, and the request goes to the underlying PVFS filesystem.
- PVFS sends a request to the meta-data server (steps 1 and 2 in Figure 4), which informs the requesting node about the location of the file among the various I/O nodes.
- Using this information, the compute node communicates directly with all the relevant I/O nodes to get all the pieces of the file (step 3).
All this is transparent to the calling application; the underlying complexity of making requests to all the I/O nodes, ordering of the file's contents, etc., are taken care of by PVFS.
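The striping idea behind this can be modeled in a few lines. This is a toy simulation, not PVFS code: a file's bytes are dealt out round-robin in fixed-size stripes across several in-memory "I/O nodes," and a read reassembles them in order; the stripe size and node count are illustrative.

```python
# Toy model of file striping: write deals stripes round-robin across
# I/O nodes; read collects them back in order.

STRIPE = 4  # bytes per stripe (real stripe sizes are tens of KB)

def write_striped(data, n_nodes):
    nodes = [bytearray() for _ in range(n_nodes)]
    for i in range(0, len(data), STRIPE):
        # stripe k goes to node k mod n_nodes
        nodes[(i // STRIPE) % n_nodes] += data[i:i + STRIPE]
    return nodes

def read_striped(nodes, length):
    out = bytearray()
    offset = [0] * len(nodes)   # read position within each node
    node = 0
    while len(out) < length:
        out += nodes[node][offset[node]:offset[node] + STRIPE]
        offset[node] += STRIPE
        node = (node + 1) % len(nodes)
    return bytes(out)

data = b"The quick brown fox jumps over the lazy dog"
nodes = write_striped(data, 3)
assert read_striped(nodes, len(data)) == data
```

In the real system, of course, each stripe read is a request to a different physical disk, which is what lets the transfers proceed in parallel.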
A good thing about PVFS is that it allows binaries meant for regular filesystems to run on it without modification -- this is somewhat of an exception in the parallel programming arena. (Some other parallel filesystems are mentioned in Resources.)
- Build your own cluster:
- Building a Linux HPC Cluster with xCAT (IBM Redbook, September 2002) guides system architects and engineers toward a basic understanding of cluster technology, terminology, and the installation of a Linux High Performance Computing (HPC) cluster.
- Linux HPC Cluster Installation (IBM Redbook, June 2001) focuses on xCAT (xCluster Administration Tools) for installation and administration.
- "Building an HPC Cluster with Linux, pSeries, Myrinet, and Various Open Source Software" (IBM Redpaper, July 2004) explains how to bring a collection of pSeries nodes and a Myrinet interconnect to a state where a true HPC production workload can be run in a Linux clustered environment.
- "Build a heterogeneous cluster with coLinux and openMosix" (developerWorks, February 2005) demonstrates how to build a mixed or hybrid HPC Linux cluster.
- Linux Clustering with CSM and GPFS (IBM Redbook, January 2004) focuses on the Cluster System Management and the General Parallel Filesystem for Linux clusters.
Visit the SETI Project, and learn about SETI@home.
For more on protein folding, visit the Folding@Home Project, and get an introduction to protein folding.
Learn about IBM's Deep Computing Capacity on Demand, computing centers designed to handle compute-intensive tasks for petroleum and other industries.
John L. Gustafson explains a re-evaluation of Amdahl's Law.
Learn how to achieve a gigabyte-per-second I/O throughput with commodity IDE disks using the PVFS.
The five-part series, "Build a digital animation system" (developerWorks, July 2005), showcases how to set up an HPC cluster you can use to run a professional animation system.
Find more resources for Linux developers in the developerWorks Linux zone.
To learn more about Grid computing, read New to Grid computing and check out the Grid zone on developerWorks.
Get products and technologies
Download the Parallel Virtual Filesystem.
Download xCAT on IBM alphaWorks, a tool kit for deploying and administering Linux clusters.
Order the SEK for Linux, a two-DVD set containing the latest IBM trial software for Linux from DB2®, Lotus®, Rational®, Tivoli®, and WebSphere®.
Build your next development project on Linux with IBM trial software, available for download directly from developerWorks.
Get involved in the developerWorks community by participating in developerWorks blogs.
Aditya Narayan holds a BS/MS in Physics from Indian Institute of Technology, Kanpur, and founded QCD Microsystems. He has expertise in Windows and Linux internals, WebSphere, and enterprise platforms like J2EE and .NET. Aditya spends most of his time in New York. You can reach him at firstname.lastname@example.org.