 | Level: Introductory Andrew Blais (onlymice@gnosis.cx), Researcher and writer, Gnosis Software, Inc
01 Sep 2001 Andrew Blais introduces the concept of Beouwulf clusters, which extensively reduce the time to process software by using multiple CPUs executing program fragments in parallel under Linux or NT. He describes various implementations, the relative performance of the clusters, and the technology needed to make them effective.
As of the middle of 2001, the world's fastest computer can perform, on average, about five trillion floating point operations per second, or 5 teraflops. The machine that ranks 500th averages about 55 gigaflops. In general, such top-tier computing power is quite expensive and unavailable (see Resources later in this article). However, in 1994, Thomas Sterling and Don Becker established that there is a way to use common and affordable hardware plus Linux to bring together the computing power of clusters of relatively smaller machines. The result, called a Beowulf cluster, can inexpensively emulate the computing power of the bottom ranks of the top machines. The Beowulf strategy aims at minimizing computation time. A CPU operating under MS DOS can sequentially process only one program's instructions. We would like to reduce this baseline.
In the case of one CPU running under Linux or NT, it is possible
for multiple programs to share the CPU's resources, but all things being
equal -- like word size and disk access speed -- no program processes
any faster than it would under MS DOS.
Moreover, in the case of Symmetric Multi-Processing (SMP) where two or four CPUs are running under Linux or NT, each processing multiple programs, it still takes approximately the same amount of time to process any single program.
Now, one option for reducing the processing time of a program is to divide it into independent sub-tasks that can be processed by different CPUs. When the results of these sub-tasks are available, they can be returned to one of the processors for final processing. Clearly, this reduces the processing time for a given program. Multi-threaded programs running under Linux on a motherboard with two or four CPUs illustrate this fact. Clearly, a far better option is to divide programs into many independent sub-tasks that can be processed by multiple processors, and then to pipe their results to a single processor that assembles a final output. Even though motherboards generally have a maximum of two or four processors, it is possible, using Ethernet transfers, to extend this strategy. This is the Beowulf strategy: divide programs into many parts that are executed by many CPUs that run under as many copies of Linux, all of which transfer their data and instructions via Ethernet. For example, suppose that you wanted to find the prime numbers between 1 and 1,000,000. You could write code that sequentially tests each integer in this range. This code would process at the above-mentioned baseline. The Beowulf strategy begins by connecting, say, ten machines or nodes via Ethernet. It proceeds to write code that initially runs on a lead node, but then sends the task of testing 100,001-200,000 to subsidiary node 1, and sends the task of testing 200,001-300,000 to subsidiary node 2, and so on. Along with the testing task, there would also be an instruction to return whatever primes a subsidiary node discovered to the lead node. Of course, while the lead node waited for the returns, it would be testing 1-100,000. When all nodes have completed their tasks, the whole machine will have produced a list of primes between 1 and 1,000,000, and it will have done so in a fraction of the time of the sequential strategy (see Resources). Another more prosaic example is the PC that has a base CPU and lots of processors on things such as its modem (that is, if it isn't a winmodem) or its graphics or sound card. Wiglaf: proof of concept
In 1994, Thomas Sterling and Donald Becker built the first computer to employ the Beowulf strategy. Curiously, they didn't name their machine "Beowulf". They called it "Wiglaf" -- the mythic Beowulf's friend (see Resources). Wiglaf had 16 nodes, and each node supported a 100 MHz Intel DX4 processor (at first, these were 66 Mhz 486 chips), 16 MBytes of DRAM, 540 to 1 gigabyte drive, and a pair of 10 Mbps Ethernet cards. Every hardware component was a COTS -- Commodity Off The Shelf. At the end of the day, Wiglaf was capable of about 74 megaflops. Its price was less than $50,000. Wiglaf ran under Linux. There were good reasons for doing so. As noted above, in a Beowulf cluster, each node runs its own copy of the Linux OS. If each node required a license for its OS, the cost of the cluster would increase by the license cost times the number of nodes. The math is easy, when the cost is essentially zero. Moreover, Wiglaf's nodes communicated using Ethernet, but the 10 Mbps Ethernet formed a bottleneck. Becker found that if each node had two or three Ethernet connections, these connections could be made to behave as a single connection that eased the bottleneck (see Resources). This is called channel bonding. It was only possible because Becker could freely read the source of the Linux kernel, and thereby write a customized Ethernet driver. Wiglaf's hardware description seems comically quaint in a time when CPUs run at more that 1 GHz -- not to mention hundreds of megabytes of memory and tens of gigabytes of hard drive space. But Sterling and Becker were only out to establish that the Beowulf strategy could work. Wiglaf was a proof of concept, not a final product. In order to put practical flesh on the theoretical Beowulf bones, however, they tested their machine on two problems that had already been analyzed on other supercomputers. One concerned fluid dynamics, and the other N-body simulations. In the class of problems that it could handle, it did well. In relation to machines such as Intel's Paragon and the TMC CM-5, it stood its ground. The concept had been proved.
Loki: notoriety
Another notable use of the Beowulf strategy was the Loki Cluster housed at the Los Alamos National Laboratory (see Resources). It won the 1998 Gordon Bell Price/Performance Prize. This is remarkable, perhaps ironic even, given Mr. Bell's professional association with the Microsoft Bay Area Research Center and the fact that the Loki cluster consisted of 16 200 Mhz CPUs networked together with Fast Ethernet running under Linux. As mentioned above, most supercomputers have cost millions of dollars, but Loki's cost was approximately $63,000, and again, the cost went to hardware. At the time, Loki was running at 1.2 gigaflops.
Avalon: a contender
In 1998, at the Los Alamos National Laboratory, a Beowulf cluster was built using 140 533 MHz Alpha microprocessors (21164A). Each node had 256 MB of RAM and 3 GB of hard drive space. The nodes were connected by means of Fast Ethernet PCI cards. It ran on RedHat Linux 5.0. According to the Linpack benchmark, Avalon ran at 47.7 gigaflops.
It cost about $313,000, and in terms of price-performace ratio,
this placed the Avalon significantly above the 64 processor Origin 2000
from Silicon Graphics, which produced the same crop of gigaflops for a
cost of approximately $1.8 million.
Note: it placed at number 113 in the list of the world's 500 fastest computers.
Measuring performance
The only two scales on which the performance of a Beowulf cluster can be measured are quantity of floating point operations per second and cost. This is as shortsighted as thinking that the only measure of a processor is its speed measured in MHz. As there are other processor measures such as size of a chip cache, speed of memory bus, and word width, there are at least three other noteworthy measures of Beowulf performance. One is how well network throughput is optimized against the balance between packet size and collision rate. Another is rate of disk input-output. Lastly, there is simple code efficiency. One way of parceling programming tasks may be more or less efficient than another.
The marketplace
The Beowulf strategy finds implementation outside research facilities. For example, Boeing received a Beowulf cluster from Linux NetworX to design the Delta IV rocket, which is itself used to launch satellites. Moreover, IBM worked with the University of New Mexico on LosLobos, 512 Pentium III 733 MHz processors inside IBM Netfinity servers, transferring data with via 64 bit Myrinet, which has reached some 3.75 gigaflops. The National Computational Science Alliance has used LosLobos for research purposes, and IBM continues to market clustering technology based on what it learned there. It has been reported that the University of Sao Paulo, Brazil, built a 125 node Beowulf cluster that was capable of analyzing mammograms in some 1/120th time that it had formerly taken. Companies such as Paralogic and Scyld operate by providing customized clusters (see Resources).
Software you will want
The only piece of software explicitly mentioned thus far has been the Linux OS, but implicit in the concept that data and program instructions are transferred via Ethernet is a set of software protocols that support it. One package that you will undoubtedly want to procure and master is PVM, or Parallel Virtual Machine. PVM is a software package that makes it possible for a cluster of Linux or NT machines to perform as a single computer capable of parallel processing. CPUs need to be able to send and receive tasks and data. Message passing is one way to implement this. At least one CPU needs to be able to send the message that you are to find and send back the primes among these numbers, and at least one CPU needs to send be able to send the message that I found these primes. PVM is an interface with rsh (remote shell) or ssh (secure shell) that makes a group of machines seem to be one -- hence the tag: virtual. The source code for PVM is available, and it has been compiled on many kinds of machines -- there is even a RedHat RPM.
If you have two networked computers, you can try PVM by downloading Rahul U. Joshi's most instructive sample code (see Resources). An alternative to PVM is MPI, which is a library for writing programs that pass messages. MPI has a number of implementations. Some are commercial and others are free, and some offer source code, while others don't (see Resources).
Summary
The Beowulf strategy produces the Beowulf effect, that is, the use
of more processors to compute in less time what it would take fewer
processors to compute in more time. I have only touched the surface of the Beowulf phenomenon in this article. The resources below can help you fill out your understanding of Beowulf clusters and get the tools you need to get started.
Resources - An excellent starting point is
"Parallel Processing on Linux with PVM and MPI" by Rahul U. Joshi in the Linux Gazette. Download his sample code.
- Download the Cluster Starter Kit for Linux from IBM's alphaworks site.
- Learn more about IBM's cluster software.
- Take a tour of open- and closed-source clustering solutions available for Linux in the developerWorks article "Linux clustering cornucopia".
- See the IBM redbook, "Linux HPC Cluster Installation".
- Check out the list of the world's 500 fastest computers.
- See the Accelerated Strategic Computing Initiative (ASCI) for the world's fastest computer, and read the associated article, "IBM Builds World's Fastest Supercomputer
to Simulate Nuclear Testing for U.S. Energy Department".
- Like all things Linux, there is a
Beowulf HOWTO, a mailing list (to subscribe put the word subscribe in the message body) with an
archive, and a
Web page.
- Information about the Loki cluster can be found at Loki - Commodity Parallel Processing.
- For information on The Gordon Bell Price/Performance Prize, see the SC2000 Gordon Bell Awards.
- Visit the Avalon cluster Web page.
- Boeing's use of Beowulf is recounted in a CNet article, "Boeing buys Linux-AMD supercomputer".
- Visit the Linux NetworX Web site.
- Read Building a Beowulf System, which describes the work at Sao Paulo University by Thomas L. Sterling, John Salmon, Donald J. Becker, and Daniel F. Savarese. Also check out How to Build a Beowulf (The MIT Press, 1999).
- Visit the Scyld Web site.
- For an overview of PVM, see Richard A. Sevenich's "Parallel Processing using PVM" from the Linux Journal. Also see Manu Konchady's "Parallel Computing Using Linux" also from the Linux Journal.
- Read about PVM, Parallel Virtual Machine, and MPI, the Message Passing Interface.
- See the list of MPI implementations.
- Browse more Linux resources on developerWorks.
- Browse more Open source resources on developerWorks.
About the author  | |  |
Andrew Blais divides his time between home schooling his son, freelance writing, and teaching philosophy and religion at Anna Maria College. He can be reached at onlymice@gnosis.cx. |
Rate this page
|  |