Clustering, the practice of networking many compute nodes together so that they act as a single computing resource, has been popular on Intel-based systems running Linux for years. With the announcement of the IBM eServer OpenPower 710, which is available with one or two 1.65 GHz 64-bit POWER5 processors, a Linux on POWER-based cluster becomes an ideal solution. This article first introduces some general clustering concepts and software, and then shows how the OpenPower 710 can be used in the two most common types of clusters: high performance clusters and high availability clusters.
A cluster can consist of any number of components that work together to perform a certain task. The following quick overview of these components will help you understand the example clusters later in this article.
Cluster management involves "keeping tabs" on each node in a cluster: one or more management nodes monitor the other nodes, such as compute or storage nodes. The management node is also responsible for performing software updates on the other nodes. Many tools are available for cluster management, ranging from minimal tools such as Webmin to larger packages such as IBM Cluster Systems Management (CSM). For more information on these tools, follow the links under "Cluster management" in the Resources section.
Monitoring is a sub-component of cluster management that tracks the status of the nodes in a cluster. Heartbeat is an application that monitors cluster nodes; to reduce points of failure, it should run over its own private network or serial link. When Heartbeat detects that a node has gone down, it takes a configured action, such as moving services to another node, to keep the cluster stable. For more information about Heartbeat, refer to the link under "Monitoring" in Resources.
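Heartbeat's internals are not covered here, but the core idea of declaring a node down once its heartbeats stop arriving can be sketched in a few lines of Python. The `HeartbeatMonitor` class and its `deadtime` parameter are hypothetical names chosen for illustration, not Heartbeat's code; times are plain numbers (seconds) passed in by the caller so the logic is easy to follow and test.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical failure detector: a node is considered dead when no
    heartbeat has arrived from it for longer than `deadtime` seconds."""

    deadtime: float
    last_seen: dict = field(default_factory=dict)  # node name -> time of last heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        # Record the time of the most recent heartbeat received from `node`.
        self.last_seen[node] = now

    def dead_nodes(self, now: float) -> list:
        # Return, sorted by name, the nodes whose last heartbeat is older than the dead time.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.deadtime
        )


if __name__ == "__main__":
    mon = HeartbeatMonitor(deadtime=10.0)
    mon.heartbeat("node1", now=0.0)
    mon.heartbeat("node2", now=0.0)
    mon.heartbeat("node1", now=8.0)   # node2 stays silent after t=0
    print(mon.dead_nodes(now=12.0))   # ['node2']
```

In a real deployment the "configured action" would then be triggered for each node that `dead_nodes` reports.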
Linux offers many different file systems, but a distributed file system generally performs best in a clustered environment. Two reasons are that all nodes in the cluster can access the data concurrently, and that some distributed file systems have high availability built in. Common distributed file systems include IBM General Parallel File System (GPFS), InterMezzo, and OpenAFS. For more information on these, refer to the links under "File systems" in Resources.
Volume management is the ability to manage storage dynamically without disrupting a cluster's storage system. For example, when a storage server has filled up with data and more capacity is needed, volume management lets you add storage to the existing file system without taking the storage server offline. Two such facilities for Linux are the Logical Volume Manager (LVM) and the Enterprise Volume Management System (EVMS). For more information on these facilities, follow the links under "Volume management" in the Resources section.
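LVM can add storage without disturbing existing data because it pools disks into fixed-size extents. The toy model below, a hypothetical `VolumeGroup` class that is not part of LVM, illustrates the idea: a new physical volume only adds free extents to the group, and a logical volume grows by claiming some of them. The 4 MB extent size matches the LVM2 default; everything else is simplified for illustration.

```python
class VolumeGroup:
    """Toy model of LVM concepts (hypothetical, not LVM's code)."""

    def __init__(self, extent_mb: int = 4):
        self.extent_mb = extent_mb
        self.free_extents = 0
        self.lv_extents = {}  # logical volume name -> number of allocated extents

    def add_physical_volume(self, size_mb: int) -> None:
        # A new disk only adds free extents; existing logical volumes are untouched.
        self.free_extents += size_mb // self.extent_mb

    def create_lv(self, name: str, size_mb: int) -> None:
        self.lv_extents[name] = 0
        self.extend_lv(name, size_mb)

    def extend_lv(self, name: str, size_mb: int) -> None:
        needed = -(-size_mb // self.extent_mb)  # round up to whole extents
        if needed > self.free_extents:
            raise ValueError("not enough free extents in the volume group")
        self.free_extents -= needed
        self.lv_extents[name] += needed

    def lv_size_mb(self, name: str) -> int:
        return self.lv_extents[name] * self.extent_mb


if __name__ == "__main__":
    vg = VolumeGroup()
    vg.add_physical_volume(1024)      # first disk: 256 extents
    vg.create_lv("web", 1000)         # uses 250 of them
    vg.add_physical_volume(1024)      # disk added later, no data moved
    vg.extend_lv("web", 100)
    print(vg.lv_size_mb("web"))
```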
Job management handles resource management in a cluster. Job management software keeps track of the jobs submitted to the cluster and decides which nodes have the resources available to run them. Two available job management packages are Balance and Torque. For more information on Balance and Torque, refer to the links under "Job management" in Resources.
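The matching step, deciding which nodes have the resources a job needs, can be sketched as a simple filter. `Node`, `Job`, and `eligible_nodes` are hypothetical names for illustration; real job managers such as Torque track many more resource types and node states.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpus: int
    free_mem_mb: int


@dataclass
class Job:
    name: str
    cpus: int
    mem_mb: int


def eligible_nodes(job: Job, nodes: list) -> list:
    """Return the names of nodes that currently have enough free CPUs and
    free memory to run `job` (a hypothetical matching rule)."""
    return [n.name for n in nodes
            if n.free_cpus >= job.cpus and n.free_mem_mb >= job.mem_mb]


if __name__ == "__main__":
    nodes = [Node("c1", 2, 4096), Node("c2", 1, 8192), Node("c3", 2, 1024)]
    print(eligible_nodes(Job("linpack", cpus=2, mem_mb=2048), nodes))  # ['c1']
```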
Scheduling is a sub-component of job management that assigns jobs to resources in a cluster. Scheduling is a high priority in a high performance cluster, because complex computational problems demand that CPU utilization be maximized across all nodes. Condor and Maui are examples of schedulers available for Linux. For more information, see the links under "Schedulers" in the Resources section.
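Condor and Maui implement their own, more elaborate policies. As a simple stand-in, the sketch below uses the classic longest-processing-time-first (LPT) heuristic: it sorts jobs by estimated run time, longest first, and gives each one to the node that currently has the least work queued, which keeps CPU load balanced across nodes. `lpt_schedule` and the run-time estimates are assumptions for illustration.

```python
import heapq


def lpt_schedule(job_times: dict, nodes: list):
    """Assign jobs to nodes with the LPT heuristic.

    Returns (assignment, makespan), where assignment maps each node to its
    job list and makespan is the finishing time of the busiest node."""
    heap = [(0.0, node) for node in nodes]  # (queued work, node name)
    heapq.heapify(heap)
    assignment = {node: [] for node in nodes}
    for job, t in sorted(job_times.items(), key=lambda kv: kv[1], reverse=True):
        load, node = heapq.heappop(heap)     # least-loaded node so far
        assignment[node].append(job)
        heapq.heappush(heap, (load + t, node))
    makespan = max(load for load, _ in heap)
    return assignment, makespan


if __name__ == "__main__":
    jobs = {"a": 7, "b": 5, "c": 4, "d": 3, "e": 1}  # hypothetical run times
    assignment, makespan = lpt_schedule(jobs, ["n1", "n2"])
    print(assignment)  # {'n1': ['a', 'd'], 'n2': ['b', 'c', 'e']}
    print(makespan)    # 10.0
```

LPT is a greedy heuristic; it is not guaranteed to find the optimal assignment, but it is cheap and usually close.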
Message passing is the efficient transfer of data between nodes in a high performance cluster. The Message Passing Interface (MPI) is a library specification that describes how applications can be written to take advantage of parallel computing. MPICH is a freely available implementation of MPI that provides efficient parallel computing. For more information on MPI and MPICH, refer to the links under "Message passing" in Resources.
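To show the shape of a message-passing program without requiring an MPI installation, the sketch below imitates one with Python threads and queues: each "rank" has an inbox, rank 0 sends every worker a slice of the data, and the workers send back partial sums that rank 0 adds up. This is not MPI or MPICH; in a real MPI program the same pattern would use calls such as `MPI_Send`/`MPI_Recv` or the collectives `MPI_Scatter`/`MPI_Reduce`, with ranks running as separate processes on separate nodes. `run_ranks` is a hypothetical helper.

```python
import queue
import threading


def run_ranks(data: list, nprocs: int) -> int:
    """Mimic an MPI-style scatter / compute / reduce using threads and queues."""
    inboxes = [queue.Queue() for _ in range(nprocs)]  # one inbox per rank

    def send(dest: int, msg) -> None:
        inboxes[dest].put(msg)

    def recv(rank: int):
        return inboxes[rank].get()  # blocks until a message arrives

    def worker(rank: int) -> None:
        chunk = recv(rank)   # like receiving a slice from rank 0
        send(0, sum(chunk))  # like sending the partial result back to rank 0

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(1, nprocs)]
    for t in threads:
        t.start()

    # Rank 0: deal the data out in cyclic slices and keep slice 0 for itself.
    slices = [data[r::nprocs] for r in range(nprocs)]
    for r in range(1, nprocs):
        send(r, slices[r])
    total = sum(slices[0])
    for _ in range(1, nprocs):
        total += recv(0)  # collect one partial sum per worker
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(run_ranks(list(range(100)), nprocs=4))  # 4950
```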
High performance cluster example
The OpenPower 710, with its 64-bit POWER5 processor-based architecture and simultaneous multithreading (SMT) capabilities, is an ideal solution for high performance clustering. In this example, the Linpack benchmark is the high performance application that runs in the cluster. The cluster components we focus on are node management, message passing, and monitoring.
The high performance cluster in this example consists of five OpenPower 710 servers: one management node and four compute nodes. Instead of a production high performance application package, the Linpack benchmark is installed on this cluster to test its performance. The Linpack benchmark solves a large, dense system of linear equations. You will find a link to a parallel implementation of this benchmark in Resources.
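Linpack-style benchmarks time the solution of A x = b by LU factorization with partial pivoting. The serial sketch below shows that method on a tiny system; a parallel benchmark performs the same arithmetic on a much larger matrix whose blocks are distributed across the compute nodes. `solve_dense` is a hypothetical helper, not code from the benchmark.

```python
def solve_dense(a: list, b: list) -> list:
    """Solve the dense system A x = b by Gaussian elimination with partial
    pivoting (LU with row interchanges), followed by back substitution.
    Serial sketch: no blocking, no MPI."""
    n = len(a)
    # Augmented matrix [A | b], copied so the caller's data is unchanged.
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for k in range(n):
        # Partial pivoting: move the row with the largest |entry| in column k up.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        # Eliminate column k below the pivot.
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the resulting upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


if __name__ == "__main__":
    A = [[2, 1, -1], [-3, -1, 2], [-2, 1, 2]]
    b = [8, -11, -3]
    print(solve_dense(A, b))  # approximately [2, 3, -1]
```

The elimination loop does roughly 2/3 n^3 floating-point operations, which is why large problem sizes stress a cluster's floating-point performance.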
CSM software runs on the management node and provides a single point of control for node management. The management node communicates with the compute nodes over the "management" network, a private network running on eth1, as shown in Figure 1. Definitions for each of the compute nodes exist on the management node, and CSM lets you define the list of applications to be installed on each compute node, so the installation can be tailored to the type of high performance software you need. After the compute nodes are installed, the management node uses CSM functions to monitor them and to apply software updates.
Figure 1. High performance cluster
Parallel computing involves dividing a large amount of data among the compute nodes so that it can be processed faster. As shown in Figure 1, an MPI library, which implements the Message Passing Interface specification, passes the data between the nodes. The parallel version of the Linpack benchmark requires an MPI library; in this example we use the MPICH package. (See Resources for a link.)
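One simple way to spread data across compute nodes is a block distribution: each node gets one contiguous range of rows, and the range sizes differ by at most one. The hypothetical `block_partition` helper below computes those ranges. The widely used HPL implementation of parallel Linpack actually uses a two-dimensional block-cyclic layout, which keeps the work balanced as the factorization proceeds; this one-dimensional version only illustrates the idea.

```python
def block_partition(n: int, nodes: int) -> list:
    """Split indices 0..n-1 into `nodes` contiguous ranges whose sizes
    differ by at most one (the first n % nodes ranges get one extra)."""
    base, extra = divmod(n, nodes)
    ranges = []
    start = 0
    for r in range(nodes):
        size = base + (1 if r < extra else 0)
        ranges.append(range(start, start + size))
        start += size
    return ranges


if __name__ == "__main__":
    # 10 rows over 4 compute nodes
    print([list(r) for r in block_partition(10, 4)])
    # [[0, 1, 2], [3, 4, 5], [6, 7], [8, 9]]
```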
Communication between the compute nodes occurs on the "compute" network, which is a private network running on eth0, as shown in Figure 1. Monitoring is important in a high performance cluster since it is essential to know when a compute node has failed or when the desired performance is not being achieved. In this example, the Heartbeat application is installed and run over a serial link between the nodes in the cluster. We chose a serial link to reduce the points of failure.
You should now have a small high performance cluster that can run the parallel build of the Linpack benchmark. If a compute node crashes, the management node logs the failure, so you know that the results of the interrupted benchmark run are not valid and that the proper action is being taken to restore the failed node. Once the node is up and running again, you can rerun the benchmark and analyze the results.
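When you analyze the results, the usual figures of merit are the measured rate (Rmax) and how close it comes to the theoretical peak (Rpeak). The sketch below computes the rate from the problem size N and the run time using the operation count HPL uses, 2/3 N^3 + 3/2 N^2, and computes a peak under the assumption of four floating-point operations per cycle per POWER5 core (two floating-point units, each able to issue a fused multiply-add). The function names are hypothetical, and no measured results are implied.

```python
def hpl_gflops(n: int, seconds: float) -> float:
    """Measured rate in GFLOPS: (2/3 N^3 + 3/2 N^2) operations / run time."""
    flops = (2.0 / 3.0) * n ** 3 + 1.5 * n ** 2
    return flops / seconds / 1e9


def peak_gflops(cores: int, ghz: float, flops_per_cycle: int = 4) -> float:
    # Assumption: 4 floating-point operations per cycle per POWER5 core.
    return cores * ghz * flops_per_cycle


def efficiency(rmax: float, rpeak: float) -> float:
    # Fraction of theoretical peak actually achieved.
    return rmax / rpeak


if __name__ == "__main__":
    # Four two-way compute nodes at 1.65 GHz = 8 cores
    rpeak = peak_gflops(cores=8, ghz=1.65)
    print(f"Rpeak = {rpeak:.1f} GFLOPS")
```

Plug your own N and the time reported by the benchmark into `hpl_gflops`, then compare against `rpeak` with `efficiency`.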
High availability cluster example
A high availability (HA) cluster contains enough redundant resources to keep the system functioning if any single component fails. This example presents a simple HA cluster that consists of three nodes: a master node, a backup node, and a management node. The master and backup nodes are connected to an IBM DS4500 storage server. This configuration is typical of a small Web serving environment.
Because redundancy is the key feature of an HA cluster, the essential components in this example are node management, a distributed file system, volume management, and monitoring. The example shows how two OpenPower 710 servers provide fault tolerance: one acts as the master (primary) server and the other as the backup server. Monitoring software detects when the master server goes down.
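A quick calculation shows why a second server helps. Assuming the two servers fail independently and either one can carry the load, the service is down only when both are down, so its availability is 1 - (1 - a)^2 for a per-server availability a. The 99% figure below is a hypothetical input, and the model ignores failover time and shared components such as the storage server.

```python
HOURS_PER_YEAR = 24 * 365


def redundant_availability(a: float, copies: int = 2) -> float:
    """Availability of `copies` interchangeable servers, assuming independent
    failures and that one working server is enough: 1 - (1 - a) ** copies."""
    return 1.0 - (1.0 - a) ** copies


def downtime_hours_per_year(availability: float) -> float:
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    single = 0.99  # hypothetical availability of one server
    pair = redundant_availability(single)
    print(f"one server:  {downtime_hours_per_year(single):.1f} hours/year down")
    print(f"two servers: {downtime_hours_per_year(pair):.2f} hours/year down")
```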
As with the high performance cluster described above, CSM will be installed on the management node. The management node is connected to the master and backup nodes on a private management network running on eth1, as shown in Figure 2. The management node installs both the master and backup nodes and provides software updates as required.
Figure 2. High availability cluster
The Heartbeat package provides the software that monitors the nodes. To reduce the number of points of failure, the monitored nodes are linked by a serial connection rather than a network connection. Heartbeat is configured to switch over to the backup node if the master node goes down. This is a high-level view of how Heartbeat works; more information is available at the High-Availability Linux Project Web site.
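The switchover decision itself can be modeled as a tiny state machine: whichever node owns the service answers Web requests, and when the owner is declared dead the survivor takes the service over. The `FailoverPair` class and its `fail_back` option (whether the repaired master reclaims the service) are hypothetical illustrations of the idea, not Heartbeat's implementation or configuration.

```python
class FailoverPair:
    """Toy two-node failover logic (hypothetical, not Heartbeat's code)."""

    def __init__(self, master: str, backup: str, fail_back: bool = False):
        self.master = master
        self.backup = backup
        self.fail_back = fail_back
        self.owner = master  # the master serves at start-up

    def node_died(self, node: str) -> None:
        # Only the loss of the current owner triggers a takeover.
        if node == self.owner:
            self.owner = self.backup if node == self.master else self.master

    def node_returned(self, node: str) -> None:
        # Optionally hand the service back once the master is repaired.
        if node == self.master and self.fail_back:
            self.owner = self.master


if __name__ == "__main__":
    pair = FailoverPair("master", "backup")
    pair.node_died("master")
    print(pair.owner)  # backup
```

Leaving `fail_back` off avoids a second interruption of service when the master returns; turning it on restores the original layout automatically.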
GPFS is the distributed file system running on the DS4500 storage server. The storage server is connected to the master and backup nodes by a high-speed Myrinet or Fibre Channel connection, as shown in Figure 2. Together, GPFS and the DS4500 provide a redundant storage subsystem that protects against disk failure. GPFS also gives both the master and backup nodes concurrent access to all of the files in the cluster.
Volume management is handled by installing LVM on the cluster. LVM lets you dynamically add storage to, or remove storage from, the DS4500 without interrupting service to the running cluster. For instance, if the file system that holds your Web content fills up, LVM lets you add another disk to the file system to increase its capacity without affecting your existing data.
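As a concrete illustration of adding a disk, the usual LVM2 command sequence is: label the new disk as a physical volume (`pvcreate`), add it to the volume group (`vgextend`), grow the logical volume (`lvextend`), and then grow the file system on it. The Python helper below only builds and prints those commands unless you pass `dry_run=False`. The device, volume group, and logical volume names are hypothetical, and the last step assumes an ext2/ext3 file system grown with `resize2fs`; other file systems use their own tools.

```python
import shlex
import subprocess


def grow_commands(device: str, vg: str, lv: str, add_size: str) -> list:
    """Build the LVM2 command sequence for growing a logical volume onto a
    new disk. All names passed in are hypothetical examples."""
    lv_path = f"/dev/{vg}/{lv}"
    return [
        ["pvcreate", device],                         # label the new disk for LVM
        ["vgextend", vg, device],                     # add it to the volume group
        ["lvextend", "-L", f"+{add_size}", lv_path],  # grow the logical volume
        ["resize2fs", lv_path],                       # grow an ext2/ext3 file system to fill it
    ]


def run(commands: list, dry_run: bool = True) -> None:
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing command


if __name__ == "__main__":
    run(grow_commands("/dev/sdc", "vg_web", "lv_content", "100G"))
```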
You should now have all the components necessary to handle a failure of the master node. If Heartbeat notices that the master node is not available, the backup node becomes active with the proper network configuration and takes over the Web serving, so clients on the outside (routable) network do not notice the failure. Meanwhile, maintenance can be performed on the master node to determine why it went down.
The OpenPower 710, with its 64-bit POWER5 processor-based architecture, is an ideal solution for high availability and high performance clusters. In addition, much of the Linux clustering software that is currently available has been ported to the 64-bit POWER architecture, making this entry-level server an affordable 64-bit platform for a Linux on POWER-based cluster solution.
I would like to acknowledge Linda Kinnunen for her document template and helpful reviews and Brent Baude and Steve Dibbell for their technical reviews of this document.
Resources
- Cluster management:
- Monitoring:
- File systems:
- Volume management:
- Job management:
- Schedulers:
- Message passing:
- The Linpack benchmark is designed to solve a large number of dense linear equations.
- The Linux on POWER ISV Resource Center is IBM's site for independent software vendors interested in enabling and promoting their applications for Linux on POWER.
- Go to the Linux on POWER Architecture developer's corner for resources for application and system programmers, independent software vendors, and IBM Business Partners who are building or evaluating software for Linux on Power Architecture-based systems.
- Find more resources for Linux developers in the developerWorks Linux zone.
- Get involved in the developerWorks community by participating in developerWorks blogs.
- Browse for books on these and other technical topics.
John Engel is a Linux technical consultant for the IBM eServer Solutions Enablement organization at IBM. He is based in Rochester, MN. John's main role is to help solution developers bring their applications to Linux on POWER. While working at IBM, he has also held various positions in Linux software development.