Installing a large Linux cluster, Part 4

Node installation and GPFS cluster configuration

Large Linux cluster storage backend



This is the fourth and final part of a series that covers the installation and setup of a large Linux computer cluster. The purpose of the series is to bring together in one place up-to-date information from various sources in the public domain about the process required to create a working Linux cluster from many separate pieces of hardware and software. These articles are not intended to provide the basis for the complete design of a new large Linux cluster; refer to the relevant reference materials and IBM Redbooks® mentioned throughout for general architecture pointers.

This series is intended for systems architects and systems engineers planning and implementing a Linux cluster using the IBM eServer™ Cluster 1350 framework (see Related topics for more information about the framework). Some parts might also be relevant to cluster administrators for educational purposes and during normal cluster operation. Each part of this series refers to the same example installation.

Part 1 of the series provides detailed instructions for setting up the hardware for the cluster. Part 2 takes you through the next steps after hardware configuration: software installation using the IBM systems management software, Cluster Systems Management (CSM), and node installation.

Part 3 and this article describe the storage back-end of the cluster, covering the storage hardware configuration and the installation and configuration of the IBM shared file system, General Parallel File System (GPFS). Part 3 takes you through the architecture of the storage system, hardware preparation, and details about setting up a Storage Area Network. This, the fourth and final part of the series, provides details about CSM specifics related to the storage backend of the example cluster, notably performing node installation for the storage system, and GPFS cluster configuration.

Detailing node installation specifics

This section details the Cluster Systems Management (CSM) specifics related to the storage backend of the example cluster. These include the installation of the GPFS code on each node and the configuration of the Qlogic adapters for the storage nodes. Note that this configuration does not have to be performed using CSM; it can be done manually. The example in this article uses CSM to almost fully automate the installation of a new server, including a storage server.

Reviewing matters of architecture

Before reading this section, you can benefit from reviewing the General cluster architecture section in Part 1 of this series. You can also benefit from reading the section on storage architecture in Part 3. An understanding of the architecture will give you the context you need to make the best use of the information that follows.

Installing storage nodes in the correct order

Installing in the correct order is necessary to work around the ROM overflow issue described later in this article, because the xSeries™ 346 systems used in this configuration do not have RAID 7K cards. Complete the steps in the following order:

  1. Run the command csmsetupks -vxn <nodename> on the management server.
  2. Disconnect the storage server from the SAN to avoid installation of the operating system on SAN disks, which are discovered first.
  3. Run installnode -vn <nodename> on the management server.
  4. Press F1 from the console when the storage node reboots to enter the BIOS.
  5. Go into Start Options and change PXEboot from disabled to enabled for planar ethernet 1.
  6. Let the node reboot, and installation starts.
  7. Monitor installation from the terminal server, letting the node boot fully.
  8. Log into the node and monitor the CSM installation log, /var/log/csm/install.log.
  9. Reboot the node when the post reboot tasks have finished.
  10. Press F1 from the console when the node restarts to enter the BIOS.
  11. Go into Start Options and change PXEboot to disabled.
  12. Plug in the SAN cables and let the node boot fully.
  13. Configure the paths to disk using MSJ as explained under Configuring paths to disks and load balancing.

Providing passwordless root access between nodes

GPFS requires that all nodes in the GPFS cluster be able to access each other using the root ID with no password. GPFS uses this internode access to allow any node in the GPFS cluster to run relevant commands on other nodes. The example here uses secure shell (ssh) to provide access; however, you can also use remote shell (rsh). To set this up, create a cluster-wide key and associated configuration files, and distribute them with CSM following these steps:

  1. Create two new directories at /cfmroot/root/.ssh and /cfmroot/etc/ssh.
  2. Create an RSA key pair, public and private keys for authentication, by typing
    ssh-keygen -b 1024 -t rsa -f /cfmroot/etc/ssh/ssh_host_rsa_key -N "" -C "RSA_Key"
  3. Create a DSA key pair, public and private keys for authentication, by typing
    ssh-keygen -b 1024 -t dsa -f /cfmroot/etc/ssh/ssh_host_dsa_key -N "" -C "DSA_Key"
  4. Create an authorization file containing the public keys, as shown below. This is the file SSH uses to determine whether to prompt for a password.
    cat /root/.ssh/id_rsa.pub > /cfmroot/root/.ssh/authorized_keys2
    cat /root/.ssh/id_dsa.pub >> /cfmroot/root/.ssh/authorized_keys2
    cat /cfmroot/etc/ssh/ssh_host_rsa_key.pub >> /cfmroot/root/.ssh/authorized_keys2
    cat /cfmroot/etc/ssh/ssh_host_dsa_key.pub >> /cfmroot/root/.ssh/authorized_keys2
  5. Stop CSM from maintaining the known_hosts file, as shown below. This is a file containing names of hosts. If a host is listed in the file, SSH does not prompt the user for connection confirmation. CSM attempts to maintain this file, but in a fixed cluster environment with passwordless root access, this can be a hindrance.
    stopcondresp NodeFullInstallComplete SetupSSHAndRunCFM
    startcondresp NodeFullInstallComplete RunCFMToNode
    perl -pe 's!(.*update_known_hosts.*)!#$1!' -i /opt/csm/csmbin/RunCFMToNode
  6. Generate a system-wide known_hosts file. This is best done by creating a script, as shown below. Run the script and direct the output to /cfmroot/root/.ssh/known_hosts.
    RSA_PUB=$(cat "/cfmroot/etc/ssh/ssh_host_rsa_key.pub")
    DSA_PUB=$(cat "/cfmroot/etc/ssh/ssh_host_dsa_key.pub")
    for node in $(lsnodes); do
      ip=$(grep $node /etc/hosts | head -n 1 | awk '{print $1}')
      short=$(grep $node /etc/hosts | head -n 1 | awk '{print $3}')
      echo $ip,$node,$short $RSA_PUB
      echo $ip,$node,$short $DSA_PUB
    done

    This example script works for a single interface. You can modify it trivially to allow passwordless connection across multiple interfaces. The format of the known_hosts file is beyond the scope of this article, but it is useful to take advantage of the comma-separated host names for each line.
  7. Allow passwordless root access by linking in the generated keys, as shown below.
    cd /cfmroot/root/.ssh
    ln -s ../../etc/ssh/ssh_host_dsa_key id_dsa
    ln -s ../../etc/ssh/ssh_host_dsa_key.pub id_dsa.pub
    ln -s ../../etc/ssh/ssh_host_rsa_key id_rsa
    ln -s ../../etc/ssh/ssh_host_rsa_key.pub id_rsa.pub
  8. You might want to ensure this configuration is installed onto each system at installation time, before the operating system is rebooted for the first time. CSM makes no guarantee about the order in which things are run post-installation, so if any post-installation task relies on this configuration being present, it could possibly fail. It might also succeed and give the impression of an inconsistent failure. For example, you might have a GPFS post-installation script and you need to add a node into the GPFS cluster and mount any GPFS file systems. One way to achieve this would be to create a tar archive of all the files created here and unpack them using a CSM post-installation pre-reboot script.
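One minimal sketch of that tar approach follows. The directory names here are mktemp stand-ins so the sketch is runnable anywhere; on a real cluster the source would be /cfmroot and the unpack target would be the node's root file system.

```shell
#! /bin/sh
# Sketch: bundle the SSH configuration for distribution, then unpack it as a
# CSM pre-reboot post-installation script would.  CFMROOT and TARGET are
# scratch stand-ins for /cfmroot and the node's root file system.
CFMROOT=$(mktemp -d)
TARGET=$(mktemp -d)

# Stand-in for the files created in the steps above
mkdir -p $CFMROOT/root/.ssh $CFMROOT/etc/ssh
echo "dummy key material" > $CFMROOT/etc/ssh/ssh_host_rsa_key

# On the management server: archive everything under the cfmroot stand-in
tar -C $CFMROOT -czf $CFMROOT/ssh-config.tar.gz root etc

# On the node, before the first reboot: unpack relative to /
tar -C $TARGET -xzf $CFMROOT/ssh-config.tar.gz
ls $TARGET/etc/ssh
```

On a real cluster, the archive itself would be placed where the pre-reboot script can reach it during installation, for example under the CSM script data directory.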

Defining GPFS-related CSM groups

For this example, two main CSM groups are defined for use during the GPFS configuration, as shown below.

  • StorageNodes, which includes only those nodes that are attached directly to the SAN. For example: nodegrp -w "Hostname like 'stor%'" StorageNodes.
  • NonStorageNodes, which includes all other nodes that are part of the GPFS cluster. For example: nodegrp -w "Hostname not like 'stor%'" NonStorageNodes.

These groups are used during installation to ensure servers that perform storage node roles receive specific binary files and configuration files, which are detailed below. Note that this section does not cover the detailed process of installation as performed by CSM. See Part 1 and Part 2 of this series for instructions for this process.

To summarize, the installation process goes through the following stages:

  1. PXE boot/DHCP from installation server
  2. NFS installation from installation server
  3. Pre-reboot scripts
  4. Reboot
  5. Post-reboot scripts
  6. CFM file transfer
  7. CSM post installation configuration

The configuration changes in this article occur during the pre-reboot and CFM file transfer stages.
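For reference, the customizations in this article hook into those stages at the following locations (paths as used in the example cluster):

```
/csminstall/csm/scripts/installprereboot/   pre-reboot scripts (GPFS RPM install, qla2300 failover)
/csminstall/csm/scripts/installpostreboot/  post-reboot scripts (sysctl network tuning)
/csminstall/csm/scripts/data/               script data, NFS-mounted on the node during installation
/cfmroot/                                   files distributed during the CFM file transfer stage
```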

Installing GPFS RPMs

GPFS requires each cluster member to have a base set of GPFS RPMs installed. The level of GPFS used for the example installation was 2.3.0-3. The installation of these RPMs is a two-stage process: installing the 2.3.0-1 base level and then updating to 2.3.0-3. The RPMs used for this installation are:

  • gpfs.base
  • gpfs.gpl
  • gpfs.msg.en_US

Note: Because the example uses GPFS Version 2.3, installation of Reliable Scalable Cluster Technology (RSCT) and creation of a peer domain is not required. Versions of GPFS before Version 2.3 do require those manual steps.

CSM can install the GPFS RPMs in a variety of ways. This article recommends installing the RPMs during the base operating system installation phase. CSM provides an installation and update directory structure to contain customized RPMs; however, this might not work well for an initial RPM installation followed by an upgrade of the same RPMs, as required by GPFS 2.3.0-3.

One alternative method is to write pre-reboot post-installation scripts for CSM to install the RPMs as required. In this case, copy all the GPFS RPMs, including the update RPMs, to a directory under /csminstall/Linux on the management server. The directory CSM usually reserves for script data is /csminstall/csm/scripts/data, which will be mounted on the node during installation, making the needed RPMs available using NFS.

Place an installation script in the /csminstall/csm/scripts/installprereboot/ directory to install GPFS. Here is an example installation script:

#! /bin/bash
# Installs GPFS filesets and updates to latest levels
# CSMMOUNTPOINT environment variable is set by CSM
cd $CSMMOUNTPOINT # NFS-mounted directory containing the copied GPFS RPMs
rpm -ivh gpfs.*2.3.0-1*.rpm
rpm -Uvh gpfs.*2.3.0-3*.rpm
# Make the GPFS binaries available on the PATH (any *.sh name works here)
echo 'export PATH=$PATH:/usr/lpp/mmfs/bin' > /etc/profile.d/gpfs.sh

Once you install GPFS on the storage servers, you might also want to automatically install the FAStT MSJ utility, which can be done in silent (non-interactive) mode. MSJ is used for configuration of the Qlogic adapters, failover, and multipathing, which is described in detail under HBA configuration. The installation is not RPM based, so it is not easily integrated into CSM by default. To accomplish the installation, you can add a script to the end of the GPFS installation to check whether the node is a storage server and install MSJ. To install in silent mode, use the command FAStTMSJ*_install.bin -i silent.
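A hedged sketch of that check follows; the stor* hostname test matches the node groups defined earlier, and the installer file name is the one used above (the script's surroundings, such as the working directory, are assumptions):

```shell
#! /bin/sh
# Sketch: run the silent MSJ installation only on storage servers, which in
# the example cluster have hostnames beginning with "stor".
is_storage_node() {
    case "$1" in
        stor*) return 0 ;;
        *)     return 1 ;;
    esac
}

if is_storage_node "$(hostname -s 2>/dev/null)"; then
    # Silent (non-interactive) installation of FAStT MSJ
    ./FAStTMSJ*_install.bin -i silent
fi
```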

Configuring Qlogic failover

The example cluster uses the Qlogic qla2300 driver, version 7.01.01, for the Qlogic QLA 2342 adapters. Each of the nodes in the storage node group has two of these PCI adapters. The qla2300 driver comes standard with the Red Hat Enterprise Linux 3 update 4 distribution. However, you need to make the following changes to suit the purposes of the example cluster:

  • Change the qla2300 driver to perform failover. This enables you to take advantage of more than one path to disk and allow failover to occur if the preferred path fails. This is not set by default.

    Make the first change using a script that is run before reboot during installation by CSM. The script that does this is in the directory /csminstall/csm/scripts/installprereboot/. The script contains the following commands:

    #! /bin/bash
    # Adds lines to /etc/modules.conf to enable failover for the qla2300 drivers
    echo "options qla2300 ql2xfailover=1 ql2xmaxsectors=512 ql2xmaxsgs=128" >> /etc/modules.conf
    echo "Updating initrd with new modules.conf set up"
    mkinitrd -f /boot/initrd-`uname -r`.img `uname -r`
  • Set the preferred path to disk on each host to match those set on each DS4500. Use odd-numbered arrays seen through HBA0, and use even-numbered arrays seen through HBA1.

    The second change needs to be made manually whenever a storage node is reinstalled. The details are covered in the Defining HBA configuration on storage servers section.

Tuning the GPFS network

The example adds several lines to the /etc/sysctl.conf file on each node to tune the network for GPFS. This is done with a post-reboot installation script run by CSM. The script is in the directory /csminstall/csm/scripts/installpostreboot and contains the following lines:


#! /bin/bash
# Adds lines to /etc/sysctl.conf for GPFS network tuning
FILE=/etc/sysctl.conf
echo "# CSM added the next 8 lines for GPFS network tuning" >> $FILE
echo "# increase Linux TCP buffer limits" >> $FILE
echo "net.core.rmem_max = 8388608" >> $FILE
echo "net.core.wmem_max = 8388608" >> $FILE
echo "# increase default and maximum Linux TCP buffer sizes" >> $FILE
echo "net.ipv4.tcp_rmem = 4096 262144 8388608" >> $FILE
echo "net.ipv4.tcp_wmem = 4096 262144 8388608" >> $FILE
echo "# increase max backlog to avoid dropped packets" >> $FILE
echo "net.core.netdev_max_backlog=2500" >> $FILE

# Following lines are not related to GPFS tuning
echo "# Allow Alt-SysRq" >> $FILE
echo "kernel.sysrq = 1" >> $FILE
echo "# Increase ARP cache size" >> $FILE
echo "net.ipv4.neigh.default.gc_thresh1 = 512" >> $FILE
echo "net.ipv4.neigh.default.gc_thresh2 = 2048" >> $FILE
echo "net.ipv4.neigh.default.gc_thresh3 = 4096" >> $FILE
echo "net.ipv4.neigh.default.gc_stale_time = 240" >> $FILE

# Reset the current kernel parameters
sysctl -p /etc/sysctl.conf

Distributing the GPFS portability layer

The GPFS portability layer (PL) is kernel-specific, and it must be created separately for each operating system level within your cluster. The purpose of the PL and the details of its creation for the example cluster are described in the Producing and installing the portability layer section. CSM manages the distribution of the PL binaries using the CFM file transfer facility. Copy the PL binaries into the /cfmroot/usr/lpp/mmfs/bin directory on the management server, and name them using CFM's ._groupname suffix so that each binary is distributed only to the nodes with the matching kernel version in the relevant group (for example, a file named mmfslinux._StorageNodes would go only to the StorageNodes group).
Note that in a large cluster, to reduce load on CFM, you can instead add these four files to a custom RPM and install it alongside GPFS using the method outlined above for installing the GPFS RPMs.

Automating the addition of new nodes to a GPFS cluster

Simply installing the GPFS RPMS and portability layer is not enough to mount and configure file systems within the GPFS cluster on the newly installed nodes. In a small cluster, this could be managed manually. However, scaling up to larger cluster sizes makes it worth automating this step. This can be done using the CSM monitoring capabilities by monitoring for completed new node installations and kicking off a script to configure and mount GPFS on the new node in the cluster.

Listing 1 shows an example script that can be used to configure GPFS. You might need to modify the script slightly for your configuration. Listing 1 provides the basics. The script takes the name of a node (as passed by CSM monitors), adds this to the GPFS cluster, and attempts to start GPFS on that node with some trivial error handling.

Listing 1. Example script for configuring GPFS

#! /bin/bash
# CSM condition/response script to be used as a response to the InstallComplete
# condition.  This will attempt to add the node to the GPFS cluster, dealing
# with some common failure conditions along the way.  Only trivial attempts are
# made at problem resolution; advanced problems are left for manual
# intervention.

# Note: requires the GPFS gpfs-nodes.list file.  This file should contain a list
# of all nodes in the GPFS cluster with client/manager and
# quorum/non-quorum details suitable for passing to the mmcrcluster command.

# Output is sent to /var/log/csm/<scriptname>.log

# Returned error codes:
# 1 - GPFS is already active
# 2 - unable to read the gpfs-nodes.list file
# 3 - node name not present in the gpfs-nodes.list file
# 4 - node is a quorum node
# 5 - unable to add node to cluster (mmaddnode failed)
# 6 - unable to start GPFS on the node (mmstartup failed)

# set this to the location of your node list file
gpfs_node_list=/path/to/gpfs-nodes.list
# set this to the interface GPFS is using for communication
gpfs_interface=eth1

PATH=$PATH:/usr/lpp/mmfs/bin # ensure GPFS binaries are in the PATH
log_file=/var/log/csm/`basename $0`.log # log to /var/log/csm/<scriptname>.log
touch $log_file

# Get the node short name as set by RSCT condition ENV var ERRM_RSRC_NAME
node=`echo $ERRM_RSRC_NAME | cut -d. -f1`

(

[ ! -r "$gpfs_node_list" ] && echo "** error: cannot read GPFS node list $gpfs_node_list" && exit 2

echo "--- Starting run of `basename $0` for $node at `date`"

# Is the node a quorum node? If so exit.
quorum_status=`grep $node $gpfs_node_list | cut -d: -f2 | cut -d- -f2`
if [ -n "$quorum_status" ]; then
	if [ "$quorum_status" = "quorum" ]; then
		echo "** error: this is a quorum node, stopping..."
		exit 4
	fi
	node_s=`grep $node $gpfs_node_list | cut -d: -f1`
else
	echo "** error: could not find node $node in GPFS node list $gpfs_node_list"
	exit 3
fi

# Find out if the node is already part of the cluster
if mmlscluster | grep $node >/dev/null; then

	# check the node status
	status=`mmgetstate -w $node | grep $node | awk '{print $3}'`
	if [ "$status" = "active" ]; then
		echo "** error: this node already appears to have GPFS active!"
		exit 1
	fi

	# attempt to remove node from cluster
	echo "Node $node is already defined to cluster, removing it"

	# attempt to disable storage interface on node
	if ssh $node ifdown $gpfs_interface; then
		mmdelnode $node
		ssh $node ifup $gpfs_interface
	else
		echo "** error: could not ssh to $node, or ifdown $gpfs_interface failed"
	fi
fi

# try to add node to GPFS cluster
if mmaddnode $node; then
	echo "Successfully added $node to GPFS cluster, starting GPFS on $node"
	if mmstartup -w $node; then
		echo "mmstartup -w $node succeeded"
	else
		echo "** error: cannot start GPFS on $node, please investigate"
		exit 6
	fi
else
	echo "** error: could not add $node to GPFS cluster"
	exit 5
fi

) >>$log_file 2>&1

You can use CSM to automatically run the script shown in Listing 1 when a new node finishes the base operating system installation, so that when the node boots, the GPFS file systems are automatically mounted. First you need to define the script as a response mechanism in the CSM monitor. For example: mkresponse -n SetupGPFS -s /path/to/script/SetupGPFS.

You now have a response called SetupGPFS that will run your script. Next you should associate this response to the default CSM condition NodeFullInstallComplete, as follows: startcondresp NodeFullInstallComplete SetupGPFS.

Now CSM automatically runs the script from the management server any time you install a new node. On the CSM management server, you should be able to see the NodeFullInstallComplete condition associated with the SetupGPFS response when you run the lscondresp command. The condition and response should be listed as Active.
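For illustration, the response script receives the node name through the ERRM_RSRC_NAME environment variable set by RSCT, and Listing 1 reduces it to a short hostname like this (the example value is hypothetical):

```shell
#! /bin/sh
# RSCT sets ERRM_RSRC_NAME to the resource (node) name, which may be fully
# qualified; the response script keeps only the short hostname.
ERRM_RSRC_NAME=stor001.cluster.example.com   # illustrative value
node=`echo $ERRM_RSRC_NAME | cut -d. -f1`
echo $node
```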

Addressing ROM overflow

There is a known issue with the amount of ROM space available on an xSeries 346 that creates PCI allocation errors during boot. Messages indicate that the system ROM space is full, and it has no more room for additional adapters that use ROM space (see Related topics for more details).

This problem affects the storage nodes: if PXE boot is enabled, there is not sufficient space for the Qlogic PCI adapters to initialize properly. One workaround is the following:

  1. Disable PXE boot on the Broadcom PCI card used for the GPFS network. Using the downloadable diag facility b57udiag -cmd, choose the device and then disable PXE boot.
  2. Use PXE boot to install the node using CSM, and then disable PXE boot for both onboard adapters using the BIOS (hence the order described in the Installing storage nodes in the correct order section).

Another workaround to avoid this issue is to use a RAID 7K card in each xSeries 346. This reduces the amount of ROM the SCSI BIOS uses and allows the Qlogic BIOS to load successfully, even with PXE boot enabled.

Defining HBA configuration on storage servers

The HBAs used on the xSeries 346 storage servers in the example cluster are the IBM DS4000 FC2-133 Host Bus Adapter (HBA) models. These are also known as Qlogic 2342 adapters. The example uses firmware version 1.43 and, as mentioned in the previous section, the v7.01.01-fo qla2300 driver. The -fo on this driver denotes failover, which is not the default option for this driver. This is enabled by changing the settings in the /etc/modules.conf on each storage node. This is set using CSM during install and is described in the Configuring Qlogic failover section.

The next section describes the steps needed to update firmware and settings on the HBAs on each storage server and the manual process required on each reinstall to enable load balancing between the two HBAs.

Downloading HBA firmware

You can download the firmware for the FC2-133 HBAs from the IBM System x support Web site (see Related topics). The firmware can be updated using IBM Management Suite Java or using a bootable diskette and the flasutil program.

Configuring HBA settings

For the example cluster, the following settings were changed from the default on the HBAs. These values are in the README provided with the driver download. You can make this change using the Qlogic BIOS, which you can reach on boot using <ctrl>-q when prompted, or using the MSJ utility. Here are the settings:

  • Host adapter settings
    • Loop reset delay: 8
  • Advanced adapter settings
    • LUNs per target: 0
    • Enable target reset : Yes
    • Port down retry count: 12

Installing IBM Management Suite Java

IBM FAStT Management Suite Java (MSJ) is a Java-based GUI application that manages the HBAs in the storage servers. It can be used for configuration and diagnostics. See Related topics for a link to download the software.

The example setup uses CSM to install MSJ on every storage node as part of the GPFS installation. The binary is part of the tar file containing the GPFS RPMs, which CFM distributes during CSM node installation. A post-installation script uncompresses this tar file and then runs the installation script contained inside it. The example uses the 32-bit FAStT MSJ in this installation to avoid potential problems installing the 64-bit version. The example script uses the following command to install MSJ: FAStTMSJ*_install.bin -i silent.

This installs both the application and the agent. Note that because this is a 32-bit version of MSJ, and even though the example uses the silent installation, the code looks for and loads 32-bit versions of some libraries. Therefore, the 32-bit version of XFree86-libs must be installed in addition to the 64-bit version included with the base 64-bit installation. The 32-bit libraries are contained in XFree86-libs-4.3.0-78.EL.i386.rpm, which is included in the tar file. The installation of this RPM is handled by the script, which then installs MSJ.

Configuring paths to disk and load balancing

MSJ is required on each storage node to manually configure the paths to the arrays on the DS4500s and load balancing between the two HBAs on each computer. Without this configuration, all the arrays would by default be accessed through the first adapter on each computer, HBA0, and consequently through controller A on each DS4500. By spreading the disks between the HBAs, and hence between the controllers on the DS4500s, you balance the load and enhance the performance of the back end.

Note that configuration of load balancing is a manual step that must be performed on each storage node each time it is reinstalled. For the example cluster, here are the steps to configure load balancing:

  1. From a local computer, open a session to the newly installed server with X forwarding enabled (ssh -X <nodename>).
  2. In one session, run # qlremote.
  3. In another session, run # /usr/FAStT MSJ & to launch the MSJ GUI.
  4. From the MSJ GUI, highlight one of the adapters under the HBA tab and choose Configure. A window similar to that shown in Figure 1 appears.
    Figure 1. View of MSJ when selecting a DS4500
  5. To enable load balancing, right-click the node name representing the storage subsystem, and choose LUNs > Configure LUNs from the menu. The LUN configuration window appears. You can automatically configure load balancing by choosing Tools > Load Balance. You then see a window similar to that shown in Figure 2.
    Figure 2. View of MSJ when configuring failover
  6. When the logical drives are configured, close the LUN configuration window and save the configuration to the host system from the Port Configuration window (which has a default password of config). If the configuration is saved successfully, you see a confirmation. The configuration is saved as a file called /etc/qla2300.conf, and new options are added to the qla2300 driver line in /etc/modules.conf to indicate that this file exists and should be used.
  7. Switch back to the window where the qlremote process was started and stop it, using <ctrl>-c. This is an important step.
  8. To enable the new configuration, reload the driver module qla2300. This cannot be done if the disk is mounted on the Fibre Channel subsystem attached to an adapter that uses this driver. Configure the host adapter driver to be loaded through an initial RAM disk, which applies the configuration data for redundant disks when loading the adapter module at boot time. Note that whenever the configuration of the logical drives changes, this procedure must be followed to save a valid configuration to the system.

One of the most efficient ways to use MSJ in a setup where more than one storage node needs load balancing configured is to keep MSJ open on one node, run qlremote on each of the other nodes, and then use the one MSJ session to connect to the others in the same half.

Configuring a GPFS cluster

This section covers in detail the steps taken in the creation of a GPFS cluster. It assumes that all nodes have been installed and configured as described earlier in this article, or that the following configuration has been performed manually:

  • GPFS RPMs are installed on each computer.
  • PATH has been changed to include the GPFS binary directory.
  • A storage interface is configured.
  • Root can ssh between nodes without a password.
  • Network tuning settings in sysctl are complete.
  • NSD servers can see a SAN disk.

You can find a detailed description of the GPFS architecture for the example cluster in the "Storage architecture" section in Part 3 of this series.

Read this section of the article in parallel with the GPFS documentation (see Related topics), in particular the following:

  • GPFS V2.3 Administration and Programming Reference, which contains details of many administration tasks and the GPFS commands.
  • GPFS V2.3 Concepts, Planning, and Installation Guide, which details planning considerations for a GPFS cluster and steps to take during installation of a new cluster.
  • GPFS V2.3 Problem Determination Guide, which contains steps to take when troubleshooting, and it contains GPFS error messages.

Producing and installing the portability layer

The GPFS portability layer (PL) is a set of binaries that need to be built locally from source code to match the Linux kernel and configuration on a computer that is to be part of a GPFS cluster. For the example cluster, this was done on one of the storage nodes. The resulting files were copied to each node using CSM and CFM. (See the Distributing the GPFS portability layer section for more details). This is a valid method, because all computers are the same architecture and use the same kernel. The instructions to build the GPFS PL can be found in /usr/lpp/mmfs/src/README. The process for the example cluster is as follows:

  1. Export SHARKCLONEROOT=/usr/lpp/mmfs/src.
  2. Type cd /usr/lpp/mmfs/src/config, cp site.mcr.proto site.mcr.
  3. Edit the new file site.mcr to match the configuration to be used. Leave the following lines uncommented:
    • #define GPFS_LINUX
    • #define GPFS_ARCH_X86_64
    • #define LINUX_KERNEL_VERSION 2042127
    (Note that a # does not indicate a comment.)
  4. Type cd /usr/lpp/mmfs/src.
  5. Create the GPFS PL using make World.
  6. Copy the GPFS PL to the /usr/lpp/mmfs/bin directory using make InstallImages. The GPFS PL consists of the following four files:
    • tracedev
    • mmfslinux
    • lxtrace
    • dumpconv
  7. Copy a set of these files, one for each of the relevant kernels used, into the CSM structure for distribution using CFM.

Creating a GPFS cluster

You create the GPFS cluster for this example in several distinct steps. While not all of the steps are strictly necessary, this approach deals well with the different types of nodes in the cluster (storage nodes and others).

The first step is to create a cluster containing only the storage nodes and the quorum node: five nodes in total. When creating the cluster, use a node descriptor file that contains the short hostname of the storage interface of each node to be included, followed by the following information:

  • Manager or client: Defines whether the node should form part of the pool from which the configuration and file system managers are picked. The example cluster includes only the storage nodes in this pool.
  • Quorum or nonquorum: Defines whether the node should be counted as a quorum node. The quorum nodes in the example cluster are the storage nodes and the tiebreaker node quor001.

The command to create the cluster is the following:

mmcrcluster -n stor.nodes -C gpfs1 -p stor001_s -s stor002_s -r /usr/bin/ssh -R /usr/bin/scp
  • The -C flag sets the name of the cluster.
  • The -p sets the primary configuration server node.
  • The -s sets the secondary configuration server node.
  • The -r sets the full path for the remote shell program to be used by GPFS.
  • The -R sets the remote file copy program to be used by GPFS.

Here is the stor.nodes node descriptor file used in the example:

stor001_s:manager-quorum
stor002_s:manager-quorum
stor003_s:manager-quorum
stor004_s:manager-quorum
quor001_s:client-quorum
Use entries similar to <nodename>_s:client-nonquorum in later stages for all the other nodes to be added to the cluster, such as compute nodes, user nodes, and management nodes.

Enabling unmountOnDiskFail on quorum node

The next step is to enable the unmountOnDiskFail option on the tiebreaker node using mmchconfig unmountOnDiskFail=yes quor001_s. This prevents false disk errors in the SAN configuration from being reported to the file system manager.

Defining network shared disks

The next step is to create the disks used by GPFS using the command mmcrnsd -F disk#.desc. Running this command creates a global name for each disk, which is a necessary step, because disks might have different /dev names on each node in the GPFS cluster. Run this command on all disks to be used for the GPFS file system. At this point, define the primary and secondary NSD servers for each disk; these are used for I/O operations on behalf of the NSD clients, which have no local access to the SAN storage.

The -F flag points to a file containing disk descriptors for the disks to be defined as NSDs. For manageability in the example cluster, complete this process once on the LUNs presented by each DS4500 and once on the tiebreaker disk. Each array or LUN on each DS4500 has a descriptor in the files used. Following is an example line from disk1.desc (the device and NSD names shown are illustrative):

/dev/sdb:stor001_s:stor002_s:dataAndMetadata:1:disk01_1
Following are the fields in this line, in order:

  • Local disk name on primary NSD server
  • Primary NSD server
  • Secondary NSD server
  • Type of data
  • Failure group
  • Name of resulting NSD

The descriptor files define three failure groups in this configuration:

  • The disks in the first DS4500, described in disk1.desc.
  • The disks in the second DS4500, described in disk2.desc.
  • The tiebreaker disk on the quorum node.

Starting GPFS

The next step is to start GPFS cluster-wide following these steps:

  1. Start GPFS on all of the NSD servers at the same time to prevent NSDs from being marked as down. Use the following command: mmstartup -w stor001_s,stor002_s,stor003_s,stor004_s.
  2. Start GPFS on all other nodes that are not NSD servers (including the tiebreaker node). Use the following command: mmstartup -w quor001_s,mgmt001_s,...
  3. Start GPFS on all compute nodes from the management node. Use the following command: dsh -N ComputeNodes /usr/lpp/mmfs/bin/mmstartup.
  4. Check the status of all nodes by monitoring the /var/adm/log/mmfs.log.latest file on the current file system manager (found using the command mmlsmgr <filesystem>) and by checking the output of the following commands: mmgetstate -w <nodenames> and dsh -N ComputeNodes /usr/lpp/mmfs/bin/mmgetstate.

This method might seem overly cautious, but it was chosen as a scalable approach that works for a very large cluster. An alternative to the steps above is the command mmstartup -a. This works for smaller clusters, but it can take a long time to return on a larger cluster where some nodes might be unreachable, for example because of network issues.
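The status check in step 4 can be partly automated by filtering the mmgetstate output for nodes that are not yet active. A minimal sketch follows; the sample output written to state.txt below is an assumption for illustration, not captured from the cluster:

```shell
# Write an illustrative mmgetstate-style report (assumed format, two
# header lines followed by one line per node ending in the GPFS state).
cat <<'EOF' > state.txt
 Node number  Node name  GPFS state
------------------------------------
       1      stor001_s  active
       2      stor002_s  arbitrating
EOF

# Count the nodes whose state is anything other than "active"
awk 'NR > 2 && $NF != "active"' state.txt | wc -l
```

In practice you would pipe the real mmgetstate output through the same filter instead of a saved file.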

Creating GPFS file system

For the example, one large GPFS file system is created using all the NSDs defined to GPFS. Note that the command takes as an argument the disk descriptor files as rewritten by the mmcrnsd command above, so you must concatenate the rewritten file from each NSD-creation step into one file.
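The concatenation step can be sketched as follows; the file names and the rewritten descriptor contents below are illustrative stand-ins, since mmcrnsd rewrites its -F input file in place:

```shell
# Illustrative stand-ins for the descriptor files after mmcrnsd
# rewrote them (one per DS4500 plus one for the tiebreaker disk).
printf 'nsd_ds1_1:::dataAndMetadata:1\n' > disk1.desc
printf 'nsd_ds2_1:::dataAndMetadata:2\n' > disk2.desc
printf 'nsd_tie:::descOnly:3\n'          > tie.desc

# Concatenate into the single descriptor file that mmcrfs -F expects
cat disk1.desc disk2.desc tie.desc > disk_all.desc
```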

The example cluster uses the following settings:

  • All NSDs (set using -F)
  • Mountpoint: /gpfs
  • Automount: yes (set using -A)
  • Blocksize: 256KB (set using -B)
  • Replication: two copies of both data and metadata (set using -m, -M, -r, -R)
  • Estimated number of nodes mounting the file system: 1200 (set using -n)
  • Quotas enabled (set using -Q)

Here is the complete command:

mmcrfs /gpfs /dev/gpfs -F disk_all.desc -A yes -B 256K -m 2 -M 2 -r 2 -R 2 -n 1200 -Q yes

After creating /gpfs, it is mounted manually for the first time. After this, with automount enabled, it mounts automatically when a node starts GPFS.
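The first manual mount can be issued cluster-wide from a single node with mmmount; a sketch, using the device name gpfs from the mmcrfs command above:

```
mmmount gpfs -a    # mount /gpfs on all nodes in the cluster
```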

Enabling quotas

The -Q flag on the above mmcrfs command enables quotas on the /gpfs file system. Quotas can be defined for individual users or groups of users. A default quota level has also been set that applies to any new user or group. Default quotas are turned on using the command mmdefquotaon. Default quotas are edited using the command mmdefedquota. This command opens an edit window in which you can specify the limits. Following is an example of setting limits for the quota:

gpfs: blocks in use: 0K, limits (soft = 1048576K, hard = 2097152K)
      inodes in use: 0, limits (soft = 0, hard = 0)

You can edit the quota for a specific user or group using the command mmedquota -u <username>. Users can display their own quotas using the command mmlsquota. The superuser can display the status of all quotas for the file system using the command mmrepquota gpfs.
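Put together, the quota workflow looks like the following sketch; "alice" is an illustrative user name, and the exact flags follow the GPFS manuals of this era:

```
mmdefquotaon gpfs        # turn on default quotas for the file system
mmdefedquota -u gpfs     # edit the default user quota (opens an editor)
mmedquota -u alice       # override the default for one user
mmlsquota                # a user displays his own quota
mmrepquota gpfs          # superuser reports quotas file-system-wide
```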


Tuning GPFS

This cluster is configured so that GPFS starts automatically whenever a server boots; the command mmchconfig autoload=yes adds the required entry to /etc/inittab.

GPFS uses its pagepool to cache user data and file system metadata. The pagepool mechanism allows GPFS to service read and write requests asynchronously. Increasing the size of the pagepool increases the amount of data or metadata that GPFS can cache without requiring synchronous I/O. The default pagepool size is 64 MB; the minimum allowed value is 4 MB, and the maximum is 8 GB. On Linux systems, the maximum pagepool size is half of the physical memory in the computer.

The optimal size of the pagepool depends on the needs of the application and effective caching of its re-accessed data. For systems with applications that access large files, reuse data, benefit from GPFS prefetching of data, or have a random I/O pattern, increasing the value for pagepool might prove beneficial. However, if the value is set too high, GPFS will not start.

For the example cluster, use the value of 512 MB for pagepool for all nodes in the cluster.
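The pagepool size is set cluster-wide with mmchconfig; a minimal sketch (GPFS must be restarted on a node for the new value to take effect there):

```
mmchconfig pagepool=512M    # applies to all nodes in the cluster
```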

Optimizing with network settings

To optimize the performance of the network and, hence, GPFS, enable jumbo frames by setting the MTU of the storage network adapter to 9000. Leave /proc/sys/net/ipv4/tcp_window_scaling enabled, which is the default setting. The TCP window settings are tuned by CSM scripts at installation time, which add the following lines to /etc/sysctl.conf on both the NSD servers and the NSD clients:

# increase Linux TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# increase default and maximum Linux TCP buffer sizes
net.ipv4.tcp_rmem = 4096 262144 8388608
net.ipv4.tcp_wmem = 4096 262144 8388608
# increase max backlog to avoid dropped packets
net.core.netdev_max_backlog = 2500

Configuring DS4500 settings

Storage server cache settings can impact GPFS performance if they are not set correctly. The example uses the following settings on the DS4500s, as recommended in the GPFS documentation:

  • Read cache: enabled
  • Read ahead multiplier: 0
  • Write cache: disabled
  • Write cache mirroring: disabled
  • Cache block size: 16K


That is it! You should have successfully installed a large Linux cluster following the example in this series of articles. Apply the principles to your own installation for another successful large Linux cluster installation.


Related topics


