This is the fourth and final part of a series that covers the installation and setup of a large Linux computer cluster. The purpose of the series is to bring together in one place up-to-date information from various sources in the public domain about the process required to create a working Linux cluster from many separate pieces of hardware and software. These articles are not intended to provide the basis for the complete design of a new large Linux cluster; refer to the relevant reference materials and IBM Redbooks® mentioned throughout for general architecture pointers.
This series is intended for systems architects and systems engineers who plan and implement a Linux cluster using the IBM eServer™ Cluster 1350 framework (see Resources for more information about the framework). Some parts might also be relevant to cluster administrators for educational purposes and during normal cluster operation. Each part of this series refers to the same example installation.
Part 1 of the series provides detailed instructions for setting up the hardware for the cluster. Part 2 takes you through the next steps after hardware configuration: software installation using the IBM systems management software, Cluster Systems Management (CSM), and node installation.
Part 3 and this article describe the storage back-end of the cluster, covering the storage hardware configuration and the installation and configuration of the IBM shared file system, General Parallel File System (GPFS). Part 3 takes you through the architecture of the storage system, hardware preparation, and details about setting up a Storage Area Network. This, the fourth and final part of the series, provides details about CSM specifics related to the storage backend of the example cluster, notably performing node installation for the storage system, and GPFS cluster configuration.
Detailing node installation specifics
This section details the Cluster Systems Management (CSM) specifics related to the storage backend of the example cluster. These include the installation of the GPFS code on each node and the configuration of the Qlogic adapters for the storage nodes. Note that this configuration does not have to be performed using CSM; it can be done manually. The example in this article uses CSM to almost totally automate the installation of a new server, including a storage server.
Reviewing matters of architecture
Before reading this section, you can benefit from reviewing the General cluster architecture section in Part 1 of this series. You can also benefit from reading the section on storage architecture in Part 3. An understanding of the architecture will give you the context you need to make the best use of the information that follows.
Installing storage nodes in the correct order
Installing in the correct order is necessary to work around the ROM overflow issue described later in this article, because the xSeries™ 346 systems used in this configuration do not have RAID 7K cards. Complete the steps in the following order:
- Run the command csmsetupks -vxn <nodename> on the management server.
- Disconnect the storage server from the SAN to avoid installation of the operating system on SAN disks, which are discovered first.
- Run installnode -vn <nodename> on the management server.
- Press F1 from the console when the storage node reboots to enter the BIOS.
- Go into Start Options and change PXEboot from disabled to enabled for planar ethernet 1.
- Let the node reboot, and installation starts.
- Monitor installation from the terminal server, letting the node boot fully.
- Log into the node and monitor the CSM installation log /var/log/csm/install.log.
- Reboot the node when the post reboot tasks have finished.
- Press F1 from the console when the node restarts to enter the BIOS.
- Go into Start Options and change PXEboot to disabled.
- Plug in the SAN cables and let the node boot fully.
- Configure the paths to disk using MSJ as explained under Configuring paths to disk and load balancing.
Providing passwordless root access between nodes
GPFS requires that all nodes in the GPFS cluster have the ability to access each other using the root ID with no password provided. GPFS uses this internode access to allow any node in the GPFS cluster to run relevant commands on other nodes. The example here uses secure shell (ssh) to provide access; however, you can also use remote shell (rsh). To do this, create a cluster-wide key and associated configuration files, and distribute them with CSM following these steps:
- Create two new directories at /cfmroot/root/.ssh and /cfmroot/etc/ssh.
- Create an RSA key pair, public and private keys for authentication, by typing
    ssh-keygen -b 1024 -t rsa -f /cfmroot/etc/ssh/ssh_host_rsa_key -N "" -C "RSA_Key"
- Create a DSA key pair, public and private keys for authentication, by typing
    ssh-keygen -b 1024 -t dsa -f /cfmroot/etc/ssh/ssh_host_dsa_key -N "" -C "DSA_Key"
- Create an authorization file containing the public keys, as shown below. This is the file SSH uses to determine whether to prompt for a password.
    cat /root/.ssh/id_rsa.pub > /cfmroot/root/.ssh/authorized_keys2
    cat /root/.ssh/id_dsa.pub >> /cfmroot/root/.ssh/authorized_keys2
    cat /cfmroot/etc/ssh/ssh_host_rsa_key.pub >> /cfmroot/root/.ssh/authorized_keys2
    cat /cfmroot/etc/ssh/ssh_host_dsa_key.pub >> /cfmroot/root/.ssh/authorized_keys2
- Stop CSM from maintaining the known_hosts file, as shown below. This is a file containing names of hosts. If a host is listed in the file, SSH does not prompt the user for connection confirmation. CSM attempts to maintain this file, but in a fixed cluster environment with passwordless root access, this can be a hindrance.
    stopcondresp NodeFullInstallComplete SetupSSHAndRunCFM
    startcondresp NodeFullInstallComplete RunCFMToNode
    perl -pe 's!(.*update_known_hosts.*)!#$1!' -i /opt/csm/csmbin/RunCFMToNode
- Generate a system-wide known_hosts file. This is best done by creating a script, as shown below. Run the script and direct the output to /cfmroot/root/.ssh/known_hosts.
    #!/bin/bash
    RSA_PUB=$(cat "/cfmroot/etc/ssh/ssh_host_rsa_key.pub")
    DSA_PUB=$(cat "/cfmroot/etc/ssh/ssh_host_dsa_key.pub")
    for node in $(lsnodes); do
        ip=$(grep $node /etc/hosts | head -n 1 | awk '{print $1}')
        short=$(grep $node /etc/hosts | head -n 1 | awk '{print $3}')
        echo $ip,$node,$short $RSA_PUB
        echo $ip,$node,$short $DSA_PUB
    done
  This example script works for a single interface. You can modify it trivially to allow passwordless connection across multiple interfaces. The format of the known_hosts file is beyond the scope of this article, but it is useful to take advantage of the comma-separated host names for each line.
- Allow passwordless root access by linking in the generated keys, as shown below.
    cd /cfmroot/root/.ssh
    ln -s ../../etc/ssh/ssh_host_dsa_key id_dsa
    ln -s ../../etc/ssh/ssh_host_dsa_key.pub id_dsa.pub
    ln -s ../../etc/ssh/ssh_host_rsa_key id_rsa
    ln -s ../../etc/ssh/ssh_host_rsa_key.pub id_rsa.pub
- You might want to ensure this configuration is installed onto each system at installation time, before the operating system is rebooted for the first time. CSM makes no guarantee about the order in which things are run post-installation, so any post-installation task that relies on this configuration being present could fail. It might also succeed on some runs and fail on others, giving the impression of an intermittent failure. For example, you might have a GPFS post-installation script that needs to add a node into the GPFS cluster and mount GPFS file systems. One way to achieve this would be to create a tar archive of all the files created here and unpack them using a CSM post-installation pre-reboot script.
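The tar archive approach suggested in the last step could look something like the following. This is a minimal sketch rather than part of the example installation: the function names and the archive name are hypothetical, and it assumes that the key files and links have already been staged under /cfmroot as described in the steps above.

```shell
#!/bin/bash
# Sketch only: pack_ssh_config and unpack_ssh_config are hypothetical helpers.

# Pack the SSH files staged under a CFM root into a compressed tar archive.
# $1 = CFM root (for example /cfmroot), $2 = archive to create (absolute path)
pack_ssh_config() {
    local cfmroot="$1" archive="$2"
    # -C makes the archived paths relative to the CFM root, so they land in
    # /root/.ssh and /etc/ssh when the archive is extracted at /
    tar -czf "$archive" -C "$cfmroot" root/.ssh etc/ssh
}

# Unpack the archive on a node.
# $1 = archive, $2 = directory to extract into (defaults to /)
unpack_ssh_config() {
    local archive="$1" target="${2:-/}"
    # -p keeps the restrictive permissions that ssh expects on key files
    tar -xzpf "$archive" -C "$target"
}
```

On the management server you might run pack_ssh_config /cfmroot /csminstall/csm/scripts/data/ssh-config.tar.gz, and then call unpack_ssh_config $CSMMOUNTPOINT/csm/scripts/data/ssh-config.tar.gz from a pre-reboot script. Because tar stores the relative symbolic links as links, id_rsa and id_dsa still point at the host keys after extraction.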
Defining GPFS-related CSM groups
For this example, two main CSM groups are defined for use during the GPFS configuration, as shown below.
- StorageNodes, which includes only those nodes that are attached directly to the SAN. For example:
    nodegrp -w "Hostname like 'stor%'" StorageNodes
- NonStorageNodes, which includes all other nodes that are part of the GPFS cluster. For example:
    nodegrp -w "Hostname not like 'stor%'" NonStorageNodes
These groups are used during installation to ensure servers that perform storage node roles receive specific binary files and configuration files, which are detailed below. Note that this section does not cover the detailed process of installation as performed by CSM. See Part 1 and Part 2 of this series for instructions for this process.
To summarize, the installation process goes through the following stages:
- PXE boot/DHCP from installation server
- NFS installation from installation server
- Pre-reboot scripts
- Reboot
- Post-reboot scripts
- CFM file transfer
- CSM post installation configuration
The configuration changes in this article occur during the pre-reboot and CFM file transfer stages.
GPFS requires each cluster member to have a base set of GPFS RPMs installed. The level of GPFS used for the example installation was 2.3.0-3. The installation of these RPMs is a two-stage process: installing the 2.3.0-1 base level and then updating to 2.3.0-3. The RPMs used for this installation are:
- gpfs.base
- gpfs.docs
- gpfs.gpl
- gpfs.msg.en_US
Note: Because the example uses GPFS Version 2.3, installation of Reliable Scalable Cluster Technology (RSCT) and creation of a peer domain is not required. Versions of GPFS before Version 2.3 do require those manual steps.
CSM can install the GPFS RPMs in a variety of ways. This article recommends installing the RPMs during the base operating system installation phase. CSM provides an installation and update directory structure to contain customized RPMs; however, this might not work well for an initial RPM installation followed by an upgrade of the same RPMs, as required by GPFS 2.3.0-3.
One alternative method is to write pre-reboot post-installation scripts for
CSM to install the RPMs as required. In this case, copy all the GPFS RPMs,
including the update RPMs, to a directory under
/csminstall/Linux on the management server. The
directory CSM usually reserves for script data is
/csminstall/csm/scripts/data, which will be mounted on
the node during installation, making the needed RPMs available using NFS.
Write the installation script
/csminstall/csm/scripts/installprereboot/install-gpfs.sh
to install GPFS. Here is an example installation script:
#! /bin/bash
# Installs GPFS filesets and updates to latest levels
# CSMMOUNTPOINT environment variable is set by CSM
DATA_DIR=$CSMMOUNTPOINT/csm/scripts/data
cd $DATA_DIR
rpm -ivh gpfs.*2.3.0-1*.rpm
rpm -Uvh gpfs.*2.3.0-3*.rpm
echo 'export PATH=$PATH:/usr/lpp/mmfs/bin' > /etc/profile.d/gpfs.sh
Once you install GPFS on the storage servers, you might also want to automatically
install the FAStT MSJ utility, which can be done in silent (non-interactive) mode.
MSJ is used for configuration of the Qlogic adapters, failover, and multipathing,
which is described in detail under HBA configuration. The
installation is not RPM based, so it is not easily integrated into CSM by default.
To accomplish the installation, you can add a script to the end of the GPFS
installation to check whether the node is a storage server and install
MSJ. To install in silent mode, use the command
FAStTMSJ*_install.bin -i silent
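The check for a storage server could be sketched as follows. This is only an illustration: the function names are hypothetical, and the test on the host name simply mirrors the 'stor%' pattern used to define the StorageNodes group, which is an assumption about how your storage servers are named.

```shell
#!/bin/bash
# Sketch only: both functions are hypothetical helpers, not part of CSM or MSJ.

# Succeed if the given host name belongs to a storage server. Any domain
# part is stripped first, so stor001 and stor001.example.com both match.
is_storage_node() {
    case "${1%%.*}" in
        stor*) return 0 ;;
        *)     return 1 ;;
    esac
}

# Install MSJ in silent mode, but only when running on a storage server.
# $1 = directory that holds the MSJ installer
install_msj_if_storage() {
    local data_dir="$1" host
    host=$(hostname)
    if is_storage_node "$host"; then
        # Run in a subshell so the caller's working directory is unchanged
        (cd "$data_dir" && ./FAStTMSJ*_install.bin -i silent)
    else
        echo "$host is not a storage node, skipping MSJ installation"
    fi
}
```

Calling install_msj_if_storage "$DATA_DIR" at the end of the GPFS installation script would then install MSJ only where it is needed.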
The example cluster uses the Qlogic qla2300 driver, version 7.01.01, for the Qlogic QLA 2342 adapters. Each of the nodes in the storage node group has two of these PCI adapters. The qla2300 driver comes standard with the Red Hat Enterprise Linux 3 update 4 distribution. However, you need to make the following changes to suit the purposes of the example cluster:
- Change the qla2300 driver to perform failover. This enables you to take advantage of more than one path to disk and allow failover to occur if the preferred path fails. This is not set by default.
  Make the first change using a script that is run before reboot during installation by CSM. The script that does this is in the directory /csminstall/csm/scripts/installprereboot/. The script contains the following commands:
    #! /bin/bash
    # Adds lines to /etc/modules.conf to enable failover for the qla2300 drivers
    echo "options qla2300 ql2xfailover=1 ql2xmaxsectors=512 ql2xmaxsgs=128" >> /etc/modules.conf
    echo "Updating initrd with new modules.conf set up"
    mkinitrd -f /boot/initrd-`uname -r`.img `uname -r`
- Set the preferred path to disk on each host to match those set on each DS4500. Use odd-numbered arrays seen through HBA0, and use even-numbered arrays seen through HBA1.
  The second change needs to be made manually whenever a storage node is reinstalled. The details are covered in the Defining HBA configuration on storage servers section.
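The failover script appends its options line unconditionally, so running it a second time on the same installation would add a duplicate line to /etc/modules.conf. If that matters in your environment, a small guard such as the following could be used. This is a sketch under that assumption; ensure_line is a hypothetical helper, not something provided by CSM.

```shell
#!/bin/bash
# Sketch only: ensure_line is a hypothetical helper.

# Append a line to a file unless an identical line is already present.
# $1 = file to update, $2 = exact line to add
ensure_line() {
    local file="$1" line="$2"
    # -q: no output, -x: match the whole line, -F: treat the text literally
    if ! grep -qxF -- "$line" "$file" 2>/dev/null; then
        printf '%s\n' "$line" >> "$file"
    fi
}
```

The call in the pre-reboot script would then read ensure_line /etc/modules.conf "options qla2300 ql2xfailover=1 ql2xmaxsectors=512 ql2xmaxsgs=128", followed by the same mkinitrd command as before.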
The example offers several lines to add to the /etc/sysctl.conf
file on each node to tune the network for GPFS. This is done using a post-reboot
installation script using CSM. The script is in the directory
/csminstall/csm/scripts/installpostreboot and contains the following lines:
FILE=/etc/sysctl.conf
# Adds lines to /etc/sysctl.conf for GPFS network tuning
echo "# CSM added the next 8 lines to the post installation script for GPFS network tuning" >> $FILE
echo "# increase Linux TCP buffer limits" >> $FILE
echo "net.core.rmem_max = 8388608" >> $FILE
echo "net.core.wmem_max = 8388608" >> $FILE
echo "# increase default and maximum Linux TCP buffer sizes" >> $FILE
echo "net.ipv4.tcp_rmem = 4096 262144 8388608" >> $FILE
echo "net.ipv4.tcp_wmem = 4096 262144 8388608" >> $FILE
echo "# increase max backlog to avoid dropped packets" >> $FILE
echo "net.core.netdev_max_backlog=2500" >> $FILE
# Following lines are not related to GPFS tuning
echo "# Allow Alt-SysRq" >> $FILE
echo "kernel.sysrq = 1" >> $FILE
echo "# Increase ARP cache size" >> $FILE
echo "net.ipv4.neigh.default.gc_thresh1 = 512" >> $FILE
echo "net.ipv4.neigh.default.gc_thresh2 = 2048" >> $FILE
echo "net.ipv4.neigh.default.gc_thresh3 = 4096" >> $FILE
echo "net.ipv4.neigh.default.gc_stale_time = 240" >> $FILE
# Reset the current kernel parameters
sysctl -p /etc/sysctl.conf
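After the script has run, it can be useful to confirm that the running kernel actually carries the values written to the file. The following sketch shows one way to do that. The function names are hypothetical, and the parsing assumes simple key = value lines like the ones the script above writes.

```shell
#!/bin/bash
# Sketch only: all three functions are hypothetical helpers.

# Collapse runs of blanks and tabs so that values can be compared as text.
normalize() {
    awk '{ $1 = $1; print }'
}

# Print the last value assigned to a key in a sysctl.conf-style file.
# $1 = file, $2 = key (for example net.core.rmem_max)
conf_value() {
    local file="$1" key="$2"
    awk -F= -v key="$key" '
        /^[ \t]*#/ { next }                   # skip comment lines
        NF >= 2 {
            k = $1
            gsub(/^[ \t]+|[ \t]+$/, "", k)    # trim blanks around the key
            if (k == key) last = $2
        }
        END { if (last != "") print last }
    ' "$file" | normalize
}

# Succeed if the running kernel value matches the value in the file.
# $1 = file, $2 = key
check_sysctl() {
    local file="$1" key="$2" want have
    want=$(conf_value "$file" "$key")
    have=$(sysctl -n "$key" 2>/dev/null | normalize)
    [ -n "$want" ] && [ "$want" = "$have" ]
}
```

For example, check_sysctl /etc/sysctl.conf net.core.rmem_max || echo "rmem_max not applied" could be appended to the post-reboot script, or run across the cluster with dsh.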
Distributing the GPFS portability layer
The GPFS portability layer (PL) is kernel-specific, and it must be created
separately for each operating system level within your cluster. The purpose of the
PL and the details of creation for the example cluster are described in the Producing and installing the portability layer section. CSM manages the distribution of
the PL binaries using the CFM file transfer facility. Copy the
PL binaries into the /cfmroot/usr/lpp/mmfs/bin
directory on the management servers and name them so that they are only
distributed to the nodes with specific kernel versions in the relevant groups. For
example:
/cfmroot/usr/lpp/mmfs/bin/dumpconv._<nodegroup>
/cfmroot/usr/lpp/mmfs/bin/lxtrace._<nodegroup>
/cfmroot/usr/lpp/mmfs/bin/mmfslinux._<nodegroup>
/cfmroot/usr/lpp/mmfs/bin/tracedev._<nodegroup>
Note that in a large cluster, in order to reduce load on CFM, it is possible to add these four files into a custom RPM and install with GPFS using the method outlined above for installing the GPFS RPMs.
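Staging the four binaries under the CFM naming convention can be scripted. The following is a minimal sketch: the function name is hypothetical, and it assumes the binaries built for one kernel level have already been copied to a directory on the management server.

```shell
#!/bin/bash
# Sketch only: stage_portability_layer is a hypothetical helper.

# Copy the four portability layer binaries into a CFM tree, adding the
# ._<nodegroup> suffix so that CFM sends them only to that node group.
# $1 = directory holding the built binaries
# $2 = CFM root (for example /cfmroot)
# $3 = CSM node group that runs the matching kernel
stage_portability_layer() {
    local src="$1" cfmroot="$2" group="$3" dest f
    dest="$cfmroot/usr/lpp/mmfs/bin"
    mkdir -p "$dest" || return 1
    for f in dumpconv lxtrace mmfslinux tracedev; do
        # -p preserves the mode bits so the binaries stay executable
        cp -p "$src/$f" "$dest/$f._$group" || return 1
    done
}
```

If the cluster runs more than one kernel level, the function would be called once per node group, each time with the directory that holds the binaries built for that group's kernel.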
Automating the addition of new nodes to a GPFS cluster
Simply installing the GPFS RPMs and portability layer is not enough to mount and configure file systems within the GPFS cluster on the newly installed nodes. In a small cluster, this could be managed manually. However, scaling up to larger cluster sizes makes it worth automating this step. This can be done using the CSM monitoring capabilities by monitoring for completed new node installations and kicking off a script to configure and mount GPFS on the new node in the cluster.
Listing 1 shows an example script that can be used to configure GPFS. You might need to modify the script slightly for your configuration. Listing 1 provides the basics. The script takes the name of a node (as passed by CSM monitors), adds this to the GPFS cluster, and attempts to start GPFS on that node with some trivial error handling.
Listing 1. Example script for configuring GPFS
#!/bin/bash
# CSM condition/response script to be used as a response to the InstallComplete
# condition. This will attempt to add the node to the GPFS cluster, dealing
# with some common failure conditions along the way. Only trivial attempts are
# made at problem resolution, advanced problems are left for manual
# intervention.
# Note requires the GPFS gpfs-nodes.list file. This file should contain a list
# of all nodes in the GPFS cluster with client/manager and
# quorum/non-quorum details suitable for passing to the mmcrcluster command.
# Output is sent to /var/log/csm/
# Returned error codes:
# 1 - GPFS is already active
# 2 - unable to read the gpfs-nodes.list file
# 3 - node name not present in the gpfs-nodes.list file
# 4 - node is a quorum node
# 5 - unable to add node to cluster (mmaddnode failed)
# 6 - unable to start GPFS on the node (mmstartup failed)

# set this to the location of your node list file
gpfs_node_list=/etc/gpfs-nodes.list
# set this to the interface GPFS is using for communication
gpfs_interface=eth1

PATH=$PATH:/usr/lpp/mmfs/bin # ensure GPFS binaries are in the PATH
log_file=/var/log/csm/`basename $0`.log # log to /var/log/csm/
touch $log_file

# Get the node short name as set by RSCT condition ENV var ERRM_RSRC_NAME
node=`echo $ERRM_RSRC_NAME | cut -d. -f1`

(
[ ! -r "$gpfs_node_list" ] && echo " ** error: cannot read GPFS node list $gpfs_node_list" && exit 2

echo
echo "--- Starting run of `basename $0` for $node at `date`"

# Is the node a quorum node? If so exit.
quorum_status=`grep $node $gpfs_node_list | cut -d: -f2 | cut -d- -f2`
if [ -n "$quorum_status" ]; then
    if [ "$quorum_status" = "quorum" ]; then
        echo "** error: this is a quorum node, stopping..."
        exit 4
    else
        node_s=`grep $node $gpfs_node_list | cut -d: -f1`
    fi
else
    echo "** error: could not find node $node in GPFS node list $gpfs_node_list"
    exit 3
fi

# Find out if the node is already part of the cluster
if mmlscluster | grep $node >/dev/null; then
    # check the node status
    status=`mmgetstate -w $node | grep $node | awk '{print $3}'`
    if [ "$status" = "active" ]; then
        echo "** error: this node already appears to have GPFS active!"
        exit 1
    fi
    # attempt to remove node from cluster
    echo "Node $node is already defined to cluster, removing it"
    # attempt to disable storage interface on node
    if ssh $node ifdown $gpfs_interface; then
        mmdelnode $node
        ssh $node ifup $gpfs_interface
    else
        echo "** error: could not ssh to $node, or ifdown $gpfs_interface failed"
    fi
fi

# try to add node to GPFS cluster
if mmaddnode $node; then
    echo "Successfully added $node to GPFS cluster, starting GPFS on $node"
    if mmstartup -w $node; then
        echo "mmstartup -w $node succeeded"
    else
        echo "** error: cannot start GPFS on $node, please investigate"
        exit 6
    fi
else
    echo "** error: could not add $node to GPFS cluster"
    exit 5
fi
) >>$log_file 2>&1
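Listing 1 looks up the node in the node list with grep, which matches substrings: a short name such as stor001 would also match an entry for stor0010. Where node names can overlap like this, an exact-match lookup is safer. The following is a sketch only. The function name is hypothetical, and it assumes node list lines of the form name_s:designation-quorumstatus, as in the node descriptor file shown later in this article.

```shell
#!/bin/bash
# Sketch only: node_quorum_status is a hypothetical helper.

# Print "quorum" or "nonquorum" for a node, or nothing if it is not listed.
# $1 = node list file, $2 = short node name without the _s suffix
node_quorum_status() {
    local list="$1" node="$2"
    awk -F: -v node="$node" '
        {
            name = $1
            sub(/_s$/, "", name)      # drop the storage-interface suffix
        }
        name == node {
            split($2, role, "-")      # "manager-quorum" -> manager, quorum
            print role[2]
            exit
        }
    ' "$list"
}
```

In Listing 1, the assignment to quorum_status could then become quorum_status=`node_quorum_status $gpfs_node_list $node`, leaving the rest of the logic unchanged.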
You can use CSM to automatically run the script shown in Listing 1 when a new
node has finished the base operating system installation so that when it boots, the
GPFS file systems are automatically mounted. First you need to define the script
as a response mechanism in the CSM monitor. For example:
mkresponse -n SetupGPFS -s /path/to/script/SetupGPFS.sh SetupGPFS
You now have a response called SetupGPFS that will run your script. Next you
should associate this response to the default CSM condition
NodeFullInstallComplete, as follows:
startcondresp NodeFullInstallComplete SetupGPFS
Now CSM will automatically run the script from the management server any time you
install a new node. On the CSM management server you should now be able to see the
NodeFullInstallComplete condition associated with the
SetupGPFS response when you run the
lscondresp command. The condition or response should be
listed as Active.
There is a known issue with the amount of ROM space available on an xSeries 346 that creates PCI allocation errors during boot. Messages indicate that the system ROM space is full, and it has no more room for additional adapters that use ROM space (see Resources for more details).
This problem affects the storage nodes where, if PXE boot is enabled, there is not sufficient space for the Qlogic PCI adapters to initialize properly. One workaround is the following:
- Disable PXE boot on the Broadcom PCI card used for the GPFS network. Using the downloadable diag facility b57udiag -cmd, choose the device and then disable PXE boot.
- Use PXE boot to install the node using CSM, and then disable PXE boot for both onboard adapters using BIOS (hence the order described in the Installing storage nodes in the correct order section).
Another workaround to avoid this issue is to use a RAID 7K card in each xSeries 346. This reduces the amount of ROM the SCSI BIOS uses and allows the Qlogic BIOS to load successfully, even with PXE boot enabled.
Defining HBA configuration on storage servers
The HBAs used on the xSeries 346 storage servers in the example cluster are the
IBM DS4000 FC2-133 Host Bus Adapter (HBA) models. These are also known as Qlogic
2342 adapters. The example uses firmware version 1.43 and, as mentioned in the
previous section, the v7.01.01-fo qla2300 driver. The
-fo on this driver denotes failover, which is not the
default option for this driver. This is enabled by changing the settings in the
/etc/modules.conf on each storage node. This is set
using
CSM during install and is described in the Configuring Qlogic
failover section.
The next section describes the steps needed to update firmware and settings on the HBAs on each storage server and the manual process required on each reinstall to enable load balancing between the two HBAs.
You can download the firmware for the FC2-133 HBAs from the IBM System x support Web site (see Resources). The firmware can be updated using IBM Management Suite Java or using a bootable diskette and the flasutil program.
For the example cluster, the following settings were changed from the default on
the HBAs. These values are in the README provided with the driver download.
You can make this change using the Qlogic BIOS, which you can reach on boot using
<ctrl>-q when prompted, or using the MSJ utility. Here are the settings:
- Host adapter settings
    - Loop reset delay: 8
- Advanced adapter settings
    - LUNs per target: 0
    - Enable target reset: Yes
    - Port down retry count: 12
Installing IBM Management Suite Java
IBM FAStT Management Suite Java (MSJ) is a Java-based GUI application that manages the HBAs in the storage servers. It can be used for configuration and diagnostics. See Resources for a link to download the software.
The example setup uses CSM to install MSJ on every storage node as part of
the GPFS installation. The binary is part of the tar file containing the GPFS
RPMs, which CFM distributes during CSM node installation. A post-installation
script unpacks this tar file and then runs the installation
script contained inside it. The example uses the 32-bit FAStT MSJ in this
installation to avoid potential problems installing the 64-bit version. The
example script uses the following command to install MSJ:
FAStTMSJ*_install.bin -i silent
This installs both the application and the agent. Note that because this is a 32-bit version of MSJ, the code looks for and loads 32-bit versions of some libraries, even during a silent installation. Therefore, the 32-bit version of XFree86-libs must be installed in addition to the 64-bit version included with the base 64-bit installation. The 32-bit libraries are contained in XFree86-libs-4.3.0-78.EL.i386.rpm, which is included in the tar file. The install.sh script installs this RPM and then installs MSJ.
Configuring paths to disk and load balancing
MSJ is required on each storage node to manually configure paths to the arrays on the DS4500s and load balancing between the two HBAs on each computer. If this configuration was not performed, by default the arrays would all be accessed by the first adapter on each computer, HBA0, and consequently the controller A on each DS4500. By spreading the disks between the HBAs, and hence the controllers on the DS4500s, you balance the load and enhance the performance of the back end.
Note that configuration of load balancing is a manual step that must be performed on each storage node each time it is reinstalled. For the example cluster, here are the steps to configure load balancing:
- Open a new window on a local computer from the newly installed server with X forwarding set up (ssh <nodename> -X).
- In one session, run # qlremote.
- In another session, run # /usr/FAStT MSJ & to launch the MSJ GUI.
- From the MSJ GUI, highlight one of the adapters under the HBA tab and choose Configure. A window similar to that shown in Figure 1 appears.
  Figure 1. View of MSJ when selecting a DS4500
- To enable load balancing, highlight the storage subsystem represented by right-clicking the node name, and choose the following from the menu: LUNs > Configure LUNs. The LUN configuration window appears. You can automatically configure load balancing by choosing Tools > Load Balance. You then see a window similar to that shown in Figure 2.
  Figure 2. View of MSJ when configuring failover
- When the logical drives are configured, the LUN configuration window closes, saving the configuration to the host system in the Port Configuration window (which has a default password of config). If the configuration is saved successfully, you see a confirmation. The configuration is saved as a file called /etc/qla2300.conf. New options should have been added to the qla2300 driver line in /etc/modules.conf to indicate that this file exists and should be used.
- Switch back to the window where the qlremote process was started and stop it, using <ctrl>-c. This is an important step.
- To enable the new configuration, reload the driver module qla2300. This cannot be done if the disk is mounted on the Fibre Channel subsystem attached to an adapter that uses this driver. Configure the host adapter driver to be loaded through an initial RAM disk, which applies the configuration data for redundant disks when loading the adapter module at boot time. Note that whenever the configuration of the logical drives changes, this procedure must be followed to save a valid configuration to the system.
One of the most efficient ways to use MSJ in a setup where more than one storage node needs load balancing configured is to keep MSJ open on one node, run qlremote on each of the other nodes, and then use the one MSJ session to connect to the others in the same half.
This section covers in detail the steps taken in the creation of a GPFS cluster. It assumes that all nodes have been installed and configured as described earlier in this article, or that the following configuration has been performed manually:
- GPFS RPMs are installed on each computer.
- PATH has been changed to include the GPFS binary directory.
- A storage interface is configured.
- Root can ssh between nodes without a password.
- Network tuning settings in sysctl are complete.
- NSD servers can see a SAN disk.
You can find a detailed description of the GPFS architecture for the example cluster in the "Storage architecture" section in Part 3 of this series.
Read this section of the article in parallel with the GPFS documentation (see Resources), in particular the following:
- GPFS V2.3 Administration and Programming Reference, which contains
details of many administration tasks and the GPFS commands.
- GPFS V2.3 Concepts, Planning, and Installation Guide, which details
planning considerations for a GPFS cluster and steps to take during installation
of a new cluster.
- GPFS V2.3 Problem Determination Guide, which contains steps to take when troubleshooting, and it contains GPFS error messages.
Producing and installing the portability layer
The GPFS portability layer (PL) is a set of binaries that need to be built
locally from source code to match the Linux kernel and configuration on a computer
that is to be part of a GPFS cluster. For the example cluster, this was done on one of the
storage nodes. The resulting files were copied to each node using CSM and CFM. (See the
Distributing the GPFS portability layer section for more details). This
is a valid method, because all computers are the same architecture and use the same
kernel. The instructions to build the GPFS PL can be found in
/usr/lpp/mmfs/src/README. The process for
the example cluster is as follows:
- Export SHARKCLONEROOT=/usr/lpp/mmfs/src.
- Type cd /usr/lpp/mmfs/src/config, and then cp site.mcr.proto site.mcr.
- Edit the new file site.mcr to match the configuration to be used. Leave the following lines uncommented:
    #define GPFS_LINUX
    #define GPFS_ARCH_X86_64
    LINUX_DISTRIBUTION = REDHAT_AS_LINUX
    #define LINUX_DISTRIBUTION_LEVEL 34
    #define LINUX_KERNEL_VERSION 2042127
  (Note that a # does not indicate a comment.)
- Type cd /usr/lpp/mmfs/src.
- Create the GPFS PL using make World.
- Copy the GPFS PL to the /usr/lpp/mmfs/bin directory using make InstallImages. The GPFS PL consists of the following four files:
    tracedev
    mmfslinux
    lxtrace
    dumpconv
- Copy a set of these files, one for each of the relevant kernels used, into the CSM structure for distribution using CFM.
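When several kernel levels are in use, the LINUX_KERNEL_VERSION value has to be worked out for each one. The value 2042127 in the example appears to encode a kernel release such as 2.4.21-27 as the major number followed by three two-digit fields. The following sketch derives the value from a release string on that assumption. The encoding is inferred from the single example value and the function name is hypothetical, so check the result against /usr/lpp/mmfs/src/README before relying on it.

```shell
#!/bin/bash
# Sketch only: kernel_version_code is a hypothetical helper, and the
# encoding it implements is an assumption inferred from the example value.

# Convert a kernel release string (as printed by uname -r) into the
# numeric form used for LINUX_KERNEL_VERSION in site.mcr.
kernel_version_code() {
    local release="$1" old_ifs="$IFS" extra
    IFS=.-
    # Word splitting is intended: break the release on dots and dashes
    set -- $release
    IFS="$old_ifs"
    # Keep only the leading digits of the fourth field (for example "27")
    extra="${4%%[!0-9]*}"
    printf '%d%02d%02d%02d\n' "$1" "$2" "$3" "${extra:-0}"
}
```

On a node, kernel_version_code "$(uname -r)" prints the value to place on the #define LINUX_KERNEL_VERSION line.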
You create the GPFS cluster for this example in several distinct steps. Although not all of the steps are strictly necessary, this is a good method for dealing with the different types of nodes in the cluster (storage nodes or others).
The first step is to create a cluster containing only the storage nodes and the quorum node: five nodes in total. Use node descriptor files when creating the cluster that contain the short hostnames of the storage interface of all the nodes to be included, followed by the following information:
- Manager or client: Defines whether the node should form part of the pool from
which the configuration and file system managers are picked. The example cluster
includes only the storage nodes in this pool.
- Quorum or nonquorum: Defines whether the node should be counted as a quorum
node. The quorum nodes in the example cluster are the storage nodes and the tiebreaker
node
quor001.
The command to create the cluster is the following:
mmcrcluster -n stor.nodes -C gpfs1 -p stor001_s -s stor002_s -r /usr/bin/ssh -R /usr/bin/scp
- The -n flag specifies the node descriptor file.
- The -C flag sets the name of the cluster.
- The -p flag sets the primary configuration server node.
- The -s flag sets the secondary configuration server node.
- The -r flag sets the full path for the remote shell program to be used by GPFS.
- The -R flag sets the remote file copy program to be used by GPFS.
Here is the stor.nodes node descriptor file used in
the example:
stor001_s:manager-quorum
stor002_s:manager-quorum
stor003_s:manager-quorum
stor004_s:manager-quorum
quor001_s:client-quorum
Use entries similar to
<nodename>_s:client-nonquorum in later
stages for all the other nodes to be added to the cluster, such as compute
nodes, user nodes, and management nodes.
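For a large number of nodes, the client entries are easier to generate than to type. The following sketch reads node names on standard input and prints one descriptor line for each. The function name is hypothetical, and it assumes the _s naming convention for the storage interface used in the example cluster.

```shell
#!/bin/bash
# Sketch only: client_descriptors is a hypothetical helper.

# Read short node names (one per line) on standard input and print a
# client-nonquorum descriptor line for the storage interface of each.
client_descriptors() {
    local name
    while read -r name; do
        [ -n "$name" ] || continue    # skip blank lines
        printf '%s_s:client-nonquorum\n' "$name"
    done
}
```

For example, printf 'comp001\ncomp002\n' | client_descriptors prints two descriptor lines, which could be redirected into a node descriptor file for the later stages.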
Enabling unmountOnDiskFail on quorum node
The next step is to enable the unmountOnDiskFail option on the tiebreaker
node using mmchconfig unmountOnDiskFail=yes quor001_s.
This prevents false disk errors in the SAN configuration from being reported to
the file system manager.
The next step is to create the disks used by GPFS using the command
mmcrnsd -F disc#.desc. Running this command creates a global
name for each disk, which is a necessary step, because disks might have different /dev names on
each node in the GPFS cluster. Run this command on all disks to be used
for the GPFS file system. At this point, define the primary and secondary NSD
servers for each disk; these are used for I/O operations on behalf of the NSD
clients, which have no local access to the SAN storage.
The -F flag is used to point to a file containing disk
descriptors for disks to be defined as NSDs. For manageability in the example
cluster, complete this process once on the LUNs presented by each DS4500 and
once on the tiebreaker disk. Each array or LUN on each DS4500 has a descriptor in the
files used. Following is an example line from disk1.desc:
sdc:stor001_s:stor002_s:dataAndMetadata:1:disk01_array01S
Following are the fields in this line, in order:
- Local disk name on primary NSD server
- Primary NSD server
- Secondary NSD server
- Type of data
- Failure group
- Name of resulting NSD
By using the above descriptor files, define the following three failure groups in this configuration:
- The disks in the first DS4500, that is disk01.
- The disks in the second DS4500, that is disk02.
- The tiebreaker disk on the quorum node.
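To make the six-field descriptor layout concrete, here is a minimal sketch that assembles one descriptor line from its parts. The function name is hypothetical, and the data type is fixed to dataAndMetadata only because that is what the example line from disk1.desc uses. Other disks, such as the tiebreaker disk, may need a different value.

```shell
#!/bin/sh
# Hypothetical helper: build one NSD disk descriptor line.
# Arguments: local device, primary NSD server, secondary NSD server,
#            failure group number, desired NSD name.
nsd_descriptor() {
    printf '%s:%s:%s:dataAndMetadata:%s:%s\n' "$1" "$2" "$3" "$4" "$5"
}

# Reproduces the example line from disk1.desc.
nsd_descriptor sdc stor001_s stor002_s 1 disk01_array01S
```

Calling the function once per LUN and redirecting the output to a file yields a descriptor file in the form that mmcrnsd -F reads.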
The next step is to start GPFS cluster-wide following these steps:
- Start GPFS on all of the NSD servers at the same time to prevent NSDs from being marked as
down. Use the following command:
mmstartup -w stor001_s,stor002_s,stor003_s,stor004_s
- Start GPFS on all other nodes that are not NSD servers (including the tiebreaker
node). Use the following command:
mmstartup -w quor001_s,mgmt001_s,...
- Start GPFS on all compute nodes from the management node. Use the following
command:
dsh -N ComputeNodes /usr/lpp/mmfs/bin/mmstartup
- Check the status of all nodes by monitoring the
/var/adm/log/mmfs.log.latest file on the current file system manager (found using the command mmlsmgr <filesystem>) and the output from the following:
mmgetstate -w <nodenames>
dsh -N ComputeNodes /usr/lpp/mmfs/bin/mmgetstate
This method might seem overly cautious, but it has been chosen as a scalable method
that will work for a very large cluster. An alternative to the steps above is to
use the command mmstartup -a. This works for smaller
clusters, but it can take a long time to return for a larger cluster where nodes
might be unreachable for various reasons, such as network issues.
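The staged startup could be scripted along the following lines. This is only a sketch: the function names are hypothetical, and the script prints the commands rather than running them so that the plan can be reviewed first. The second node list contains only the two names given in the example; the article's list continues with further non-NSD nodes that are not spelled out here.

```shell
#!/bin/sh
# Join node names with commas, the form used with mmstartup -w in the example.
join_nodes() {
    list=""
    for n in "$@"; do
        list="${list:+$list,}$n"
    done
    printf '%s\n' "$list"
}

# Print (do not run) the staged startup commands in order.
staged_startup_plan() {
    # Stage 1: all NSD servers at the same time.
    echo "mmstartup -w $(join_nodes stor001_s stor002_s stor003_s stor004_s)"
    # Stage 2: other non-NSD-server nodes, including the tiebreaker node.
    echo "mmstartup -w $(join_nodes quor001_s mgmt001_s)"
    # Stage 3: compute nodes, driven from the management node.
    echo "dsh -N ComputeNodes /usr/lpp/mmfs/bin/mmstartup"
}

staged_startup_plan
```

Printing the plan first keeps the ordering explicit. Each line can then be run by hand, checking mmgetstate between stages.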
For the example, one large GPFS file system is created using all the NSDs
defined to GPFS. Note that the command used takes as an argument the altered disk
descriptor files from the mmcrnsd command above. This
requires that you concatenate the output from each step in the creation of the
NSDs into one file.
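The concatenation step might look like the following sketch. The function name is hypothetical, and so are any input file names other than disk1.desc and disk_all.desc, which are the only descriptor file names given in the example.

```shell
#!/bin/sh
# Hypothetical helper: combine rewritten descriptor files into one file.
# Usage: concat_descriptors <output file> <input file>...
concat_descriptors() {
    out=$1
    shift
    # Refuse to write anything if an input file is missing or unreadable.
    for f in "$@"; do
        [ -r "$f" ] || { echo "missing descriptor file: $f" >&2; return 1; }
    done
    cat "$@" > "$out"
}

# Example call (file names other than disk1.desc and disk_all.desc are assumed):
#   concat_descriptors disk_all.desc disk1.desc disk2.desc tiebreaker.desc
```

Checking the inputs before writing avoids handing mmcrfs a descriptor file that silently lacks some of the NSDs.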
The example cluster uses the following settings:
- All NSDs (set using -F)
- Mountpoint: /gpfs
- Automount: yes (set using -A)
- Blocksize: 256KB (set using -B)
- Replication: two copies of both data and metadata (set using -m, -M, -r, -R)
- Estimated number of nodes mounting file system: 1200 (set using -n)
- Quotas enabled (set using -Q)
Here is the complete command:
mmcrfs /gpfs /dev/gpfs -F disk_all.desc -A yes -B 256K -m 2 -M 2 -r 2 -R 2 -n 1200 -Q yes
After /gpfs is created, mount it manually the
first time. After this, with automount enabled, it mounts automatically when a
node starts GPFS.
The -Q flag on the above
mmcrfs command enables quotas on the
/gpfs file system. Quotas can be defined for individual
users or groups of users. A default quota level has also been set that applies to
any new user or group. Default quotas are turned on using the command
mmdefquotaon. Default quotas are edited using the
command mmdefedquota. This command opens an edit window
in which you can specify the limits. Following is an example of setting limits for the
quota:
gpfs: blocks in use: 0K, limits (soft = 1048576K, hard = 2097152K)
inodes in use: 0, limits (soft = 0, hard = 0)
You can edit specific quotas for a user or group using the command
mmedquota -u <username>. A user can
display their own quota by using the command mmlsquota. The
superuser can display the status of the quotas for the file system using the
command mmrepquota gpfs.
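The block limits in the quota edit window are expressed in kilobytes, so the example values correspond to a 1 GB soft limit and a 2 GB hard limit. A small conversion like the following sketch (the function name is hypothetical) can help when choosing other limits:

```shell
#!/bin/sh
# Convert a size in gigabytes (binary, 1 GB = 1024 * 1024 KB) to kilobytes.
gb_to_kb() {
    echo $(( $1 * 1024 * 1024 ))
}

# The example default quota: 1 GB soft limit, 2 GB hard limit.
echo "soft = $(gb_to_kb 1)K, hard = $(gb_to_kb 2)K"
```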
This cluster is configured so that GPFS starts automatically whenever a server
boots by adding an entry in /etc/inittab using the
command mmchconfig autoload=yes.
GPFS uses the pagepool to cache user data and file system
metadata. The pagepool mechanism allows GPFS to
implement read, as well as write, requests asynchronously. Increasing the size of
pagepool increases the amount of data or metadata that
GPFS can cache without requiring synchronous I/O. The default value for pagepool
is 64 MB. The maximum GPFS pagepool size is 8 GB. The
minimum allowed value is 4 MB. On Linux systems, the maximum
pagepool size is half of the physical memory in the
computer.
The optimal size of the pagepool depends on the needs
of the application and effective caching of its re-accessed data. For systems
with applications that access large files, reuse data, benefit from GPFS prefetching
of data, or have a random I/O pattern, increasing the value for
pagepool might prove beneficial. However, if the value is
set too high, GPFS will not start.
For the example cluster, use the value of 512 MB for
pagepool for all nodes in the cluster.
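Because GPFS will not start if the value is too high, it can be worth checking a candidate value against the limits quoted above before applying it. The following is a minimal sketch with hypothetical function names; the 4096 MB node in the example call is an assumption, not a figure from the example cluster.

```shell
#!/bin/sh
# Succeeds if a pagepool of $1 MB is within the limits quoted above for a
# node with $2 MB of physical memory: at least 4 MB, at most 8 GB, and on
# Linux at most half of physical memory.
pagepool_ok() {
    want_mb=$1
    mem_mb=$2
    [ "$want_mb" -ge 4 ] &&
        [ "$want_mb" -le 8192 ] &&
        [ "$want_mb" -le $(( mem_mb / 2 )) ]
}

# Physical memory of the local Linux node in MB, read from /proc/meminfo.
local_mem_mb() {
    awk '/^MemTotal:/ { print int($2 / 1024) }' /proc/meminfo
}

# Example: the 512 MB used in the example cluster on an assumed 4096 MB node.
if pagepool_ok 512 4096; then
    echo "pagepool of 512 MB is within the limits"
else
    echo "pagepool of 512 MB is outside the limits"
fi
```

On a real node, pagepool_ok 512 "$(local_mem_mb)" performs the same check against the actual memory size. The value itself is then set with mmchconfig, for example mmchconfig pagepool=512M; confirm the syntax against the documentation for your GPFS release.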
Optimizing with network settings
To optimize the performance of the network and, hence, GPFS, enable jumbo frames
by setting the MTU size for the adapter for the storage network to 9000. Keep
/proc/sys/net/ipv4/tcp_window_scaling enabled, which
is the default setting. The TCP window settings are tuned using CSM scripts
at installation time to add the following lines to the
/etc/sysctl.conf file on both the NSD servers and NSD
clients:
# increase Linux TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# increase default and maximum Linux TCP buffer sizes
net.ipv4.tcp_rmem = 4096 262144 8388608
net.ipv4.tcp_wmem = 4096 262144 8388608
# increase max backlog to avoid dropped packets
net.core.netdev_max_backlog=2500
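The CSM scripts used in the example cluster are not reproduced here. As an illustration only, a post-installation script could add these settings along the following lines. The function names are hypothetical, and each line is appended only if it is not already present, so the script can be rerun safely.

```shell
#!/bin/sh
# Append a line to a file only if that exact line is not already there.
add_sysctl_line() {
    file=$1
    line=$2
    grep -qxF -e "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Add the TCP tuning values listed above to the given file
# (defaults to /etc/sysctl.conf when no argument is passed).
tune_tcp_buffers() {
    conf=${1:-/etc/sysctl.conf}
    add_sysctl_line "$conf" "net.core.rmem_max = 8388608"
    add_sysctl_line "$conf" "net.core.wmem_max = 8388608"
    add_sysctl_line "$conf" "net.ipv4.tcp_rmem = 4096 262144 8388608"
    add_sysctl_line "$conf" "net.ipv4.tcp_wmem = 4096 262144 8388608"
    add_sysctl_line "$conf" "net.core.netdev_max_backlog=2500"
}
```

Running tune_tcp_buffers as root, followed by sysctl -p, would write the values and load them without a reboot.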
Storage server cache settings can impact GPFS performance if they are not set correctly. The example uses the following settings on the DS4500s, as recommended in the GPFS documentation:
- Read cache: enabled
- Read ahead multiplier: 0
- Write cache: disabled
- Write cache mirroring: disabled
- Cache block size: 16K
That is it! You should have successfully installed a large Linux cluster following the example in this series of articles. Apply the principles to your own installation for another successful large Linux cluster installation.
Learn
- Explore the first three parts of this series:
- Installing a large Linux cluster, Part 1: Introduction and hardware configuration
- Installing a large Linux cluster, Part 2: Management server configuration and node installation
- Installing a large Linux cluster, Part 3: Storage and shared file systems
- See Retain Tip H183415 at the IBM PC Support Web site for more details on the ROM overflow problem.
- Refer to the IBM GPFS documentation library.
- See the IBM TotalStorage DS4500 system reference materials.
- Check out the IBM TotalStorage DS4000 EXP710 fiber channel storage expansion unit reference materials.
- Find the IBM TotalStorage SAN Switch H16 switch reference materials.
Get products and technologies
- Get firmware for the FC2-133 HBAs from the IBM System x support Web site.
- Download IBM FAStT Management Suite Java (MSJ) from the IBM DS4500 download page.
- Get the latest version of Storage Manager for your hardware from the DS4500 download page.

Graham White is a systems management specialist in the Linux Integration Centre within Emerging Technology Services at the IBM Hursley Park office in the United Kingdom. He is a Red Hat Certified Engineer, and he specializes in a wide range of open-source, open-standard, and IBM technologies. Graham's areas of expertise include LAMP, Linux, security, clustering, and all IBM Systems hardware platforms. He received a BSc with honors in Computer Science with Management Science from Exeter University in 2000.

Mandie Quartly is an IT specialist with the IBM UK Global Technology Services team. Mandie performs a cross-brand role, with current experience in both Intel and POWER™ platform implementations as well as AIX and Linux (Red Hat and Suse). She specializes in the IBM product General Parallel File System (GPFS). She received a PhD in astrophysics from the University of Leicester in 2001.





