Installing a large Linux cluster, Part 3
Storage and shared file systems
Large Linux cluster storage backend
This is the third in a series of articles that cover the installation and setup of a large Linux computer cluster. The purpose of the series is to bring together in one place up-to-date information from various sources in the public domain about the process required to create a working Linux cluster from many separate pieces of hardware and software. These articles are not intended to provide the basis for the complete design of a new large Linux cluster; refer to the relevant reference materials and Redbooks™ mentioned throughout for general architecture pointers.
This series is written for systems architects and systems engineers who plan and implement a Linux cluster using the IBM eServer Cluster 1350 framework (see Related topics for more information about the framework). Some parts might also be relevant to cluster administrators for educational purposes and during normal cluster operation. Each part of the series refers to the same example installation.
Part 1 of the series provides detailed instructions for setting up the hardware for the cluster. Part 2 takes you through the next steps after hardware configuration: software installation using the IBM systems management software, Cluster Systems Management (CSM), and node installation.
This third part is the first of two articles that describe the storage backend of the cluster. Together, these two articles cover the storage hardware configuration and the installation and configuration of the IBM shared file system, General Parallel File System (GPFS). This third part takes you through the architecture of the storage system, hardware preparation, and details about setting up a Storage Area Network. The fourth and final part of the series provides details about CSM specifics related to the storage backend of our example cluster, notably performing node installation for the storage system, and GPFS cluster configuration.
Before continuing, you will benefit from reviewing the General cluster architecture section in Part 1 of this series.
Figure 1 shows an overview of the storage configuration used for the example cluster described in this series. The configuration is explained in more detail throughout this article. This setup is based on GPFS version 2.3. It includes one large GPFS cluster split into two logical halves with a single large file system. The example design provides resilience in case of a disaster where, if one half of the storage backend is lost, the other can continue operation.
Figure 1. Storage architecture overview
Figure 1 shows four storage servers that manage the storage provided by two disk subsystems. In the top right-hand corner, you can see a tie-breaker server. The network connections and fiber channel connections are shown for reference. All are described in further detail in the following sections. The rest of the cluster is shown as a cloud and will not be addressed in this article. For more details about the rest of the cluster, see Part 1 and Part 2 of this series.
The majority of the nodes within this GPFS cluster are running Red Hat Enterprise Linux 3. The example uses a server/client architecture, where a small subset of servers has direct access to the storage over fiber channel. They act as network shared disk (NSD) servers to the rest of the cluster. This means that most of the members of the GPFS cluster access the storage over IP using the NSD servers. There are four NSD nodes (also known here as storage nodes) in total: two in each logical half of the GPFS cluster. These are grouped into pairs, where each pair manages one of the storage subsystems.
As each half of the cluster contains exactly the same number of nodes, should one half be lost, quorum becomes an issue. With GPFS, for the file system to remain available, a quorum of nodes needs to be available. Quorum is defined as:
Quorum = (number of quorum nodes / 2) + 1
In a case such as this configuration, where the cluster is made of two identical halves, the GPFS file system becomes unavailable if either half is lost. To avoid this situation, the system employs a tie-breaker node. This node is physically located away from the main cluster. This means that should either half become unavailable, the other half can continue accessing the GPFS file system. This is also made possible by the use of three failure groups, which are further explained under Data replication. This means two copies of the data are available: one in each half of the cluster.
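The effect of the tie-breaker node on quorum can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming that the division in the quorum formula rounds down and that each half contributes two quorum nodes; both the node counts and the function names are illustrative and are not taken from the example cluster.

```python
def quorum(quorum_nodes):
    """Minimum number of quorum nodes that must be up (division rounds down)."""
    return quorum_nodes // 2 + 1


def has_quorum(nodes_up, quorum_nodes):
    """True if enough quorum nodes are up to keep the file system available."""
    return nodes_up >= quorum(quorum_nodes)


# Hypothetical size: two quorum nodes in each half of the cluster.
per_half = 2

# Without a tie-breaker: 4 quorum nodes, and losing one half leaves 2.
print(has_quorum(per_half, 2 * per_half))          # False

# With a tie-breaker: 5 quorum nodes, and losing one half leaves 3.
print(has_quorum(per_half + 1, 2 * per_half + 1))  # True
```

With an even number of quorum nodes split equally, the surviving half can never reach the required count on its own; one extra node located away from both halves is enough to change the result.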
As illustrated in Figure 1, each node is connected to two networks. The first of these is used for compute traffic and general cluster communication. The second network is dedicated to GPFS and is used for storage access over IP for those nodes that do not have a direct view of the Storage Area Network (SAN) storage system. This second network uses jumbo frames for performance. See the GPFS network tuning section in Part 4 of the series for more details on the storage network.
Storage Area Network storage
The storage backend of this solution consists of two IBM TotalStorage DS4500 (formerly FAStT 900) disk subsystems, each with a number of fully populated EXP710 expansion disk drawers attached. Each DS4500 is configured into RAID 5 4+P arrays plus some hot spare disks.
Each DS4500 is owned by a pair of storage servers. The architecture splits the 4+P arrays between the two servers so that each server is the primary server for the first half of the arrays and the secondary server for the other half of the arrays. This way, should one of the servers fail, the other server can take over as primary for the disks from the failed server.
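The primary and secondary split described above can be sketched as a simple mapping. This is an illustration only: the server names stor01 and stor02, the array names, and both function names are hypothetical, and the code models the ownership plan rather than any GPFS or DS4500 interface.

```python
def assign_servers(arrays, server_a, server_b):
    """Map each array to a (primary, secondary) pair of storage servers."""
    half = len(arrays) // 2
    plan = {}
    for index, array in enumerate(arrays):
        if index < half:
            plan[array] = (server_a, server_b)
        else:
            plan[array] = (server_b, server_a)
    return plan


def serving_node(plan, array, failed=()):
    """Return the server that currently serves an array, given failed servers."""
    primary, secondary = plan[array]
    if primary not in failed:
        return primary
    if secondary not in failed:
        return secondary
    raise RuntimeError(f"no server available for {array}")


arrays = [f"array{n}" for n in range(1, 5)]
plan = assign_servers(arrays, "stor01", "stor02")
print(plan["array1"])                                   # ('stor01', 'stor02')
print(serving_node(plan, "array1", failed={"stor01"}))  # stor02
```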
This example has GPFS replicate the data and metadata on the GPFS file system. The storage is split into three failure groups. A failure group is a set of logical disks that share a common point of failure. (As seen from the operating system, here a disk corresponds to one LUN, which is one disk array on the DS4500.) The failure groups in this system are made up of the following:
- One DS4500 system in failure group one
- One DS4500 system in failure group two
- A local disk belonging to the tie-breaker node
When you create the GPFS file system, specify the number of copies of data and metadata as two. With the failure groups defined above, each half then contains one copy of the file system. The third failure group is required to solve disk quorum issues, so that should either half of the storage go offline, disk quorum is still satisfied and the file system remains accessible.
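To see why the third failure group matters, the situation can be modeled in a few lines. This is a deliberately simplified sketch, not a description of GPFS internals: it assumes that disk quorum means a majority of the failure groups is reachable, and that the tie-breaker disk in group three holds no copy of the data. The names are illustrative.

```python
DATA_GROUPS = {1, 2}    # the two DS4500 systems, one copy of the data each
ALL_GROUPS = {1, 2, 3}  # group 3 is the local disk on the tie-breaker node


def accessible(failed):
    """True if a data copy survives and a majority of failure groups is up."""
    up = ALL_GROUPS - set(failed)
    has_copy = bool(DATA_GROUPS & up)
    has_disk_quorum = len(up) > len(ALL_GROUPS) / 2
    return has_copy and has_disk_quorum


print(accessible({1}))     # True: group 2 holds a copy, groups 2 and 3 are up
print(accessible({1, 3}))  # False: only one of three groups is left
```

Under these assumptions, with only the two DS4500 groups defined, the loss of either one would leave exactly half of the groups, which is not a majority.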
As mentioned, this cluster contains two IBM TotalStorage DS4500 devices, which form the storage backend of the solution. You can find more information about this hardware under Related topics.
IBM couples each DS4500 system with IBM TotalStorage DS4000 EXP710 fiber channel (FC) storage expansion units. Each of these is a 14-bay, 2 Gbps rack-mountable FC enclosure. You can find more details about this hardware in the Related topics section.
The following section covers in some detail the configuration of the DS4500 and EXP710 units within the example solution.
Order of powering on and off
Note that you need to power on and off the SAN system in a specific order so that all storage is discovered correctly. Perform powering on in the following order:
- SAN switches (and allow them to fully initialize)
- EXP 710 drawers
- DS4500 (and allow it to fully initialize)
- Storage servers
Power off in the opposite order, as follows:
- Storage servers
- DS4500
- EXP 710 drawers
- SAN switches
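If you keep the power sequence in a runbook or script, it helps to store only the power-on order and derive the power-off order from it, so the two cannot drift apart. A minimal sketch follows; the names POWER_ON_ORDER and power_off_order are illustrative.

```python
# Power-on order for the SAN system, first to last.
POWER_ON_ORDER = [
    "SAN switches",
    "EXP 710 drawers",
    "DS4500",
    "Storage servers",
]


def power_off_order(on_order):
    """Power-off order is the exact reverse of the power-on order."""
    return list(reversed(on_order))


for step, component in enumerate(power_off_order(POWER_ON_ORDER), start=1):
    print(f"{step}. {component}")
```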
Figure 2 shows the rear of a DS4500 unit. On the left-hand side are four mini-hub ports for host connectivity. In this article, these are referred to as slots 1 to 4, numbered from left to right, as shown in Figure 1. Slots 1 and 3 correspond to the top controller, which is controller A. Slots 2 and 4 correspond to the bottom controller, which is controller B. On the right-hand side are four mini-hub ports for expansion drawer (EXP710) connectivity.
Figure 2: Rear view of a DS4500
Each DS4500 is cabled into two loops as shown in Figure 3.
Figure 3: Example cabling for a DS4500 and EXP drawers
Set EXP enclosure IDs
Each EXP 710 drawer must have a unique ID. These are set using the panel on the back of each enclosure.
Configure IP addresses for DS4500 controllers
Set the IP address of each controller using the serial port at the back of each enclosure. You can use a terminal program such as HyperTerminal on Windows® or minicom on Linux. The example uses the following settings:
- baud 38400
- bits 8
- parity no
- stop bits 1
- flow xon/xoff
Make the connection by sending a break (Ctrl-Break in HyperTerminal), then pressing the space bar to set the speed. Then send another break and press the Escape key to enter the shell, logging in with the controller's default password. Use the command netCfgShow to show the current IP settings of the controller, and then set the desired IP address, subnet mask, and gateway.
Discover DS4500 from Storage Manager
After this point, the DS4500 is managed using the Storage Manager (SM) software. Use the latest version (9.1 or higher) with new hardware.
You can use Storage Manager to:
- Configure arrays and logical drives
- Assign logical drives to storage partitions
- Replace and rebuild failed disk drives
- Expand the size of arrays
- Convert from one RAID level to another
You can also troubleshoot and perform management tasks, such as checking the status of the TotalStorage subsystem and updating the firmware of RAID controllers. See Related topics for the latest version of Storage Manager for your hardware.
The SM client can be installed on a variety of operating systems. In the example described in the article, the SM client is installed on the management server. Discover the newly configured DS4500 from the SM client using the first button on the left, which has a wand on it. To perform operations on a DS4500 seen through this interface, double-click the computer name to open a new window.
General DS4500 controller configuration steps
First, rename the DS4500 by going to Storage Subsystem > Rename… and entering a new name. Next, go to Storage Subsystem > Set Controller Clock and check that the controller clocks are synchronized. Finally, set the system password by going to Storage Subsystem > Change > Password.
Update firmware for DS4500 and EXP 710 drawers
To check system firmware levels from the Storage Manager, go to Advanced > Maintenance > Download > Firmware. The current levels are listed at the top of this window. You can download newer versions onto the computer from here, but be sure to use the correct firmware for the model and to upgrade levels in the order specified in any notes that come with the firmware code. The firmware for the disks and the ESMs can also be checked from the Download menu.
Manual configuration versus scripted configuration
The following sections detail the manual setup of a DS4500. Follow these steps for the initial configuration of one of the DS4500s in this solution, and then save the configuration of that first DS4500. Saving produces a script that you can use to reproduce the configuration on the same DS4500 should it be reset or replaced with new hardware.
You can copy this script and edit it for use on the other DS4500 to allow easy and accurate reproduction of a similar setup. You need to change the fields containing the name for the DS4500, disk locations, array names, and mapping details for hosts (that is, the World Wide Port Names [WWPNs]). Note that these scripts leave the Access LUN in the host group definition; remove it manually on each DS4500.
Create hot spare disks
This example reserves a number of disks on each DS4500 as hot spares. Add a hot spare by right-clicking the disk to be assigned, choosing the manual option, and entering the password for the DS4500 (set in the General DS4500 controller configuration section).
Create disk arrays
- Right-click an unassigned disk to be added to the array, and choose Create Logical Drive.
- Click Next in the wizard that appears.
- Choose RAID level 5. The original drive is already selected.
- Add the four other drives to the array to make five in total.
- Click OK on the Array Success window to create a logical drive on this array.
- Choose the default option, where the whole of the LUN is used for one logical drive. The naming convention used for the logical drive name is <ds4500 name>_array<number>. Under Advanced Parameters, choose Customize Settings.
- In I/O Characteristics type, use the default, which is File System, and choose the preferred slot so that the arrays alternate between A and B. In this example, there are odd numbered arrays on slot A and even numbered arrays on slot B.
- Choose Map Later to return to mapping at a later time.
You see a green cylinder with a clock next to it while you create this array. You can check your progress by right-clicking on the logical drive name and choosing Properties.
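Before working through the wizard for every array, it can help to write down the intended logical drive names and preferred controllers. The sketch below generates such a plan from the naming convention and the odd/even rule used in this example; the DS4500 name ds4500a, the array count, and the function name are hypothetical.

```python
def plan_logical_drives(ds4500_name, array_count):
    """List (logical drive name, preferred controller) for each array."""
    plan = []
    for number in range(1, array_count + 1):
        name = f"{ds4500_name}_array{number}"
        # Odd-numbered arrays prefer controller A, even-numbered prefer B.
        controller = "A" if number % 2 == 1 else "B"
        plan.append((name, controller))
    return plan


for name, controller in plan_logical_drives("ds4500a", 4):
    print(f"{name}: preferred controller {controller}")
```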
Note that the steps beyond this point require that you have configured the SAN switches and installed and run the storage servers with the host bus adapters (HBAs) configured so that WWPNs of the HBAs are seen at the SAN switches and, therefore, by the DS4500. See the SAN infrastructure and the HBA configuration sections in Part 4 of the series for details about these steps.
Storage partitioning and disk mapping
Once LUNs are created, they need to be assigned to hosts. In this example, use storage partitioning. Define storage partitions by creating a logical-drive-to-LUN mapping. This grants a host or host group access to a particular logical drive. Perform these steps in order when defining storage partitioning. You will initially define the topology and then the actual storage partition:
- Define the host group.
- Define the hosts within this group.
- Define the host ports for each host.
- Define the storage partition.
As already described, in this setup there is only one host group per DS4500, containing the two storage nodes between which all disks on that DS4500 will be twin tailed. All LUNs are assigned to this group, with the exception of the Access LUN, which must not be assigned to this group. The Access LUN is used for in-band management of the DS4500. However, it is not supported by Linux and must be removed from any host groups created.
Create a new host group by right-clicking the Default Group section and selecting Define New Host Group. Enter the host group name. Create a new host by right-clicking the host group created and selecting Define Host Port. In the pull-down menu, select the WWPN corresponding to the HBA to be added. Note that for the WWPN to appear in this menu, you must have configured and zoned the host correctly in the SAN. Storage Manager will then see the port under Show All Host Port Information. Choose Linux as the host type, and enter the host port name in the final box.
Repeat this step so that each host has both ports defined. Next, create the storage partition by right-clicking the newly created host group and selecting Define Storage Partition. This opens the Storage Partitioning wizard. Click Next to start the wizard. Select the Host Group you just created, and click Next. Choose the LUNs you previously defined to include them here. Note that you must not include the Access LUN here. Click Finish to finalize this selection.
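The topology built in these steps (one host group, two hosts, two host ports per host, and the mapped logical drives) can be written down and checked before you enter it in Storage Manager. The following is a sketch of such a check, not a Storage Manager interface: the class names, the group and host names, the label used for the Access LUN, and the WWPN values are all made up for illustration.

```python
from dataclasses import dataclass, field

ACCESS_LUN = "Access"  # label used here for the Access LUN; an assumption


@dataclass
class Host:
    name: str
    ports: list = field(default_factory=list)  # WWPNs of the host's HBAs


@dataclass
class HostGroup:
    name: str
    hosts: list = field(default_factory=list)
    logical_drives: list = field(default_factory=list)


def check_partition(group):
    """Return a list of problems found in a planned storage partition."""
    problems = []
    if len(group.hosts) != 2:
        problems.append(f"{group.name}: expected 2 hosts, found {len(group.hosts)}")
    for host in group.hosts:
        if len(host.ports) != 2:
            problems.append(f"{host.name}: expected 2 host ports, found {len(host.ports)}")
    if ACCESS_LUN in group.logical_drives:
        problems.append(f"{group.name}: the Access LUN must not be mapped")
    return problems


# Placeholder WWPNs; real values come from the HBAs in the storage servers.
group = HostGroup(
    name="ds4500a_group",
    hosts=[
        Host("stor01", ["10:00:00:00:00:00:00:01", "10:00:00:00:00:00:00:02"]),
        Host("stor02", ["10:00:00:00:00:00:00:03", "10:00:00:00:00:00:00:04"]),
    ],
    logical_drives=["ds4500a_array1", "ds4500a_array2", ACCESS_LUN],
)
for problem in check_partition(group):
    print(problem)
```

In this sample plan the only problem reported is the Access LUN, which mirrors the manual removal step described above.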
This section explains the steps to set up the SAN infrastructure in a cluster. The SAN switches used in the example configuration are IBM TotalStorage SAN Switch H16 switches (2005-H16). See Related topics for more details about this hardware.
The steps cover the configuration of the SAN switches in some detail, referring specifically to the commands and interfaces of the H16 switches as examples.
Configure IP addresses and hostnames for H16 SAN switches
To perform the initial configuration of the IP addresses on the H16 SAN switches, connect using the serial cable that comes with the switch (black ends, not null modem) into the port at the back of the computer. Use these connection settings:
- 9600 baud
- 8 data bits
- No parity
- 1 stop bit
- No flow control
Use the default login details: username admin and password password. Change the hostname and IP address using the command ipAddrSet, and then verify the settings.
Once the IP addresses are configured, you can manage the SAN switches with the Web interface. Connect to a SAN switch using the IP address with a browser with a Java™ plugin. To access the Admin interface, click the Admin button and enter the username and password. At this point, you can enter the new name of the switch into the box indicated and apply the changes.
The domain ID must be unique for every domain in a fabric. In this example, the switches are contained in their own fabric, but the IDs are changed in case of future merges. Note that the switch needs to be disabled before you can change the domain ID.
For future reference, once the network can access the switch, you can change the IP address of the SAN switch using the Admin interface from the Network Config tab. This is an alternative to using a serial connection.
SAN switch zoning
The example cluster uses the following zoning rules:
- HBA0 (QLogic fiber card in PCI slot 3) on all hosts zoned to see controller A (slots 1 and 3) of the DS4500
- HBA1 (QLogic fiber card in PCI slot 4) on all hosts zoned to see controller B (slots 2 and 4) of the DS4500
You set the zoning of the SAN switches using the Web interface on each switch as described in the previous section. The zoning page can be reached using the far right button in the group in the bottom left-hand corner of the window. To simplify the management of zoning, assign aliases to each WWPN to identify the device attached to the port.
Here is how to create the aliases and assign them to hosts. First, add an alias by clicking Create and entering the name of the alias. Then, choose a WWPN to assign to this newly created alias. You see three levels of detail at each port, as follows:
- The host WWN
- The WWPN
Add the second level to the alias by choosing the second level and selecting Add member.
Once you create aliases, the next step is to create zones by combining groups of aliases. This configuration uses zones where each HBA on each host sees only one controller on the relevant DS4500. As explained in the previous section, in this example setup, each DS4500 presents its disks to only two hosts. Each host uses a different connection to the controller to spread the load and maximize the throughput. This type of zoning is known as single HBA zoning. All hosts are isolated from each other at the SAN level. This zoning removes unnecessary PLOGI activity from host to host and removes the risk of a faulty HBA affecting the others. As a result, management of the switch becomes safer, because modifying an individual zone does not affect the other hosts. When you add a new host, create new zones for it instead of adding the host to an existing zone.
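The zoning rules can be expressed as a small generator that produces one zone per HBA, each containing a single host alias and a single controller port alias. This is a sketch under stated assumptions: the alias and zone naming scheme, the host names, and the DS4500 name are invented for illustration, and the choice of which mini-hub slot each host uses (first host on slots 1 and 2, second host on slots 3 and 4) is one reading of "each host uses a different connection to the controller", not something the example configuration spells out.

```python
# Mini-hub slots per controller, and the controller each HBA is zoned to.
CONTROLLER_SLOTS = {"A": [1, 3], "B": [2, 4]}
HBA_CONTROLLER = {"hba0": "A", "hba1": "B"}


def build_zones(ds4500, hosts):
    """Return {zone name: [host HBA alias, DS4500 slot alias]} for all hosts."""
    zones = {}
    for index, host in enumerate(hosts):
        for hba, controller in HBA_CONTROLLER.items():
            slots = CONTROLLER_SLOTS[controller]
            # Assumption: hosts take turns across the slots of a controller.
            slot = slots[index % len(slots)]
            zone = f"{host}_{hba}_{ds4500}_{controller}"
            zones[zone] = [f"{host}_{hba}", f"{ds4500}_slot{slot}"]
    return zones


zones = build_zones("ds4500a", ["stor01", "stor02"])
for zone, members in zones.items():
    print(zone, members)
```

Because every zone holds exactly one HBA alias, adding a host only ever adds zones and never changes the membership of an existing one.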
The final step is to add the zones defined into a configuration that can be saved and then activated. It is useful to produce a switch report, which you can do by clicking the Admin button and then choosing Switch Report. This report contains, in HTML format, all the information you need to manually recreate the configuration of the switch.
Saving configuration to another server
Once the SAN switch is configured, you can upload the configuration to another server using FTP. If necessary, you can later use the saved file to reconfigure the switch automatically. Here are the steps to save the configuration file to a server:
- Set up and start ftp on the server to receive the file.
- Log into the SAN switch as admin (default password is password) using telnet.
- Enter the configuration upload command.
- Enter the information required: IP address, account and password, and name and location of the file to be created.
You can update the switch firmware by downloading it from an FTP server. Here are the steps to follow:
- Set up the FTP server and uncompress the firmware tar package into the ftp directory.
- Use the firmwareshow command to check the current firmware level.
- Use the firmwaredownload command to start the download process.
- Enter the information required: IP address, account and password, and the directory currently holding the firmware followed by release.plist (for example, /pub/v4.4.0b/release.plist). Do not be confused at this point that the release.plist file does not appear to exist. The switch downloads and installs the software and then reboots.
- Log in as admin and check the status of the update.
This is only part of setting up the backend of your example cluster. The next steps involve using CSM to complete setup of the storage backend, which includes performing node installation for the storage system and GPFS cluster configuration. The fourth and final part of this series covers those processes.
- Review the first two parts of this series:
- See the IBM TotalStorage DS4500 system reference materials:
- Check out the IBM TotalStorage DS4000 EXP710 fiber channel storage expansion unit reference materials:
- Get the latest version of Storage Manager for your hardware from the DS4500 download page.
- Find the IBM TotalStorage SAN Switch H16 switch reference materials at: