ESS 5000 Common installation instructions

The following common instructions need to be run for a new installation or an upgrade of an ESS 5000 system.
Note: If you have protocol nodes, add them to the commands provided in these instructions. The default /etc/hosts file has host names prt1 and prt2 for protocol nodes. You might have more than two protocol nodes.
  1. Log in to the EMS node by using the management IP (set up by SSR by using the provided worksheet). The default password is ibmesscluster.
  2. Set up a campus or a public connection (interface enP1p8s0f2). Connect an Ethernet cable from C11-T3 on the EMS node to your lab network. This connection provides a way to access the GUI or the ESA agent (call home) from outside of the management network. The container creates a bridge to the management network, so a campus connection is strongly recommended.
    Note: Setting up a campus or public connection is recommended but not mandatory. If you do not set one up, you temporarily lose your connection when the container bridge is created in a later step.

    Use this method only to configure the campus network, not any other network on the EMS node. Do not modify the T1, T2, or T4 connections after SSR sets them; this includes renaming the interfaces, setting IP addresses, or making any other changes to those interfaces. If T1 or T2 must be changed after SSR is finished, use the SSR method only.

    You can use the nmtui command to set the IP address of the campus interface. For more information, see Configuring IP networking with nmtui.
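    For example, the same result can be achieved non-interactively with nmcli; the connection name (campus), IP address, and gateway below are placeholders and must be replaced with your campus network values:
    nmcli con add type ethernet ifname enP1p8s0f2 con-name campus ipv4.method manual ipv4.addresses 203.0.113.20/24 ipv4.gateway 203.0.113.1
    nmcli con up campus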

  3. Complete the /etc/hosts file on the EMS node. This file must contain the low-speed (management) and high-speed (cluster) IP addresses, FQDNs, and short names. Each high-speed name must be the corresponding low-speed name plus a suffix (for example, essio1-hs is the high-speed name for the low-speed name essio1). This file must also contain the container host name and IP address.
    127.0.0.1 localhost localhost.localdomain.local localhost4 localhost4.localdomain4
    
    ## Management IPs 192.168.45.0/24
    192.168.45.20 ems1.localdomain.local ems1
    192.168.45.21 essio1.localdomain.local essio1
    192.168.45.22 essio2.localdomain.local essio2
    192.168.45.23 prt1.localdomain.local prt1
    192.168.45.24 prt2.localdomain.local prt2
    
    ## High-speed IPs 10.0.11.0/24
    10.0.11.1 ems1-hs.localdomain.local ems1-hs
    10.0.11.2 essio1-hs.localdomain.local essio1-hs
    10.0.11.3 essio2-hs.localdomain.local essio2-hs
    10.0.11.4 prt1-hs.localdomain.local prt1-hs
    10.0.11.5 prt2-hs.localdomain.local prt2-hs
    
    ## Container info 192.168.45.0/24
    192.168.45.80 cems0.localdomain.local cems0
    
    ## Protocol CES IPs
    10.0.11.100 prt_ces1.localdomain.local prt_ces1
    10.0.11.101 prt_ces1.localdomain.local prt_ces1
    10.0.11.102 prt_ces2.localdomain.local prt_ces2
    10.0.11.103 prt_ces2.localdomain.local prt_ces2
    Note:
    • localdomain.local is just an example and cannot be used for deployment. You must change it to a valid fully qualified domain name (FQDN) during the /etc/hosts setup. The domain must be the same for each network subnet that is defined. Also, ensure that you set the domain on the EMS node by using hostnamectl set-hostname NAME; a short example follows this note.

      NAME must be the FQDN of the management interface (T1) of the EMS node. If you need to set other names for campus, or other interfaces, those names must be the alias but not the main host name as returned by the hostnamectl command.

      You can set up the EMS FQDN manually or wait until you are prompted when the ESS deployment binary is started. At that time, the script confirms the FQDN and gives you a chance to make changes.

    • If you are planning to set up an ESS 3000 system with the ESS 5000 EMS node, add the ESS 3000 host names to /etc/hosts by using the same structure (low-speed (management) and high-speed (cluster) IP addresses, FQDNs, and short names).
    • Do not use any special characters, underscores, or dashes in the host names other than the high-speed suffix (for example, -hs). Doing so might cause issues with the deployment procedure.
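    The following is a minimal sketch of checking the naming requirements in this note; ems1.example.com is a placeholder and must be replaced with the actual FQDN of the EMS management (T1) interface:
    hostnamectl set-hostname ems1.example.com   # placeholder FQDN; use your management (T1) FQDN
    hostname -f                                 # verify that the FQDN is returned
    getent hosts essio1-hs                      # verify that a high-speed name resolves from /etc/hosts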
  4. Clean up the old containers and images.
    Note: Typically, this is applicable only for upgrades.
    1. List the containers.
      podman ps -a
    2. Stop and remove the containers.
      podman stop ContainerName
      podman rm ContainerName -f
    3. List the images.
      podman images
    4. Remove the images.
      podman image rm ImageID -f
    5. [Recommended] Remove container bridges as follows.
      1. List the currently configured bridges.
        nmcli c
      2. Clean up any existing bridges before the new container is set up. The bridge names to remove are mgmt_bridge and fsp_bridge.
        nmcli c del BridgeName
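    To confirm that the cleanup is complete, list what remains; no ESS containers or images should be left, and the connection list should no longer include mgmt_bridge or fsp_bridge:
    podman ps -a
    podman images
    nmcli -f NAME,TYPE c show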
  5. Extract the installation package.
    Note: Ensure that you check the version that is installed from manufacturing (SSR worksheet). If there is a newer version available on Fix Central, replace the existing image in /home/deploy with the new image and then remove the old tgz file before doing this step.
    cd /home/deploy
    
    tar zxvf ess5000_6.0.1.2_1204-02_dme.tgz
    
    ess5000_6.0.1.2_1204-02_dme.sh
    ess5000_6.0.1.2_1204-02_dme.sh.sha256
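    Optionally, before accepting the license in the next step, you can compare the checksum of the extracted installer against the shipped .sha256 file; the two values should match:
    sha256sum ess5000_6.0.1.2_1204-02_dme.sh
    cat ess5000_6.0.1.2_1204-02_dme.sh.sha256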
    
  6. Accept the license and install the accepted image.
    ./ess5000_6.0.1.2_1204-02_dme.sh --text-only --start-container
    Note:
    • The --install-image flag will be deprecated soon. Stop and remove any existing container.
    • The --text-only flag extracts the contents of the tgz file after you accept the license agreement. Immediately afterward, the --start-container flag performs the following steps automatically:
      • Runs essmkyml, which prompts you to:
        • Confirm the EMS FQDN and change it if needed.
        • Provide the container short name.
        • Provide a free IP address on the FSP subnet for the container FSP connection.

    Press 1 to accept the license after reading the agreement.

    Example of contents of the extracted installation package:
    ess5000_6.0.1.2_1204-02_dme.dir/
    ess5000_6.0.1.2_1204-02_dme.dir/ess5000_6.0.1.2_1204-02_dme.tar
    ess5000_6.0.1.2_1204-02_dme.dir/ess5000_6.0.1.2_1204-02_dme_binaries.iso
    ess5000_6.0.1.2_1204-02_dme.dir/rhel-8.1-server-ppc64le.iso
    ess5000_6.0.1.2_1204-02_dme.dir/podman_rh8.tgz
    ess5000_6.0.1.2_1204-02_dme.dir/essmgr
    ess5000_6.0.1.2_1204-02_dme.dir/essmgr.yml
    ess5000_6.0.1.2_1204-02_dme.dir/Release_note.ess5000_6.0.1.2_1204-02_dme.txt
    ess5000_6.0.1.2_1204-02_dme.dir/classes/
    ess5000_6.0.1.2_1204-02_dme.dir/classes/essmgr_yml.py
    ess5000_6.0.1.2_1204-02_dme.dir/classes/__init__.py
    ess5000_6.0.1.2_1204-02_dme.dir/essmkyml
    
    For this step, you must provide these inputs:
    • Container name (must be in /etc/hosts or be resolvable by using DNS)
    • Container FSP IP address (must be on the same network block that is set on C11-T2)
    • Confirmation of the EMS FQDN (must match what is set for the management IP in /etc/hosts). If this value needs to be changed or set, essmkyml helps with that task.
    • The EMS host name must be on the management network (also called xCAT). Names on other networks can be aliases (A) or canonical names (CNAME) in DNS or in the /etc/hosts file.
      Is the current EMS FQDN c145f05zems06.gpfs.net correct (y/n):
    • Remember not to add the DNS domain localdomain to the input:
      Please type the desired and resolvable short hostname [ess5k-cems0]: cems0
    • Remember that the IP address must belong to the 10.0.0.x/24 network block (It is assumed that the recommended FSP network was used):
      Please type the FSP IP of the container [10.0.0.5]: 10.0.0.80
    Note: The values in brackets ([ ]) are examples or the last entered values.

    If all of the checks pass, the essmgr.yml file is written and you can proceed to bridge creation, if applicable, and running the container.

    Note: The original essmgr.yml file and detailed logs of checks that are performed are stored in the ./logs directory.
    At this point, if all checks are successful, the image is loaded and the container is started. Example:
    ESS 5000 CONTAINER root@cems0:/ #
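    If needed, you can confirm the container state from a separate EMS session (outside the container); for example:
    podman ps --format "{{.Names}}  {{.Status}}"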
  7. Run the config load command.
    essrun -N essio1,essio2,ems1 config load -p ibmesscluster
    Note:
    • Use the low-speed management host names. Specify the root password with -p.
    • The password (-p) is the root password of the node. By default, it is ibmesscluster. Consider changing the root password after deployment is complete.
    • This command attempts to connect to each node's FSP interface through IPMI by using the default password (serial number). If the password has changed, you are prompted to enter the new password.
      To determine the serial number, do the following:
      1. Log in to the node by using the management IP address.
      2. Issue this command: cat /proc/device-tree/system-id

    After this command is run, you can use -G for future essrun steps (for example, -G ess_ppc64le).

Instructions if the latest ESS 5000 package version is the same as the one from manufacturing

If the ESS version on the system is already at the latest version shipped from manufacturing, proceed directly to the network bond creation step. If the version is different, use the instructions in the next section.

  1. Create network bonds.
    essrun -G ess_ppc64le network --suffix=-hs
    essrun -N ems1 network --suffix=-hs
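    After the bonds are created, a quick sanity check before the formal network test in the next step is to confirm from an I/O node that the high-speed addresses are up and reachable; for example, from the container:
    ssh essio1 "ip -br addr show"
    ssh essio1 "ping -c 2 essio2-hs"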
  2. Run the network test.

    This test uses nsdperf to determine if the newly created network bonds are healthy.

    SSH from the container to an I/O node or the EMS node.
    ssh essio1
    ESSENV=TEST essnettest -N essio1,essio2 --suffix=-hs
    This command performs the test (with an optional RDMA test afterward if InfiniBand is used). Ensure that the output contains no errors indicating that dropped packets exceeded thresholds. When the test is completed, type exit to return to the container.
  3. Create the cluster.
    essrun -G ess_ppc64le cluster --suffix=-hs
  4. Add the EMS node to the cluster.
    essrun -N essio1 cluster --add-ems ems1 --suffix=-hs
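    To confirm that the cluster is formed and that the EMS node has joined, you can check from any cluster node by using standard IBM Spectrum Scale commands; for example:
    ssh essio1 "mmlscluster"
    ssh essio1 "mmgetstate -a"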
  5. Create the file system.
    essrun -G ess_ppc64le filesystem --suffix=-hs
    Note:
    • By default, this command attempts to use all the available space. If you need to create multiple file systems or a CES shared root file system for protocol nodes, consider using less space. For example:
      essrun -G ess_ppc64le filesystem --suffix=-hs --size 80%
    • This step creates combined metadata and data vdisk sets by using a default RAID code and block size. You can use additional flags to customize the vdisk sets, or use the mmvdisk command directly for advanced configurations.
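    After the file system is created, one way to review the resulting vdisk sets and file system settings from an I/O node is:
    ssh essio1 "mmvdisk vdiskset list --vdisk-set all"
    ssh essio1 "mmlsfs all"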

Instructions if the latest ESS 5000 package version is not the same as the one from manufacturing

Note: Use this procedure if the ESS version from manufacturing is not the latest.
  1. Update the EMS node.
    Important: [Online update only] Ensure that all ESS 5000 nodes are active by first running this command from one of the cluster nodes: mmgetstate -N ess5k_ppc64le. If any nodes are not active, quit the upgrade procedure and resolve this issue before proceeding with the upgrade.
    essrun -N ems1 update --offline
    
    Please enter 'accept' indicating that you want to update the following list of nodes: ems1
    >>> accept
    Note: If the kernel is changed, you are prompted to leave the container, reboot the EMS node, restart the container, and run this command again.
    For example:
    essrun -N ems1 update --offline
    exit
    systemctl reboot
    Navigate back to the ESS 6.0.1.2 extracted directory and run the following commands:
    ./essmgr -r
    essrun -N ems1 update --offline
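    After the reboot and before restarting the container, you can optionally confirm that the EMS node booted the updated kernel:
    uname -r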
  2. Update the I/O nodes.
    essrun -G ess_ppc64le update --offline
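    A quick way to confirm that both I/O nodes are at the same level after the update is to compare their kernels from the container; for example:
    ssh essio1 "uname -r"
    ssh essio2 "uname -r"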
  3. Create network bonds.
    essrun -G ess_ppc64le network --suffix=-hs
    essrun -N ems1 network --suffix=-hs
  4. Run the network test.

    This test uses nsdperf to determine if the newly created network bonds are healthy.

    SSH from the container to an I/O node or the EMS node.
    ssh essio1
    ESSENV=TEST essnettest -N essio1,essio2 --suffix=-hs
    This command performs the test (with an optional RDMA test afterward if InfiniBand is used). Ensure that the output contains no errors indicating that dropped packets exceeded thresholds. When the test is completed, type exit to return to the container.
  5. Create the cluster.
    essrun -G ess_ppc64le cluster --suffix=-hs
  6. Add the EMS node to the cluster.
    essrun -N essio1 cluster --add-ems ems1 --suffix=-hs
  7. Create the file system.
    essrun -G ess_ppc64le filesystem --suffix=-hs
    Note:
    • By default, this command attempts to use all the available space. If you need to create multiple file systems or a CES shared root file system for protocol nodes, consider using less space. For example:
      essrun -G ess_ppc64le filesystem --suffix=-hs --size 80%
    • This step creates combined metadata and data vdisk sets by using a default RAID code and block size. You can use additional flags to customize the vdisk sets, or use the mmvdisk command directly for advanced configurations.

Final setup instructions

  1. From the EMS node (outside of the container), configure and start the performance monitoring collector.
    mmperfmon config generate --collectors ems1-hs
  2. From the EMS node (outside of the container), configure and start the performance monitoring sensors.
    mmchnode --perfmon -N ems1-hs,essio1-hs,essio2-hs
  3. Capacity and fileset quota monitoring is not enabled in the GUI by default. You must correctly update the values and restrict collection to the EMS node only.
    1. To modify the GPFS Disk Capacity collection interval, run the following command.
      mmperfmon config update GPFSDiskCap.restrict=EMSNodeName GPFSDiskCap.period=PeriodInSeconds

      The recommended period is 86400 so that the collection is done once per day.
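      For example, assuming that ems1-hs is the EMS node name shown in the mmlscluster output and using the recommended once-per-day period:
      mmperfmon config update GPFSDiskCap.restrict=ems1-hs GPFSDiskCap.period=86400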

    2. To restrict GPFS Fileset Quota to run on the management server node only, run the following command.
      mmperfmon config update GPFSFilesetQuota.period=600 GPFSFilesetQuota.restrict=EMSNodeName

      Here the EMSNodeName must be the name shown in the mmlscluster output.

      Note: To enable quota monitoring, file system quota checking must be enabled. Refer to the mmchfs -Q and mmcheckquota commands in IBM Spectrum Scale: Command and Programming Reference; a brief example follows.
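      For example, a minimal sketch of enabling quota checking, assuming a file system named fs1:
      mmchfs fs1 -Q yes
      mmcheckquota fs1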
  4. Verify that the values are set correctly in the performance monitoring configuration by running the mmperfmon config show command on the EMS node. Ensure that GPFSDiskCap.period is properly set, and GPFSFilesetQuota and GPFSDiskCap are both restricted to the EMS only.
    Note: If you are moving from manual configuration to automatic configuration, all sensors are reset to their defaults. Make the necessary changes by using the mmperfmon command to customize your environment accordingly. For information about how to configure various sensors by using mmperfmon, see Manually installing IBM Spectrum Scale GUI.
  5. Start the performance collector on the EMS node.
    systemctl start pmcollector
  6. Start the GUI.
    systemctl start gpfsgui
    1. Create the GUI admin user.
      /usr/lpp/mmfs/gui/cli/mkuser UserName -g SecurityAdmin
    2. In a web browser, enter the public or campus IP address with https and walk through the System Setup wizard instructions.
  7. Log in to each node and run the following command.
    essinstallcheck -N localhost

    Doing this step verifies that all software and cluster versions are up to date.

  8. From the EMS node, outside of the container, run the following final health check commands to verify your system health.
    gnrhealthcheck
    mmhealth node show -a
  9. Set the time zone and set up Chrony.

    Before getting started, ensure that Chrony and the time zone are set correctly on the EMS and I/O nodes. Refer to How to set up chronyd (time server) to perform these tasks before proceeding.

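    A minimal sketch of the basic commands involved; the time zone and NTP server shown are placeholders, and the complete procedure is in the linked topic:
    timedatectl set-timezone America/New_York
    echo "server ntp.example.com iburst" >> /etc/chrony.conf
    systemctl enable --now chronyd
    chronyc sources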
  10. Set up call home. For more information, see Drive call home.
    The supported call home configurations are:
    • Software call home
    • Node call home (including for protocol nodes)
    • Drive call home
  11. Refer to Client node tuning recommendations.