Setting up Red Hat Enterprise Linux

Follow these steps to set up your system with Red Hat Enterprise Linux.

Upgrade to a supported version of Red Hat Enterprise Linux

Ensure that your system is on a supported version of Red Hat Enterprise Linux.
  • For POWER8® and x86 systems, the supported version is 7.7.
  • For POWER9™ systems, the supported version is 7.6. Use the instructions below to upgrade.

Red Hat Enterprise Linux 7.5 is no longer supported. If you have Red Hat Enterprise Linux 7.5 installed, upgrade by following these instructions:

sudo subscription-manager release --unset
sudo yum clean all
sudo yum update -y
sudo reboot
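
After the system restarts, it can help to confirm that the upgrade reached the expected release. The following is a minimal sketch and not part of the official procedure. The function names rhel_version and expected_rhel_version are illustrative, and the sketch assumes that /etc/redhat-release contains the usual "release X.Y" string.

```shell
# Illustrative helpers; the names are assumptions, not product commands.

# Print the X.Y release number found in a redhat-release style line on stdin,
# for example "Red Hat Enterprise Linux Server release 7.6 (Maipo)".
rhel_version() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Print the supported release for a platform, as listed in this section.
expected_rhel_version() {
    case "$1" in
        power8|x86) echo 7.7 ;;
        power9)     echo 7.6 ;;
        *)          echo "unknown platform: $1" >&2; return 1 ;;
    esac
}

# Example use on a POWER9 system:
#   [ "$(rhel_version < /etc/redhat-release)" = "$(expected_rhel_version power9)" ] \
#       && echo "release OK" || echo "release mismatch"
```

The parsing is kept separate from the file read so that the check can be tried on any release string before it is run on the target system.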

Red Hat Enterprise Linux operating system and repository setup

  1. Enable common, optional, and extra repo channels.
    IBM® POWER8:
    sudo subscription-manager repos --enable=rhel-7-for-power-le-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-le-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-le-rpms
    IBM POWER9:
    sudo subscription-manager repos --enable=rhel-7-for-power-9-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-9-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-9-rpms
  2. Install packages needed for the installation.
    sudo yum -y install wget nano bzip2
  3. Enable the Fedora Project Extra Packages for Enterprise Linux (EPEL) repository:
    wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    sudo rpm -ihv epel-release-latest-7.noarch.rpm
  4. Load the latest kernel or do a full update:
    • Load the latest kernel:
      sudo yum install kernel-devel
      sudo yum update kernel kernel-devel kernel-tools kernel-tools-libs kernel-bootwrapper
      sudo reboot
    • Do a full update:
      sudo yum install kernel-devel
      sudo yum update
      sudo reboot
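
Step 1 above can be scripted so that the same commands serve both POWER8 and POWER9 hosts. This is a sketch only: rhel_repo_ids is a hypothetical helper name, and the repository IDs are the ones listed in step 1.

```shell
# Print the three repository IDs from step 1 for a given platform.
# rhel_repo_ids is an illustrative name, not a Red Hat command.
rhel_repo_ids() {
    case "$1" in
        power8) prefix=rhel-7-for-power-le ;;
        power9) prefix=rhel-7-for-power-9 ;;
        *)      echo "unknown platform: $1" >&2; return 1 ;;
    esac
    for suffix in optional-rpms extras-rpms rpms; do
        echo "${prefix}-${suffix}"
    done
}

# Example use on a POWER9 system:
#   for repo in $(rhel_repo_ids power9); do
#       sudo subscription-manager repos --enable="$repo"
#   done
```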

System firmware

If you are running on an AC922 system, update the system firmware to at least the following levels before you install the current NVIDIA GPU driver.

The required firmware series and fix levels for AC922 are:

  • 8335-GTG: OP910.30 or higher
  • 8335-GTH: OP920.10 or higher
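
The comparison against these minimum levels can be sketched as follows. This is an illustration under stated assumptions: the helper names min_fw_for_model and fw_at_least are hypothetical, levels are assumed to have the form OP<series>.<fix level>, and a higher series is assumed to rank above any fix level of a lower series. How the installed level is read from the system is not covered here.

```shell
# Print the minimum firmware level for an AC922 model, as listed above.
min_fw_for_model() {
    case "$1" in
        8335-GTG) echo OP910.30 ;;
        8335-GTH) echo OP920.10 ;;
        *)        echo "unknown model: $1" >&2; return 1 ;;
    esac
}

# Succeed if level $1 is at or above level $2.
# Assumption: both look like OP<series>.<fix> and compare numerically,
# series first and fix level second.
fw_at_least() {
    have=${1#OP}; need=${2#OP}
    have_series=${have%%.*}; have_fix=${have#*.}
    need_series=${need%%.*}; need_fix=${need#*.}
    if [ "$have_series" -ne "$need_series" ]; then
        [ "$have_series" -gt "$need_series" ]
    else
        [ "$have_fix" -ge "$need_fix" ]
    fi
}

# Example use, with the installed level typed in by hand:
#   fw_at_least OP910.20 "$(min_fw_for_model 8335-GTG)" || echo "update required"
```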

System firmware updates are available at Fix Central. To find your updates in Fix Central, follow these steps:

  1. Enter 8335-GTG or 8335-GTH as the Product Selector.
  2. Select the appropriate firmware series from the drop-down list.
  3. Click Continue to go to the Select fixes page.
  4. Select the appropriate fix level.
  5. Click Continue to go to the Download options page.

IBM POWER9 specific udev rules

Before you install the NVIDIA components, disable the udev memory auto-onlining rule. The CUDA driver does not function properly while this rule is enabled.
Note: If you upgraded from a previous release, repeat this step after you upgrade to RHEL 7.6.

To disable it, follow these steps:

  1. Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user overridden rules:
    sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
  2. Edit the /etc/udev/rules.d/40-redhat.rules file:
    sudo nano /etc/udev/rules.d/40-redhat.rules
  3. Comment out the entire "Memory hotadd request" section and save the change:
    # Memory hotadd request
    #SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
    #PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
    
    #ENV{.state}="online"
    #PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
    #ATTR{state}=="offline", ATTR{state}="$env{.state}"
    
    #LABEL="memory_hotplug_end"
  4. Optionally, delete the first line of the file, because the copy is now in a directory where updates do not overwrite it:
    # do not edit this file, it will be overwritten on update
  5. Restart the system for the changes to take effect:
    sudo reboot
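
Steps 1 to 3 above can also be done without an editor. The following is a minimal sketch, assuming the section starts with the "# Memory hotadd request" comment and ends with the memory_hotplug_end label exactly as shown in step 3. The function name comment_out_memory_hotadd is illustrative.

```shell
# Comment out the "Memory hotadd request" section of a rules file read from
# stdin and write the result to stdout. Blank lines and lines that are
# already comments are left unchanged, so running it twice is harmless.
comment_out_memory_hotadd() {
    sed '/^# Memory hotadd request/,/^#*LABEL="memory_hotplug_end"/ s/^[^#]/#&/'
}

# Example use (takes the place of steps 1 to 3):
#   comment_out_memory_hotadd < /lib/udev/rules.d/40-redhat.rules |
#       sudo tee /etc/udev/rules.d/40-redhat.rules > /dev/null
```

Review the resulting /etc/udev/rules.d/40-redhat.rules file before restarting, because the range match depends on the section layout shown in step 3.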

Remove previously installed CUDA and NVIDIA drivers

The CUDA Toolkit, cuDNN and NCCL are provided as Conda packages and do not require separate installations. The GPU driver must still be installed separately.

Note: If you require the CUDA Toolkit on the host for uses beyond WML CE, consult NVIDIA's CUDA documentation for help upgrading the GPU driver without disturbing your existing Toolkit installation.

Before installing the updated GPU driver, uninstall any previously installed CUDA and NVIDIA drivers. Follow these steps:

  1. Remove all CUDA Toolkit and GPU driver packages.

    You can display installed CUDA and driver packages by running these commands:

    rpm -qa | egrep 'cuda.*(9-2|10-0)'
    rpm -qa | egrep '(cuda|nvidia).*(396|410)\.'

    Verify the list, then remove the packages with yum remove.

  2. Remove any CUDA Toolkit and GPU driver repository packages.

    These should have been included in step 1, but you can confirm with this command:

    rpm -qa | egrep '(cuda|nvidia).*repo'

    Use yum remove to remove any that remain.

  3. Clean the yum repository:
    sudo yum clean all
  4. Remove cuDNN and NCCL:
    sudo rm -rf /usr/local/cuda /usr/local/cuda-9.2 /usr/local/cuda-10.0
  5. Reboot the system to unload the GPU driver:
    sudo shutdown -r now
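
The two queries in step 1 can be combined into a single review list. This is a sketch, not part of the official procedure: old_cuda_packages is a hypothetical helper name, and the patterns are the ones shown in step 1.

```shell
# Read package names on stdin and print those that match either pattern
# from step 1 (CUDA 9.2 or 10.0 toolkit packages, 396 or 410 driver packages).
# Each matching name is printed once.
old_cuda_packages() {
    grep -E -e 'cuda.*(9-2|10-0)' -e '(cuda|nvidia).*(396|410)\.' | sort -u
}

# Example use: review the list before passing any name to yum remove.
#   rpm -qa | old_cuda_packages
```

Keeping the removal itself as a manual step preserves the "verify the list" check from step 1.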

Install the GPU driver

Many of the deep learning packages require the GPU driver packages to be downloaded from NVIDIA. See the WML CE prerequisites for the required and recommended versions of these components.

Install the GPU driver by following these steps:

  1. Download the NVIDIA GPU driver:
    • Go to NVIDIA Driver Download.
    • Select Product Type: Tesla.
    • Select Product Series: P-Series (for Tesla P100) or V-Series (for Tesla V100).
    • Select Product: Tesla P100 or Tesla V100.
    • Select Operating System: Linux POWER LE RHEL 7. Click Show all Operating Systems if your version is not available.
    • Select CUDA Toolkit: 10.1.
    • Click SEARCH to go to the download link.
    • Click Download to download the driver.
  2. Install the GPU driver repository and the driver package:
    sudo rpm -ivh nvidia-driver-local-repo-rhel7-418.*.rpm
    sudo yum install nvidia-driver-latest-dkms
  3. Set nvidia-persistenced to start at boot:
    sudo systemctl enable nvidia-persistenced
  4. Reboot the system.
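
After the reboot, a quick check can confirm which driver branch is loaded. The following is a sketch under assumptions: driver_branch is a hypothetical helper name, and the 418 branch is inferred from the repository package name in step 2. See the WML CE prerequisites for the versions that are actually required.

```shell
# Print the branch (major version) of a driver version string read from
# stdin. Only the first line is used; "418.87.00" below is an example value.
driver_branch() {
    sed -n '1s/^\([0-9][0-9]*\)\..*/\1/p'
}

# Example use after the reboot:
#   nvidia-smi --query-gpu=driver_version --format=csv,noheader | driver_branch
#   systemctl is-enabled nvidia-persistenced
```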

Installing Mellanox drivers

To use InfiniBand with IBM Distributed Deep Learning and SnapML, install the latest Mellanox driver from the Mellanox IBM Systems and Storage page.

Installing Perl

To use Spectrum MPI with IBM Distributed Deep Learning and SnapML, Perl must be installed on the system. Install Perl with the following command:
sudo yum install perl