Setting up Red Hat Enterprise Linux

Follow these steps to set up your IBM POWER8 or POWER9 system with Red Hat Enterprise Linux.

Upgrade to 7.6

Red Hat Enterprise Linux 7.5 is no longer supported. If you have RHEL 7.5 installed, upgrade to 7.6:

subscription-manager release --unset
yum clean all
yum update -y
reboot
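
After the reboot, the release string can be read back to confirm the upgrade. The following is a minimal sketch, not part of the official procedure: rhel_release is a hypothetical helper name, and the script assumes the release string is stored in /etc/redhat-release, as on stock RHEL 7 systems.

```shell
#!/bin/sh
# rhel_release is a hypothetical helper: it prints the first MAJOR.MINOR
# number found in a release string such as the one in /etc/redhat-release.
rhel_release() {
    printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+' | head -n1
}

# Only report when the release file exists (assumed location on RHEL 7).
if [ -r /etc/redhat-release ]; then
    version=$(rhel_release "$(cat /etc/redhat-release)")
    if [ "$version" = "7.6" ]; then
        echo "RHEL $version detected"
    else
        echo "Found release '$version'; this guide expects 7.6" >&2
    fi
fi
```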

Red Hat Enterprise Linux operating system and repository setup

  1. Enable common, optional, and extra repo channels.
    IBM® POWER8:
    sudo subscription-manager repos --enable=rhel-7-for-power-le-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-le-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-le-rpms
    IBM POWER9:
    sudo subscription-manager repos --enable=rhel-7-for-power-9-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-9-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-9-rpms
  2. Install packages needed for the installation.
    sudo yum -y install wget nano bzip2
  3. Enable the Fedora Project EPEL (Extra Packages for Enterprise Linux) repo:
    wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    sudo rpm -ihv epel-release-latest-7.noarch.rpm
  4. Load the latest kernel or do a full update:
    • Load the latest kernel:
      sudo yum update kernel kernel-devel kernel-tools kernel-tools-libs kernel-bootwrapper
      sudo reboot
    • Do a full update:
      sudo yum update
      sudo reboot
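
The repository names in step 1 differ only in their prefix, so the choice between POWER8 and POWER9 can be scripted. The sketch below works under that assumption: repo_ids is a hypothetical helper, the processor name is passed in by hand rather than detected, and the final pipeline only prints the subscription-manager commands (a dry run) instead of enabling anything.

```shell
#!/bin/sh
# repo_ids is a hypothetical helper: given POWER8 or POWER9, it prints the
# three repository IDs listed in step 1, one per line.
repo_ids() {
    case $1 in
        POWER8) prefix=rhel-7-for-power-le ;;
        POWER9) prefix=rhel-7-for-power-9 ;;
        *) echo "unsupported processor: $1" >&2; return 1 ;;
    esac
    for channel in optional-rpms extras-rpms rpms; do
        printf '%s-%s\n' "$prefix" "$channel"
    done
}

# Dry run: print the commands for review instead of executing them.
cpu=POWER9   # change to POWER8 as needed
repo_ids "$cpu" | sed 's/^/sudo subscription-manager repos --enable=/'
```

Printing the commands first makes it easy to check them against step 1 before running them.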

System firmware

If you are running on an AC922 system, update the system firmware to at least the following levels before you install the current NVIDIA GPU driver.

The firmware series and fix levels that are required for AC922 for the current NVIDIA GPU driver are:

  • 8335-GTG: OP910.30 or higher
  • 8335-GTH: OP920.10 or higher

System firmware updates are available at Fix Central. To find your updates in Fix Central, follow these steps:

  1. Enter 8335-GTG or 8335-GTH as the Product Selector.
  2. Select the appropriate firmware series from the drop-down list.
  3. Click Continue to go to the Select fixes page.
  4. Select the appropriate fix level.
  5. Click Continue to go to the Download options page.

IBM POWER9™ specific udev rules

Before you install the NVIDIA components, the udev Memory Auto-Onlining Rule must be disabled for the CUDA driver to function properly.
Note: If you upgraded from a previous release, repeat this step with RHEL 7.6.

To disable it, follow these steps:

  1. Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user overridden rules.
    sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
  2. Edit the /etc/udev/rules.d/40-redhat.rules file.
    sudo nano /etc/udev/rules.d/40-redhat.rules
  3. Comment out the entire "Memory hotadd request" section and save the change:
    # Memory hotadd request
    #SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
    #PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
    
    #ENV{.state}="online"
    #PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
    #ATTR{state}=="offline", ATTR{state}="$env{.state}"
    
    #LABEL="memory_hotplug_end"
  4. Optionally, delete the first line of the file, because the copy in /etc/udev/rules.d/ is not overwritten on update.
    # do not edit this file, it will be overwritten on update
  5. Restart the system for the changes to take effect.
    sudo reboot
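
The copy and edit steps above can also be done without an editor. The following is a sketch rather than the documented procedure: comment_out_memory_hotadd is a hypothetical helper that reads a rules file and writes a modified copy to standard output, and it assumes the section starts at the "# Memory hotadd request" comment and ends at the LABEL="memory_hotplug_end" line, as in the listing above.

```shell
#!/bin/sh
# comment_out_memory_hotadd FILE
# Prints FILE with every uncommented, non-empty line in the
# "Memory hotadd request" section prefixed with '#'.
# Hypothetical helper; lines outside the section are left unchanged.
comment_out_memory_hotadd() {
    sed '/^# Memory hotadd request/,/^LABEL="memory_hotplug_end"/ s/^[^#]/#&/' "$1"
}

# Example (run as root so that the override file can be written):
#   comment_out_memory_hotadd /lib/udev/rules.d/40-redhat.rules \
#       > /etc/udev/rules.d/40-redhat.rules
```

Because the function writes to standard output, the result can be reviewed with a pager or diff before it is redirected into /etc/udev/rules.d/.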

Remove previously installed CUDA and NVIDIA drivers

The CUDA Toolkit, cuDNN, and NCCL are provided as Conda packages and no longer require separate installations. The GPU driver must still be installed separately.

Note: If you require the CUDA Toolkit on the host for uses beyond PowerAI, consult NVIDIA's CUDA documentation for help upgrading the GPU driver without disturbing your existing Toolkit installation.

Before installing the updated GPU driver, uninstall any previously installed CUDA and NVIDIA drivers. Follow these steps:

  1. Remove all CUDA Toolkit and GPU driver packages.

    You can display installed CUDA and driver packages by running these commands:

    rpm -qa | egrep 'cuda.*(9-2|10-0)'
    rpm -qa | egrep '(cuda|nvidia).*(396|410)\.'

    Verify the list, then remove the listed packages with yum remove.

  2. Remove any CUDA Toolkit and GPU driver repository packages.

    These should have been included in step 1, but you can confirm with this command:

    rpm -qa | egrep '(cuda|nvidia).*repo'

    Use yum remove to remove any that remain.

  3. Clean the yum repository:
    sudo yum clean all
  4. Remove cuDNN and NCCL:
    sudo rm -rf /usr/local/cuda /usr/local/cuda-9.2 /usr/local/cuda-10.0
  5. Reboot the system to unload the GPU driver:
    sudo shutdown -r now
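
The three package queries in steps 1 and 2 can be combined into one filter so that the full removal list is reviewed in a single pass. This is a sketch only: filter_cuda_packages is a hypothetical helper that reuses the patterns shown above, and it lists packages without removing anything.

```shell
#!/bin/sh
# filter_cuda_packages
# Reads package names on standard input and prints those that match any of
# the three patterns used in steps 1 and 2, sorted and without duplicates.
# Hypothetical helper; it does not remove anything.
filter_cuda_packages() {
    grep -E -e 'cuda.*(9-2|10-0)' \
            -e '(cuda|nvidia).*(396|410)\.' \
            -e '(cuda|nvidia).*repo' | sort -u
}

# List candidates only when rpm is available on this host.
if command -v rpm >/dev/null 2>&1; then
    rpm -qa | filter_cuda_packages
fi
```

After checking the output, the listed names can be passed to yum remove as described in step 1.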

Install the GPU driver

The Deep Learning packages require the GPU driver packages to be downloaded from NVIDIA. See the PowerAI prerequisites for the required and recommended versions of these components.

Install the GPU driver by following these steps:

  1. Download the NVIDIA GPU driver:
    • Go to NVIDIA Driver Download.
    • Select Product Type: Tesla
    • Select Product Series: P-Series (for Tesla P100) or V-Series (for Tesla V100).
    • Select Product: Tesla P100 or Tesla V100
    • Select Operating System: Linux POWER LE RHEL 7. Click Show all Operating Systems if your version is not available.
    • Select CUDA Toolkit: 10.1
    • Click SEARCH to go to the download link.
    • Click Download to download the driver.
  2. Install the GPU driver repository and cuda-drivers.
    sudo rpm -ivh nvidia*driver-local-repo-rhel7-418.*.rpm
    sudo yum install cuda-drivers
  3. Set nvidia-persistenced to start at boot:
    sudo systemctl enable nvidia-persistenced
  4. Reboot the system.
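
Once the system is back up, the loaded driver version can be compared with the 418 branch named in the repository package above. The following is a sketch under stated assumptions: driver_branch is a hypothetical helper, and the check is skipped when nvidia-smi is not on the PATH. See the PowerAI prerequisites for the exact versions that are required.

```shell
#!/bin/sh
# driver_branch VERSION
# Prints the part of a driver version before the first dot,
# for example the branch number of a MAJOR.MINOR version string.
# Hypothetical helper.
driver_branch() {
    printf '%s\n' "${1%%.*}"
}

# Query the driver only when nvidia-smi is installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    # One line is printed per GPU; the first line is enough here.
    installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
    if [ "$(driver_branch "$installed")" = "418" ]; then
        echo "GPU driver $installed is on the 418 branch"
    else
        echo "Unexpected GPU driver version: $installed" >&2
    fi
fi
```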

Installing Mellanox drivers

To use InfiniBand with IBM Distributed Deep Learning and SnapML, install the latest Mellanox driver from http://www.mellanox.com/page/firmware_table_IBM_SystemP.