PowerAI system setup

Find information to set up your operating system, repository, and NVIDIA components.

For AC922 systems that use the latest NVIDIA GPU driver, the GPU driver requires other updates that must be installed in a specific order:
  1. Latest Linux kernel for RHEL 7.5 ALT

    You can also run PowerAI in a container on a bare metal system that is running Ubuntu 18.04.

  2. Recent AC922 system firmware:
    • 8335-GTG: OP910.24
    • 8335-GTH: OP920.02
  3. NVIDIA GPU driver 410.72 or higher
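
The driver minimum above can be checked with a small version comparison. The following is a minimal sketch, not part of the PowerAI tooling: version_ge is a hypothetical helper name, and it assumes GNU sort with the -V (version sort) option.

```shell
# version_ge A B: succeed when version A is equal to or newer than version B.
# Assumes GNU sort -V (version sort) is available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Possible use once a driver is installed (nvidia-smi must be on PATH):
#   installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   version_ge "$installed" 410.72 || echo "GPU driver $installed is older than 410.72"
```

Version sort is used instead of a plain numeric comparison so that a level such as 410.104 is treated as newer than 410.72.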

Operating system

The Deep Learning packages require specific operating systems:

Table 1. Supported configurations
Host OS                         Container OS
Red Hat Enterprise Linux 7.5    Ubuntu 18.04
Ubuntu 18.04                    Ubuntu 18.04
Red Hat Enterprise Linux 7.5    none (Bare metal)

For more information about installing operating systems on IBM Power Systems servers, see Quick start guides for Linux on IBM® Power System servers.

Red Hat Enterprise Linux operating system and repository setup

  1. Enable common, optional, and extra repo channels.

    IBM POWER8:

    sudo subscription-manager repos --enable=rhel-7-for-power-le-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-le-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-le-rpms

    IBM POWER9:

    sudo subscription-manager repos --enable=rhel-7-for-power-9-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-9-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-9-rpms
  2. Install packages needed for the installation.
    sudo yum -y install wget nano bzip2
  3. Enable Fedora Project EPEL (Extra Packages for Enterprise Linux) repo:
    wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    sudo rpm -ihv epel-release-latest-7.noarch.rpm
  4. Load the latest kernel or do a full update:
    • Load the latest kernel:
      sudo yum update kernel kernel-tools kernel-tools-libs kernel-bootwrapper
      sudo reboot
    • Do a full update:
      sudo yum update
      sudo reboot
      Important: RHEL 7.6 was released at the end of October, but is not yet supported by PowerAI. Running yum update on its own might upgrade a 7.5 system to 7.6. To avoid this, customers with a standard RHEL subscription can pin the release:
      sudo subscription-manager release --set=7.5
      Customers should consult Red Hat if they are unsure how to avoid an unintended upgrade.
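
The repository and release steps above can be scripted. This is a minimal sketch under stated assumptions: rhel_repo_ids and rhel_version_from are hypothetical helper names, the "power8" and "power9" labels are chosen only for this example, and /etc/redhat-release is assumed to contain a string of the form "... release 7.5 (...)".

```shell
# rhel_repo_ids GENERATION: print the three repository IDs from step 1.
# GENERATION is "power8" or "power9" (labels invented for this sketch).
rhel_repo_ids() {
    case "$1" in
        power8) prefix="rhel-7-for-power-le" ;;
        power9) prefix="rhel-7-for-power-9" ;;
        *) echo "unknown processor generation: $1" >&2; return 1 ;;
    esac
    printf '%s\n' "${prefix}-optional-rpms" "${prefix}-extras-rpms" "${prefix}-rpms"
}

# rhel_version_from STRING: extract "major.minor" from a release string.
rhel_version_from() {
    printf '%s\n' "$1" | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Possible use:
#   for repo in $(rhel_repo_ids power9); do
#       sudo subscription-manager repos --enable="$repo"
#   done
#   [ "$(rhel_version_from "$(cat /etc/redhat-release)")" = "7.5" ] ||
#       echo "warning: this system is not at RHEL 7.5"
```

Running the release check after an update is one way to notice an unintended move to 7.6 before the NVIDIA components are installed.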

Ubuntu operating system and repository setup

  1. Install packages needed for the installation
    sudo apt-get install -y wget nano apt-transport-https ca-certificates curl software-properties-common
  2. Load the latest kernel
    sudo apt-get install linux-headers-$(uname -r)
    sudo reboot
  3. Alternatively, instead of step 2, do a full update:
    sudo apt-get update
    sudo apt-get dist-upgrade
    sudo reboot

System firmware

If you are running on an AC922 system, you need to update the firmware. Ensure that the system firmware is updated to at least the following levels before you install the current NVIDIA GPU driver.

The firmware series and fix levels that are required for AC922 for the current NVIDIA GPU driver are:

  • 8335-GTG: OP910.24 or higher
  • 8335-GTH: OP920.02 or higher
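
The minimum levels above can be encoded as a lookup plus a comparison. This is only a sketch: required_firmware and firmware_ok are hypothetical names, the installed level is passed in as an argument because how it is read depends on your system, and treating levels as version-ordered numbers after the "OP" prefix is an assumption.

```shell
# required_firmware MODEL: print the minimum level listed for a machine type-model.
required_firmware() {
    case "$1" in
        8335-GTG) echo "OP910.24" ;;
        8335-GTH) echo "OP920.02" ;;
        *) return 1 ;;
    esac
}

# firmware_ok MODEL LEVEL: succeed when LEVEL meets the minimum for MODEL.
# Returns 2 for a model that is not in the list above.
firmware_ok() {
    required=$(required_firmware "$1") || return 2
    # Strip the "OP" prefix and compare the remainder with GNU sort -V.
    have=${2#OP}
    need=${required#OP}
    [ "$(printf '%s\n%s\n' "$have" "$need" | sort -V | head -n 1)" = "$need" ]
}

# Possible use:
#   firmware_ok 8335-GTH OP920.10 || echo "update the system firmware first"
```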

System firmware updates are available at Fix Central. To find your updates in Fix Central, follow these steps:

  1. Enter 8335-GTG or 8335-GTH as the Product Selector.
  2. Select the appropriate firmware series from the drop-down list.
  3. Click Continue to go to the Select fixes page.
  4. Select the appropriate fix level.
  5. Click Continue to go to the Download options page.

IBM POWER9 specific udev rules (Red Hat only)

Before you install the NVIDIA components, the udev Memory Auto-Onlining Rule must be disabled for the CUDA driver to function properly. To disable it, follow these steps:

  1. Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user overridden rules.
    sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
  2. Edit the /etc/udev/rules.d/40-redhat.rules file.
    sudo nano /etc/udev/rules.d/40-redhat.rules
  3. Comment out the following line and save the change:
    SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"
  4. Optionally, delete the first line of the file, since the file was copied to a directory where it cannot be overwritten.
    # do not edit this file, it will be overwritten on update
  5. Restart the system for the changes to take effect.
    sudo reboot
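
Steps 2 and 3 above can also be done without an editor. The following is a minimal sketch, assuming GNU sed and assuming the rule sits on a single line that starts exactly as shown in step 3; disable_memory_auto_online is a hypothetical helper name.

```shell
# disable_memory_auto_online FILE: comment out the memory auto-onlining rule.
# The rule is matched by its leading keys. Lines that are already commented
# out do not match, so running the function twice is harmless.
disable_memory_auto_online() {
    sed -i '/^SUBSYSTEM=="memory", ACTION=="add"/ s/^/# /' "$1"
}

# Possible use, after copying the file as in step 1:
#   sudo sh -c "$(declare -f disable_memory_auto_online 2>/dev/null);
#               disable_memory_auto_online /etc/udev/rules.d/40-redhat.rules"
# or simply run the sed command itself with sudo on that file.
```

Check the file afterwards to confirm that only the intended rule was commented out, because spacing in the shipped rules file may differ from what is assumed here.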

Install the kernel development packages

Install the kernel development packages for the currently running kernel by running the following command:

  • On Red Hat:
    sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)
  • On Ubuntu:
    sudo apt-get install linux-headers-$(uname -r)
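
For scripts that have to run on both distributions, the package names above can be derived from the distribution ID. This is a sketch only: kernel_dev_packages is a hypothetical helper name, and the ID values "rhel" and "ubuntu" are assumed to match the ID field in /etc/os-release.

```shell
# kernel_dev_packages DISTRO_ID KERNEL_RELEASE
# Print the kernel development packages that match the given kernel release.
kernel_dev_packages() {
    case "$1" in
        rhel)   echo "kernel-devel-$2 kernel-headers-$2" ;;
        ubuntu) echo "linux-headers-$2" ;;
        *) echo "unsupported distribution: $1" >&2; return 1 ;;
    esac
}

# Possible use:
#   . /etc/os-release
#   kernel_dev_packages "$ID" "$(uname -r)"
```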

Remove previously installed CUDA and NVIDIA drivers (Red Hat only)

Before installing CUDA 10, uninstall any previous installations of CUDA and NVIDIA drivers. Follow these steps:

  1. Run the following command:
    sudo yum remove libglvnd*
    Note that removing libglvnd* also uninstalls the NVIDIA drivers.
  2. Verify that the drivers were uninstalled:
    sudo yum list installed | grep cuda
    If a previous local repo is found, uninstall it:
    sudo rpm -e cuda-repo-rhel7-9-2-local-9.2.148-1.ppc64le
  3. Finally, clean the yum cache:
    sudo yum clean all
    If a message prompts you to remove the yum cache, run:
    sudo rm -rf /var/cache/yum
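
The verification in step 2 can be widened to catch driver packages as well as CUDA packages. This is a minimal sketch: leftover_gpu_packages is a hypothetical helper name, and the pattern (cuda, nvidia, libglvnd) is an assumption about how the relevant packages are named.

```shell
# leftover_gpu_packages: read "yum list installed" output on stdin and print
# the names of packages whose name mentions cuda, nvidia, or libglvnd.
leftover_gpu_packages() {
    awk '$1 ~ /cuda|nvidia|libglvnd/ { print $1 }'
}

# Possible use:
#   sudo yum list installed | leftover_gpu_packages
# No output suggests that nothing matching the pattern is still installed.
```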

CUDA, GPU driver, cuDNN, and NCCL (Red Hat only)

The Deep Learning packages require CUDA, cuDNN, and GPU driver packages from NVIDIA. See the PowerAI prerequisites for the required and recommended versions of these components.

Install the components by following these steps:

  1. Download NVIDIA CUDA 10
    • Select Operating System: Linux.
    • Select Architecture: ppc64le.
    • Select Distribution: RHEL.
    • Select Version: 7.
    • Select Installer Type: rpm (network).
    • Follow the Linux on POWER installation instructions in the CUDA Quick Start Guide, including the steps that describe how to set up the CUDA development environment by updating PATH and LD_LIBRARY_PATH.
  2. Download NVIDIA driver 410
    • Select Product Type: Tesla
    • Select Product Series: P-Series
    • Select Product: Tesla P100
    • Select Operating System: Linux POWER LE RHEL 7
    • Select CUDA Toolkit: 10.0
    • Click Search to go to the download link.
    Note: See Table 1 for supported and recommended drivers.
  3. Install CUDA and the GPU driver.
    Note: For AC922 systems, OS and system firmware updates are required before you install the latest GPU driver.
    At a high level, the installation process is:
    • Install the CUDA Base repository rpm
    • Install the GPU driver repository rpm
    • Run sudo yum install cuda to install CUDA and the GPU driver
    • Restart to activate the driver
    For more information, see the Linux POWER® installation instructions in the CUDA Quick Start Guide. It includes steps for setting up the CUDA development environment by updating PATH and LD_LIBRARY_PATH.
  4. Download NVIDIA cuDNN v7.3.1 for CUDA 10.0 (Registration in NVIDIA’s Accelerated Computing Developer Program is required).
    • cuDNN v7.3.1 Library for Linux (Power8/Power9)
  5. Download NVIDIA NCCL v2.3.5 for CUDA 10.0 (Registration in NVIDIA’s Accelerated Computing Developer Program is required).
    • NCCL 2.3.5 O/S agnostic and CUDA 10.0 and IBM Power
  6. Install the cuDNN v7.3.1 and NCCL v2.3.5 packages, then refresh the shared library cache:
    sudo tar -C /usr/local --no-same-owner -xzvf cudnn-10.0-linux-ppc64le-v7.3.1.20.tgz
    sudo tar -C /usr/local/cuda/targets/ppc64le-linux/ --no-same-owner --strip-components=1 -xvf nccl_2.3.5-5+cuda10.0_ppc64le.txz
    sudo ldconfig
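
The PATH and LD_LIBRARY_PATH updates mentioned above can be made repeatable, so that sourcing a profile file several times does not add duplicate entries. This is a sketch under assumptions: prepend_path is a hypothetical helper name, and /usr/local/cuda/bin and /usr/local/cuda/lib64 are assumed to be the CUDA locations on your system (check the CUDA Quick Start Guide for the paths that apply to your installation).

```shell
# prepend_path DIR LIST: print the colon-separated LIST with DIR in front,
# or LIST unchanged when DIR is already one of its entries.
prepend_path() {
    case ":$2:" in
        *":$1:"*) printf '%s\n' "$2" ;;
        *) printf '%s\n' "$1${2:+:$2}" ;;
    esac
}

# Possible use in ~/.bashrc (paths are assumptions, see above):
#   export PATH=$(prepend_path /usr/local/cuda/bin "$PATH")
#   export LD_LIBRARY_PATH=$(prepend_path /usr/local/cuda/lib64 "${LD_LIBRARY_PATH:-}")
```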

NVIDIA Persistence Daemon (Red Hat only)

The NVIDIA Persistence Daemon may be automatically started for POWER9 installations. Check that it is running with the following command:

systemctl status nvidia-persistenced

If it is not active, run the following command:

sudo systemctl enable nvidia-persistenced
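
The check and the follow-up command can be combined. In this sketch, ensure_persistenced is a hypothetical helper name. Note one deliberate difference from the command above: systemctl enable registers the service for the next boot, so the sketch also starts it explicitly, which is an addition and not part of the documented step.

```shell
# ensure_persistenced: if nvidia-persistenced is not active, enable it at
# boot and start it now; otherwise report that nothing needs to be done.
ensure_persistenced() {
    if systemctl is-active --quiet nvidia-persistenced; then
        echo "nvidia-persistenced is already active"
    else
        sudo systemctl enable nvidia-persistenced
        sudo systemctl start nvidia-persistenced
    fi
}

# Possible use:
#   ensure_persistenced
```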

GPU driver, docker, nvidia-docker2 (Ubuntu only)

To run PowerAI within docker containers, only the GPU driver needs to be installed on the host.

  1. Download NVIDIA driver 410.72 from http://www.nvidia.com/Download/index.aspx.
    • Select Product Type: Tesla
    • Select Product Series: P-Series
    • Select Product: Tesla P100
    • Select Operating System: Linux POWER LE Ubuntu 18.04 (If Linux POWER LE Ubuntu 18.04 is not available, click Show all Operating Systems)
    • Select CUDA Toolkit: 10.0
    • Click Search to go to the download link
  2. Install the GPU driver repository deb package and cuda-drivers.
    sudo dpkg -i nvidia-driver-local-repo-ubuntu1804-410.72_1.0-1_ppc64el.deb
    sudo apt-get update
    sudo apt-get install cuda-drivers
    
  3. Edit the nvidia-persistenced file.
    sudo systemctl edit --full nvidia-persistenced
    

    Replace the contents with the following lines:

    [Unit]
    Description=NVIDIA Persistence Daemon
    Wants=syslog.target
    
    [Service]
    Type=forking
    PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
    Restart=always
    ExecStart=/usr/bin/nvidia-persistenced --verbose
    ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
    TimeoutSec=300
    
    [Install]
    WantedBy=multi-user.target
    
  4. Set nvidia-persistenced to start at boot
    sudo systemctl enable nvidia-persistenced
    
  5. Restart your system.
  6. Install Docker. For Ubuntu platforms, a Docker runtime must be installed. If no Docker runtime is installed yet, install Docker-CE on Ubuntu.
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository "deb [arch=ppc64el] https://download.docker.com/linux/ubuntu bionic stable"
    sudo apt-get update
    sudo apt-get install docker-ce
    
  7. Install nvidia-docker 2.
    curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
    distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
      sudo tee /etc/apt/sources.list.d/nvidia-docker.list
    sudo apt-get update
    sudo apt-get install nvidia-docker2
    sudo pkill -SIGHUP dockerd
    
  8. Verify the setup.
    nvidia-docker run --rm nvidia/cuda nvidia-smi
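
The distribution string that step 7 builds decides which nvidia-docker repository list is downloaded, so it can be worth printing before it is used. This is a minimal sketch: distribution_id is a hypothetical helper name that wraps the same expression in a subshell.

```shell
# distribution_id FILE: print $ID$VERSION_ID from an os-release style file.
# The file is sourced in a subshell so the caller's variables are untouched.
distribution_id() {
    ( . "$1" && printf '%s%s\n' "$ID" "$VERSION_ID" )
}

# Possible use on the host:
#   distribution_id /etc/os-release
# On Ubuntu 18.04 the result is expected to be "ubuntu18.04".
```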

Anaconda

A number of the Deep Learning frameworks require Anaconda. Anaconda is a platform-agnostic data science distribution with a collection of 1,000+ open source packages with free community support.

Use Anaconda2 with Python 2 to run the Python 2 versions of the Deep Learning frameworks. Anaconda3 with Python 3 is required to run the Python 3 versions of the Deep Learning frameworks.

  1. Download Anaconda:
    wget https://repo.continuum.io/archive/Anaconda2-5.2.0-Linux-ppc64le.sh
  2. Install Anaconda
    bash Anaconda2-5.2.0-Linux-ppc64le.sh
    source ~/.bashrc
    1. Accept the license agreement
    2. Specify an installation location (default is $HOME/anaconda2)
    3. Set the PATH environment variable:
      • For setups that have a single Anaconda instance for multiple users, such as PowerAI Enterprise, reply no when asked whether to update the .bashrc file or .bash_profile. After the installation is complete, export the path with this command:
        export PATH=/opt/anaconda2/bin:$PATH
      • For other PowerAI users, reply yes to allow the installer to update the .bashrc file or .bash_profile. In this case, if multiple users are using the same system, each user should install Anaconda individually.
Note: Anaconda 5.3 is also supported, but it ships with Python 3.7, which is not supported. To use Anaconda 5.3, uninstall Python 3.7 and install Python 3.6.
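
Whether the active interpreter is affected by the note above can be checked from the version string. This is a sketch only: python_is_37_or_newer is a hypothetical helper name, the input is assumed to have the form printed by python --version, and replacing the interpreter with conda install python=3.6 is one possible way to follow the note, not a documented PowerAI step.

```shell
# python_is_37_or_newer "Python X.Y.Z": succeed for Python 3.7 and above.
python_is_37_or_newer() {
    version=${1#Python }      # drop the leading "Python "
    major=${version%%.*}      # text before the first dot
    rest=${version#*.}        # text after the first dot
    minor=${rest%%.*}         # text before the next dot
    [ "$major" -eq 3 ] && [ "$minor" -ge 7 ]
}

# Possible use (Python 2 prints its version on stderr, hence 2>&1):
#   if python_is_37_or_newer "$(python --version 2>&1)"; then
#       conda install python=3.6
#   fi
```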