Prerequisites for installing PowerAI Vision

Before you can install either PowerAI Vision stand-alone or PowerAI Vision with IBM Cloud Private, you must set up the operating system (Red Hat Enterprise Linux (RHEL) or Ubuntu), enable the required repositories, including the Fedora Extra Packages for Enterprise Linux (EPEL) repository on RHEL, and install the NVIDIA CUDA drivers.

Note: Neither IBM® PowerAI nor Watson Machine Learning Accelerator (WML Accelerator) is required for running PowerAI Vision.

Red Hat Enterprise Linux operating system and repository setup

  1. Enable common, optional, and extra repo channels.
    IBM POWER8:
    sudo subscription-manager repos --enable=rhel-7-for-power-le-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-le-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-le-rpms
    IBM POWER9:
    sudo subscription-manager repos --enable=rhel-7-for-power-9-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-9-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-9-rpms
    x86:
    sudo subscription-manager repos --enable=rhel-7-server-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-server-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-server-rpms
  2. Install packages needed for the installation.
    sudo yum -y install wget nano bzip2
  3. Enable the Fedora Project EPEL (Extra Packages for Enterprise Linux) repository:
    wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    sudo rpm -ihv epel-release-latest-7.noarch.rpm
  4. Load the latest kernel or do a full update:
    • Load the latest kernel:
      sudo yum update kernel kernel-devel kernel-tools kernel-tools-libs kernel-bootwrapper
      reboot
    • Do a full update:
      sudo yum update
      sudo reboot
  5. Set up nvidia-docker 2.0 to allow PowerAI Vision containers to use the NVIDIA GPUs. For instructions, see Using nvidia-docker 2.0 with RHEL 7. A rough sketch of these steps follows.
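    The linked instructions are authoritative and take precedence; what follows is only a sketch based on NVIDIA's general nvidia-docker documentation. The repository URL, the nvidia-docker2 package, and the nvidia/cuda image name are assumptions and may differ in your environment (for example, RHEL's bundled docker package may use a different runtime hook, and POWER systems may need a ppc64le variant of the CUDA image):
    # Add the nvidia-docker yum repository for this RHEL release
    distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
    curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.repo | sudo tee /etc/yum.repos.d/nvidia-docker.repo
    # Install the nvidia-docker2 runtime and restart Docker so it registers the nvidia runtime
    sudo yum install -y nvidia-docker2
    sudo systemctl restart docker
    # Run nvidia-smi inside a CUDA container to confirm the GPUs are visible from containers
    sudo docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi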

Ubuntu operating system and repository setup

  1. Install packages needed for the installation.
    sudo apt-get install -y wget nano apt-transport-https ca-certificates curl software-properties-common
  2. Load the latest kernel:
    sudo apt-get install linux-headers-$(uname -r)
    sudo reboot
  3. Alternatively, do a full update:
    sudo apt-get update
    sudo apt-get dist-upgrade
    sudo reboot

NVIDIA Components: IBM POWER9™ specific udev rules (Red Hat only)

  1. Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user-overridden rules.
    sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
  2. Edit the /etc/udev/rules.d/40-redhat.rules file.
    sudo nano /etc/udev/rules.d/40-redhat.rules
  3. Comment out the entire "Memory hotadd request" section and save the change:
    # Memory hotadd request
    #SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
    #PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"

    #ENV{.state}="online"
    #PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
    #ATTR{state}=="offline", ATTR{state}="$env{.state}"

    #LABEL="memory_hotplug_end"
  4. Optionally, delete the first line of the file. It no longer applies, because the copy in /etc/udev/rules.d/ is not overwritten on update.
    # do not edit this file, it will be overwritten on update
  5. Restart the system for the changes to take effect.
    sudo reboot
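    After the reboot, an optional, illustrative check (not part of the official procedure) is to confirm that every line of the memory hotplug section in the override file now starts with a # character:
    # Print the "Memory hotadd request" section of the override file
    grep -A 8 "Memory hotadd request" /etc/udev/rules.d/40-redhat.rules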

Remove previously installed CUDA and NVIDIA drivers

Before installing the updated GPU driver, uninstall any previously installed CUDA and NVIDIA drivers. Follow these steps:

  1. Remove all CUDA Toolkit and GPU driver packages.

    You can display installed CUDA and driver packages by running these commands:

    rpm -qa | egrep 'cuda.*(9-2|10-0)'
    rpm -qa | egrep '(cuda|nvidia).*(396|410)\.'

    Verify the list, then remove the packages with yum remove (see the sketch after this list).

  2. Remove any CUDA Toolkit and GPU driver repository packages.

    These should have been included in step 1, but you can confirm with this command:

    rpm -qa | egrep '(cuda|nvidia).*repo'

    Use yum remove to remove any that remain.

  3. Clean the yum repository:
    sudo yum clean all
  4. Remove the CUDA Toolkit directories, which also removes any cuDNN and NCCL libraries installed into them:
    sudo rm -rf /usr/local/cuda /usr/local/cuda-9.2 /usr/local/cuda-10.0
  5. Reboot the system to unload the GPU driver:
    sudo shutdown -r now
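
As a convenience, the package queries from step 1 can be fed straight into yum remove. This is only a sketch: the version patterns (CUDA 9.2/10.0, driver 396/410) are the ones shown above and may not match your system, so review the matched packages before removing anything.

    # Remove every package matched by the step 1 queries (yum prompts for confirmation)
    sudo yum remove $(rpm -qa | egrep 'cuda.*(9-2|10-0)') $(rpm -qa | egrep '(cuda|nvidia).*(396|410)\.')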

Install the GPU driver (Red Hat)

Install the driver by following these steps:

  1. Download the NVIDIA GPU driver:
    • Go to NVIDIA Driver Download.
    • Select Product Type: Tesla
    • Select Product Series: P-Series (for Tesla P100) or V-Series (for Tesla V100).
    • Select Product: Tesla P100 or Tesla V100
    • Select Operating System: Linux POWER LE RHEL 7 for POWER or Linux 64-bit RHEL 7 for x86, depending on your cluster architecture. Click Show all Operating Systems if your version is not available.
    • Select CUDA Toolkit: 10.1
    • Click SEARCH to go to the download link.
    • Click Download to download the driver.
  2. Install the GPU driver repository package and cuda-drivers.
    Note: For AC922 systems, OS and system firmware updates are required before you install the latest GPU driver.
    sudo rpm -ivh nvidia*driver-local-repo-rhel7-418.*.rpm
    sudo yum install cuda-drivers
  3. Set nvidia-persistenced to start at boot:
    sudo systemctl enable nvidia-persistenced
  4. Restart to activate the driver.
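    To restart, sudo reboot can be used as elsewhere in this document. After the system comes back up, an optional check (not part of the official steps) confirms that the driver module loaded and the persistence daemon is running:
    sudo reboot
    # After the reboot, verify the kernel module and the persistence daemon
    lsmod | grep nvidia
    systemctl status nvidia-persistenced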

Install the GPU driver (Ubuntu)

The Deep Learning packages require the GPU driver packages from NVIDIA.

Install the GPU driver by following these steps:

  1. Download the NVIDIA GPU driver.
    • Go to NVIDIA Driver Download.
    • Select Product Type: Tesla
    • Select Product Series: V-Series
    • Select Product: Tesla V100
    • Select Operating System: Linux POWER LE Ubuntu 18.04 for POWER or Linux 64-bit Ubuntu 18.04 for x86, depending on your cluster architecture. Click Show all Operating Systems if your version is not available.
    • Select CUDA Toolkit: 10.1
    • Click SEARCH to go to the download link.
    • Click Download to download the driver.
  2. Ensure the kernel headers are installed and match the running kernel. Compare the outputs of:
    $ dpkg -l | grep linux-headers
    and
    $ uname -r
    Ensure that a linux-headers package version exactly matches the version of the running kernel. If they are not identical, bring them in sync as appropriate:
    • Install the missing headers package, for example with sudo apt-get install linux-headers-$(uname -r).
    • Update downlevel packages.
    • Reboot the system if the packages are newer than the active kernel.
  3. Install the GPU driver repository and cuda-drivers:
    sudo dpkg -i nvidia*driver-local-repo-ubuntu1804-418.*.deb
    sudo apt-get update
    sudo apt-get install cuda-drivers
  4. Set nvidia-persistenced to start at boot:
    sudo systemctl enable nvidia-persistenced
  5. Reboot the system
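    As with the Red Hat steps, sudo reboot can be used. After the reboot, an optional, illustrative check lists the NVIDIA driver packages that were installed:
    sudo reboot
    # After the reboot, list the installed NVIDIA and cuda-drivers packages
    dpkg -l | grep -Ei 'nvidia|cuda-drivers'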

Verify the GPU driver

Verify that the CUDA drivers are installed by running the /usr/bin/nvidia-smi application.

Example output
# nvidia-smi
Fri Mar 15 12:23:50 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.29       Driver Version: 418.29       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-SXM2...  On   | 00000002:01:00.0 Off |                    0 |
| N/A   50C    P0   109W / 300W |   2618MiB / 16280MiB |     43%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P100-SXM2...  On   | 00000003:01:00.0 Off |                    0 |
| N/A   34C    P0    34W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P100-SXM2...  On   | 0000000A:01:00.0 Off |                    0 |
| N/A   48C    P0    44W / 300W |   5007MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P100-SXM2...  On   | 0000000B:01:00.0 Off |                    0 |
| N/A   36C    P0    33W / 300W |      0MiB / 16280MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    114476      C   /opt/miniconda2/bin/python                  2608MiB |
|    2    114497      C   /opt/miniconda2/bin/python                   958MiB |
|    2    114519      C   /opt/miniconda2/bin/python                   958MiB |
|    2    116655      C   /opt/miniconda2/bin/python                  2121MiB |
|    2    116656      C   /opt/miniconda2/bin/python                   958MiB |
+-----------------------------------------------------------------------------+
For help understanding the output, see Checking system GPU status.
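
If you want a compact, scriptable check instead of the full table, nvidia-smi also supports a CSV query mode. The fields below are just an example; adjust them to what you need:

    # One line per GPU: name, driver version, current utilization, and memory in use
    nvidia-smi --query-gpu=name,driver_version,utilization.gpu,memory.used --format=csv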

Install docker and nvidia-docker 2

Use these steps to install docker and nvidia-docker 2.

  1. For Ubuntu platforms, a Docker runtime must be installed. If one is not installed yet, install Docker CE on Ubuntu:
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
    sudo add-apt-repository "deb [arch=ppc64el] https://download.docker.com/linux/ubuntu bionic stable"
    sudo apt-get update
    sudo apt-get install docker-ce=18.06.1~ce~3-0~ubuntu
Note: The nvidia-docker run command must be used with docker-ce (in other words, an Ubuntu host) to leverage the GPUs from within a container.
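
Once nvidia-docker 2 is installed, a simple way to confirm that containers can reach the GPUs is to run nvidia-smi inside a CUDA base image, for example:

    # Should print the same GPU table that nvidia-smi shows on the host
    sudo nvidia-docker run --rm nvidia/cuda nvidia-smi

The nvidia/cuda image name is an assumption; on POWER systems a ppc64le variant of the image may be required.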