PowerAI system setup

Find information to set up your operating system, repository, and NVIDIA components.

Operating system
Red Hat Enterprise Linux operating system and repository setup
Ubuntu operating system and repository setup
System firmware
IBM POWER9 specific udev rules (Red Hat only)
Install the kernel development packages
Remove previously installed CUDA and NVIDIA drivers (Red Hat only)
CUDA, GPU driver, cuDNN, and NCCL (Red Hat only)
NVIDIA Persistence Daemon (Red Hat only)
GPU driver, docker, nvidia-docker2 (Ubuntu only)
Anaconda

For AC922 systems that use the latest NVIDIA GPU driver, the GPU driver requires other updates that must be installed in a specific order:

Latest Linux kernel for RHEL 7.5 ALT
Recent AC922 system firmware:
- 8335-GTG: OP910.24
- 8335-GTH: OP920.02
NVIDIA GPU driver 410.72 or higher

You can also run PowerAI in a container on a bare metal system that is running Ubuntu 18.04.
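
As a minimal sketch of checking the 410.72 driver minimum, the helper below compares two version strings. The function name `version_ge` is hypothetical, and it assumes GNU `sort` with the `-V` option is available. The `nvidia-smi` query in the comment only works after a driver is installed.

```shell
# Return success (0) when version $1 is greater than or equal to version $2.
# Relies on GNU sort's -V (version sort) option.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical usage on a host that already has a driver installed:
#   installed=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
#   version_ge "$installed" 410.72 || echo "GPU driver $installed is older than 410.72"
```

Version sort is used rather than a plain string comparison so that a level such as 410.104 is treated as newer than 410.72.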

Operating system

The Deep Learning packages require specific operating systems:

Red Hat Enterprise Linux (RHEL) 7.5 little endian for IBM® POWER8® and IBM POWER9™

PowerAI can be installed and run directly on a bare-metal RHEL 7.5 system
PowerAI can also be run from a container on a RHEL 7.5 system. For more information about setting up a container to run PowerAI, see Using nvidia-docker 2.0 with RHEL 7.
The RHEL installation image and license must be acquired from Red Hat

Ubuntu 18.04 LTS for IBM Power

PowerAI must be run in a container when running on a bare-metal Ubuntu 18.04 system
The Ubuntu installation image can be downloaded from Ubuntu

Table 1. Supported configurations
Host OS	Container OS
Red Hat Enterprise Linux 7.5	Ubuntu 18.04
Ubuntu 18.04	Ubuntu 18.04
Red Hat Enterprise Linux 7.5	none (Bare metal)

For more information about installing operating systems on IBM Power Systems servers, see Quick start guides for Linux on IBM® Power System servers.

Red Hat Enterprise Linux operating system and repository setup

Enable common, optional, and extra repo channels.

IBM POWER8:

sudo subscription-manager repos --enable=rhel-7-for-power-le-optional-rpms

sudo subscription-manager repos --enable=rhel-7-for-power-le-extras-rpms

sudo subscription-manager repos --enable=rhel-7-for-power-le-rpms

IBM POWER9:

sudo subscription-manager repos --enable=rhel-7-for-power-9-optional-rpms

sudo subscription-manager repos --enable=rhel-7-for-power-9-extras-rpms

sudo subscription-manager repos --enable=rhel-7-for-power-9-rpms
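
The choice between the POWER8 and POWER9 channel names can be scripted. The following is a sketch only: the function names are hypothetical, and it assumes the processor model appears in the `cpu` line of `/proc/cpuinfo` as a string containing `POWER8` or `POWER9`. It prints the commands instead of running them so they can be reviewed first.

```shell
# Map a CPU model string to the RHEL 7 repo name prefix used in the commands above.
repo_prefix_for_cpu() {
  case "$1" in
    *POWER9*) echo "rhel-7-for-power-9" ;;
    *POWER8*) echo "rhel-7-for-power-le" ;;
    *) echo "unsupported CPU: $1" >&2; return 1 ;;
  esac
}

# Print (rather than run) the three subscription-manager commands for a CPU model.
print_repo_commands() {
  prefix=$(repo_prefix_for_cpu "$1") || return 1
  for channel in optional-rpms extras-rpms rpms; do
    echo "sudo subscription-manager repos --enable=${prefix}-${channel}"
  done
}

# Hypothetical usage: read the model from /proc/cpuinfo on the target host.
#   print_repo_commands "$(grep -m 1 '^cpu' /proc/cpuinfo)"
```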

Install packages needed for the installation.
```
sudo yum -y install wget nano bzip2
```

Enable Fedora Project EPEL (Extra Packages for Enterprise Linux) repo:

wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

sudo rpm -ihv epel-release-latest-7.noarch.rpm

Load the latest kernel or do a full update:
- Load the latest kernel:
```
sudo yum update kernel kernel-tools kernel-tools-libs kernel-bootwrapper
```
```
sudo reboot
```
- Do a full update:
```
sudo yum update
```
```
sudo reboot
```
  Important: RHEL 7.6 was released at the end of October 2018, but it is not yet supported by PowerAI. A plain yum update might upgrade a 7.5 system to 7.6. To avoid this, customers with a standard RHEL subscription can set the release to 7.5:
```
sudo subscription-manager release --set=7.5
```
  Customers should consult Red Hat if they are unsure how to avoid an unintended upgrade.
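
After an update, it can be worth confirming that the system is still on 7.5. The sketch below parses the text of `/etc/redhat-release`. The function names are hypothetical, and the expected file format ("... release 7.5 (Maipo)") is an assumption based on typical RHEL 7 systems.

```shell
# Extract the "major.minor" version from the text of /etc/redhat-release.
rhel_version() {
  printf '%s\n' "$1" | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Succeed only when the release text reports RHEL 7.5.
is_rhel_75() {
  [ "$(rhel_version "$1")" = "7.5" ]
}

# Hypothetical usage on the target host:
#   is_rhel_75 "$(cat /etc/redhat-release)" || echo "Not on RHEL 7.5; check the release setting"
#   sudo subscription-manager release --show
```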

Ubuntu operating system and repository setup

Install packages needed for the installation

sudo apt-get install -y wget nano apt-transport-https ca-certificates curl software-properties-common

Load the latest kernel

sudo apt-get install linux-headers-$(uname -r)
sudo reboot

Or do a full update

sudo apt-get update
sudo apt-get dist-upgrade
sudo reboot

System firmware

If you are running on an AC922 system, you need to update the firmware. Ensure that the system firmware is updated to at least the following levels before you install the current NVIDIA GPU driver.

The firmware series and fix levels that are required for AC922 for the current NVIDIA GPU driver are:

8335-GTG: OP910.24 or higher
8335-GTH: OP920.02 or higher

System firmware updates are available at Fix Central. To find your updates in Fix Central, follow these steps:

Enter 8335-GTG or 8335-GTH as the Product Selector.
Select the appropriate firmware series from the drop-down list.
Click Continue to go to the Select fixes page.
Select the appropriate fix level.
Click Continue to go to the Download options page.

IBM POWER9 specific udev rules (Red Hat only)

Before you install the NVIDIA components, the udev Memory Auto-Onlining Rule must be disabled for the CUDA driver to function properly. To disable it, follow these steps:

Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user overridden rules.
```
sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
```
Edit the /etc/udev/rules.d/40-redhat.rules file.
```
sudo nano /etc/udev/rules.d/40-redhat.rules
```

Comment out the following line and save the change:

SUBSYSTEM=="memory", ACTION=="add", PROGRAM="/bin/uname -p", RESULT!="s390*", ATTR{state}=="offline", ATTR{state}="online"

Optionally, delete the first line of the file, since the file was copied to a directory where it cannot be overwritten.
```
# do not edit this file, it will be overwritten on update
```
Restart the system for the changes to take effect.
```
sudo reboot
```
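
The manual edit of the copied rules file can also be done with `sed`. This is a sketch under the assumption that the rule starts at the beginning of a line with `SUBSYSTEM=="memory", ACTION=="add"`, as quoted in the steps above. The function name is hypothetical, and it takes the file path as an argument so that it can be tried on a scratch copy first.

```shell
# Comment out the memory auto-onlining rule in the given rules file (edits in place).
# Lines that are already commented out do not match, so running it twice is harmless.
disable_memory_auto_online() {
  sed -i '/^SUBSYSTEM=="memory", ACTION=="add"/ s/^/# /' "$1"
}

# Equivalent one-off command for the copied file (needs root):
#   sudo sed -i '/^SUBSYSTEM=="memory", ACTION=="add"/ s/^/# /' /etc/udev/rules.d/40-redhat.rules
```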

Install the kernel development packages

Install the kernel development packages for the currently running kernel by running the following command:

On Red Hat:

sudo yum install kernel-devel-$(uname -r) kernel-headers-$(uname -r)

On Ubuntu:

sudo apt-get install linux-headers-$(uname -r)

Remove previously installed CUDA and NVIDIA drivers (Red Hat only)

Before installing CUDA 10, uninstall any previous installations of CUDA and NVIDIA drivers. Follow these steps:

Run the following command:
```
sudo yum remove libglvnd*
```
Note that removing libglvnd* also uninstalls the NVIDIA drivers.

Verify that the drivers were uninstalled:

sudo yum list installed | grep cuda

If a previous local repo is found, uninstall it:

sudo rpm -e cuda-repo-rhel7-9-2-local-9.2.148-1.ppc64le

Finally, run:
```
sudo yum clean all
```
If you get a message to remove the yum cache, run:
```
sudo rm -rf /var/cache/yum
```
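
The local repo package name in the steps above is specific to CUDA 9.2, and a different version may be installed on a given system. As a sketch, the hypothetical helper below filters a package list for names that start with `cuda-repo-`. That prefix is an assumption taken from the example package name.

```shell
# Read package names on stdin (for example from `rpm -qa`) and print
# any local CUDA repo packages. Prints nothing when none are found.
find_cuda_repo_packages() {
  grep '^cuda-repo-' || true
}

# Hypothetical usage on the target host (review the list before removing anything):
#   rpm -qa | find_cuda_repo_packages
#   rpm -qa | find_cuda_repo_packages | xargs -r sudo rpm -e
```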

CUDA, GPU driver, cuDNN, and NCCL (Red Hat only)

The Deep Learning packages require CUDA, cuDNN, and GPU driver packages from NVIDIA. See the PowerAI prerequisites for the required and recommended versions of these components.

Install the components by following these steps:

Download NVIDIA CUDA 10
- Select Operating System: Linux.
- Select Architecture: ppc64le.
- Select Distribution: RHEL.
- Select Version: 7.
- Select Installer Type: rpm (network).
- Follow the Linux on POWER installation instructions in the CUDA Quick Start Guide, including the steps that describe how to set up the CUDA development environment by updating PATH and LD_LIBRARY_PATH.
Download NVIDIA driver 410
- Select Product Type: Tesla
- Select Product Series: P-Series
- Select Product: Tesla P100
- Select Operating System: Linux POWER LE RHEL 7
- Select CUDA Toolkit: 10.0
- Click Search to go to the download link.
Note: See Table 1 for supported and recommended drivers.
Install CUDA and the GPU driver.
Note: For AC922 systems, OS and system firmware updates are required before you install the latest GPU driver.
At a high level, the installation process is:
- Install the CUDA Base repository rpm
- Install the GPU driver repository rpm
- Run sudo yum install cuda to install CUDA and the GPU driver
- Restart to activate the driver
For more information, see the Linux POWER® installation instructions in the CUDA Quick Start Guide. It includes steps for setting up the CUDA development environment by updating PATH and LD_LIBRARY_PATH.
Download NVIDIA cuDNN v7.3.1 for CUDA 10.0 (Registration in NVIDIA’s Accelerated Computing Developer Program is required).
- cuDNN v7.3.1 Library for Linux (Power8/Power9)
Download NVIDIA NCCL v2.3.5 for CUDA 10.0 (Registration in NVIDIA’s Accelerated Computing Developer Program is required).
- NCCL 2.3.5 O/S agnostic and CUDA 10.0 and IBM Power

Install the cuDNN v7.3.1 and NCCL v2.3.5 packages, then refresh the shared library cache.

sudo tar -C /usr/local --no-same-owner -xzvf cudnn-10.0-linux-ppc64le-v7.3.1.20.tgz

sudo tar -C /usr/local/cuda/targets/ppc64le-linux/ --no-same-owner --strip-components=1 -xvf nccl_2.3.5-5+cuda10.0_ppc64le.txz

sudo ldconfig
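
The PATH and LD_LIBRARY_PATH updates mentioned in the CUDA steps usually look like the following sketch. The install prefix `/usr/local/cuda-10.0` is an assumption (the usual location for a CUDA 10.0 rpm installation); the CUDA Quick Start Guide remains the authoritative source for the exact lines.

```shell
# Assumed install prefix; adjust if CUDA was installed somewhere else.
CUDA_HOME=/usr/local/cuda-10.0

# Prepend the CUDA directories, avoiding a stray ":" when a variable starts out empty.
export PATH="${CUDA_HOME}/bin${PATH:+:${PATH}}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
```

To make the settings persistent, the same lines can be added to a shell startup file such as `~/.bashrc`.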

NVIDIA Persistence Daemon (Red Hat only)

The NVIDIA Persistence Daemon may be automatically started for POWER9 installations. Check that it is running with the following command:

systemctl status nvidia-persistenced

If it is not active, run the following command:

sudo systemctl enable nvidia-persistenced
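
Note that `systemctl enable` configures the service to start at boot but does not start it in the current session. The sketch below combines the check with both an enable and a start. The function name `ensure_persistenced` is hypothetical, and the added `systemctl start` step is a suggestion that is not part of the original procedure.

```shell
# Enable nvidia-persistenced at boot and start it now if it is not already active.
ensure_persistenced() {
  if systemctl is-active --quiet nvidia-persistenced; then
    echo "nvidia-persistenced is already active"
  else
    sudo systemctl enable nvidia-persistenced
    sudo systemctl start nvidia-persistenced
  fi
}
```

Defining the function changes nothing on the system; call `ensure_persistenced` once the GPU driver is installed.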

GPU driver, docker, nvidia-docker2 (Ubuntu only)

To run PowerAI within docker containers, only the GPU driver needs to be installed on the host.

Download NVIDIA driver 410.72 from http://www.nvidia.com/Download/index.aspx.
- Select Product Type: Tesla
- Select Product Series: P-Series
- Select Product: Tesla P100
- Select Operating System: Linux POWER LE Ubuntu 18.04 (If Linux POWER LE Ubuntu 18.04 is not available, click Show all Operating Systems)
- Select CUDA Toolkit: 10.0
- Click Search to go to the download link

Install the GPU driver repository deb package and cuda-drivers.

sudo dpkg -i nvidia-driver-local-repo-ubuntu1804-410.72_1.0-1_ppc64el.deb
sudo apt-get update
sudo apt-get install cuda-drivers

Edit the nvidia-persistenced service unit file.

sudo systemctl edit --full nvidia-persistenced

Replace the contents with the following lines:

[Unit]
Description=NVIDIA Persistence Daemon
Wants=syslog.target

[Service]
Type=forking
PIDFile=/var/run/nvidia-persistenced/nvidia-persistenced.pid
Restart=always
ExecStart=/usr/bin/nvidia-persistenced --verbose
ExecStopPost=/bin/rm -rf /var/run/nvidia-persistenced
TimeoutSec=300

[Install]
WantedBy=multi-user.target

Set nvidia-persistenced to start at boot

sudo systemctl enable nvidia-persistenced

Restart your system.

Install Docker. For Ubuntu platforms, a Docker runtime must be installed. If no Docker runtime is installed yet, install Docker-CE on Ubuntu.

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository "deb [arch=ppc64el] https://download.docker.com/linux/ubuntu bionic stable"
sudo apt-get update
sudo apt-get install docker-ce

Install nvidia-docker 2.

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update
sudo apt-get install nvidia-docker2
sudo pkill -SIGHUP dockerd

Verify the setup.

nvidia-docker run --rm nvidia/cuda nvidia-smi

Anaconda

A number of the Deep Learning frameworks require Anaconda. Anaconda is a platform-agnostic data science distribution with a collection of 1,000+ open source packages with free community support.

Use Anaconda2 with Python 2 to run the Python 2 versions of the Deep Learning frameworks. Anaconda3 with Python 3 is required to run the Python 3 versions of the Deep Learning frameworks.

Anaconda2, version 5.2.0
md5sum: 479633a95906ea6d41056ebe84a4c47b
Anaconda3, version 5.2.0
md5sum: cbd1d5435ead2b0b97dba5b3cf45d694

Download Anaconda:

wget https://repo.continuum.io/archive/Anaconda2-5.2.0-Linux-ppc64le.sh
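
Before running the installer, the download can be checked against the md5sum values listed above. This is a minimal sketch with a hypothetical helper name, `verify_md5`. It assumes the `md5sum` and `awk` utilities are available, and that the listed checksum applies to the ppc64le installer file.

```shell
# Succeed only when the MD5 checksum of file $1 equals the expected value $2.
verify_md5() {
  actual=$(md5sum "$1" | awk '{print $1}')
  [ "$actual" = "$2" ]
}

# Usage with the Anaconda2 5.2.0 checksum listed above:
#   verify_md5 Anaconda2-5.2.0-Linux-ppc64le.sh 479633a95906ea6d41056ebe84a4c47b \
#     && echo "checksum OK" || echo "checksum mismatch; download the file again"
```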

Install Anaconda
```
bash Anaconda2-5.2.0-Linux-ppc64le.sh
```
```
source ~/.bashrc
```
1. Accept the license agreement
2. Specify an installation location (default is $HOME/anaconda2)
3. Set the PATH environment variable:
  - For setups that have a single Anaconda instance for multiple users, such as PowerAI Enterprise, reply no to update the .bashrc file or .bash_profile. After the installation is complete, export the path with this command:
```
export PATH=/opt/anaconda2/bin:$PATH
```
  - For other PowerAI users, reply yes to allow the installer to update the .bashrc file or .bash_profile. In this case, if multiple users are using the same system, each user should install Anaconda individually.

Note: Anaconda 5.3 is also supported, but it ships with Python 3.7, which PowerAI does not support. To use Anaconda 5.3, uninstall Python 3.7 and install Python 3.6.