Setting up Red Hat Enterprise Linux
Follow these steps to set up your system with Red Hat Enterprise Linux.
Upgrade to a supported version of Red Hat Enterprise Linux
- For POWER8® and x86 systems, the supported version is 7.7.
- For POWER9™ systems, the supported version is 7.6. Use the instructions below to upgrade.
Red Hat Enterprise Linux 7.5 is no longer supported. If you have Red Hat Enterprise Linux 7.5 installed, upgrade by following these instructions:
subscription-manager release --unset
yum clean all
yum update -y
reboot
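Before upgrading, it can help to confirm which release is installed. The following is a minimal sketch, not part of the official procedure. It assumes that /etc/redhat-release contains a string of the form "... release 7.5 (...)". The function names and the platform labels (power8, power9, x86) are hypothetical names chosen for this example.

```shell
#!/bin/sh
# Extract "major.minor" from a release string such as
# "Red Hat Enterprise Linux Server release 7.5 (Maipo)".
rhel_version() {
    printf '%s\n' "$1" | sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Print the supported release for a platform label (labels are hypothetical;
# the version numbers come from the list above).
supported_version() {
    case "$1" in
        power8|x86) echo 7.7 ;;
        power9)     echo 7.6 ;;
        *)          return 1 ;;
    esac
}

platform=power9   # assumption: set this to match the system
if [ -r /etc/redhat-release ]; then
    current=$(rhel_version "$(cat /etc/redhat-release)")
    echo "installed: ${current:-unknown}, supported: $(supported_version "$platform")"
fi
```

If the two values differ, follow the upgrade instructions above.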
Red Hat Enterprise Linux operating system and repository setup
- Enable the common, optional, and extra repo channels.
IBM® POWER8:
sudo subscription-manager repos --enable=rhel-7-for-power-le-optional-rpms
sudo subscription-manager repos --enable=rhel-7-for-power-le-extras-rpms
sudo subscription-manager repos --enable=rhel-7-for-power-le-rpms
IBM POWER9:
sudo subscription-manager repos --enable=rhel-7-for-power-9-optional-rpms
sudo subscription-manager repos --enable=rhel-7-for-power-9-extras-rpms
sudo subscription-manager repos --enable=rhel-7-for-power-9-rpms
- Install the packages needed for the installation:
sudo yum -y install wget nano bzip2
- Enable the Fedora Project Extra Packages for Enterprise Linux (EPEL) repository:
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo rpm -ihv epel-release-latest-7.noarch.rpm
- Load the latest kernel or do a full update:
  - Load the latest kernel:
  sudo yum install kernel-devel
  sudo yum update kernel kernel-devel kernel-tools kernel-tools-libs kernel-bootwrapper
  reboot
  - Do a full update:
  sudo yum install kernel-devel
  sudo yum update
  sudo reboot
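The repository names in the first step of this list differ only in a platform prefix, so the enable commands can be generated rather than typed. This is a sketch under that assumption. The function name repo_ids and the labels power8 and power9 are hypothetical. The loop only prints the commands so that they can be reviewed before being run.

```shell
#!/bin/sh
# Print the three repository IDs for a platform label ("power8" or "power9").
# The IDs are the ones listed in the repository step above.
repo_ids() {
    case "$1" in
        power8) prefix=rhel-7-for-power-le ;;
        power9) prefix=rhel-7-for-power-9 ;;
        *) echo "unknown platform: $1" >&2; return 1 ;;
    esac
    for suffix in optional-rpms extras-rpms rpms; do
        echo "$prefix-$suffix"
    done
}

# Dry run: print the commands instead of executing them.
for repo in $(repo_ids power9); do
    echo "sudo subscription-manager repos --enable=$repo"
done
```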
System firmware
If you are running on an AC922 system, you need to update the firmware. Ensure that the system firmware is updated to at least the following levels before you install the current NVIDIA GPU driver.
The firmware series and fix levels that are required for AC922 for the current NVIDIA GPU driver are:
- 8335-GTG: OP910.30 or higher
- 8335-GTH: OP920.10 or higher
System firmware updates are available at Fix Central. To find your updates in Fix Central, follow these steps:
- Enter 8335-GTG or 8335-GTH as the Product Selector.
- Select the appropriate firmware series from the drop-down list.
- Click Continue to go to the Select fixes page.
- Select the appropriate fix level.
- Click Continue to go to the Download options page.
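Whether an installed firmware level meets the minimum listed above can be checked with a version-aware string comparison. The sketch below assumes GNU sort (for the -V option) and takes the installed level as a plain string. How that string is obtained is outside the scope of this example, and the value OP910.20 is a hypothetical placeholder. The function names are also hypothetical.

```shell
#!/bin/sh
# Succeed when firmware level $1 is at or above the required level $2.
# Both arguments are strings such as "OP910.30"; relies on GNU sort -V.
fw_at_least() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum level for a machine type, taken from the list above.
required_fw() {
    case "$1" in
        8335-GTG) echo OP910.30 ;;
        8335-GTH) echo OP920.10 ;;
        *) return 1 ;;
    esac
}

installed=OP910.20   # hypothetical value; replace with the real level
if fw_at_least "$installed" "$(required_fw 8335-GTG)"; then
    echo "firmware level is sufficient"
else
    echo "firmware update required"
fi
```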
IBM POWER9 specific udev rules
To disable the udev rule that handles memory hot-add requests, follow these steps:
- Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user-overridden rules:
sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
- Edit the /etc/udev/rules.d/40-redhat.rules file:
sudo nano /etc/udev/rules.d/40-redhat.rules
- Comment out the entire "Memory hotadd request" section and save the change:
# Memory hotadd request
#SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
#PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
#ENV{.state}="online"
#PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
#ATTR{state}=="offline", ATTR{state}="$env{.state}"
#LABEL="memory_hotplug_end"
- Optionally, delete the first line of the file, since the file was copied to a directory where it cannot be overwritten:
# do not edit this file, it will be overwritten on update
- Restart the system for the changes to take effect:
sudo reboot
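As an alternative to editing the copied file by hand, the comment-out step could be previewed with sed. This is only a sketch. It assumes the section starts at the "# Memory hotadd request" line and ends at the LABEL="memory_hotplug_end" line, as in the listing above. The function name comment_out_hotadd is hypothetical. The function writes to standard output and does not modify the file.

```shell
#!/bin/sh
# Prefix "#" to every active line from the "# Memory hotadd request" header
# through the LABEL="memory_hotplug_end" line; print the result to stdout.
# Lines that are already comments, and blank lines, are left unchanged.
comment_out_hotadd() {
    sed '/^# Memory hotadd request/,/^LABEL="memory_hotplug_end"/ s/^\([^#]\)/#\1/' "$1"
}

# Preview the change as a diff, only where the copied file exists.
rules=/etc/udev/rules.d/40-redhat.rules
if [ -r "$rules" ]; then
    comment_out_hotadd "$rules" | diff "$rules" - || true
fi
```

Review the diff before applying anything. If the closing LABEL line is missing or already commented out, the sed range runs to the end of the file and comments out every active line after the header.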
Remove previously installed CUDA and NVIDIA drivers
The CUDA Toolkit, cuDNN, and NCCL are provided as Conda packages and do not require separate installations. The GPU driver must still be installed separately.
Before installing the updated GPU driver, uninstall any previously installed CUDA and NVIDIA drivers. Follow these steps:
- Remove all CUDA Toolkit and GPU driver packages.
You can display installed CUDA and driver packages by running these commands:
rpm -qa | egrep 'cuda.*(9-2|10-0)'
rpm -qa | egrep '(cuda|nvidia).*(396|410)\.'
Verify the list and remove with yum remove.
- Remove any CUDA Toolkit and GPU driver repository packages.
These should have been included in step 1, but you can confirm with this command:
rpm -qa | egrep '(cuda|nvidia).*repo'
Use yum remove to remove any that remain.
- Clean the yum repository:
sudo yum clean all
- Remove cuDNN and NCCL:
sudo rm -rf /usr/local/cuda /usr/local/cuda-9.2 /usr/local/cuda-10.0
- Reboot the system to unload the GPU driver:
sudo shutdown -r now
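The three rpm queries in the steps above can be combined into a single filter to produce one list for review. The sketch below reuses the same patterns. The function name select_old_gpu_packages is hypothetical. It only lists packages. Removal is still done with yum remove after the list has been checked.

```shell
#!/bin/sh
# Read package names on stdin and keep those matched by any of the
# three egrep patterns used in the steps above.
select_old_gpu_packages() {
    grep -E 'cuda.*(9-2|10-0)|(cuda|nvidia).*(396|410)\.|(cuda|nvidia).*repo'
}

# Dry run: show the candidates for removal, only where rpm is available.
if command -v rpm >/dev/null 2>&1; then
    rpm -qa | select_old_gpu_packages | sort || true
fi
```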
Install the GPU driver
Many of the deep learning packages require the GPU driver packages to be downloaded from NVIDIA. See the WML CE prerequisites for the required and recommended versions of these components.
Install the GPU driver by following these steps:
- Download the NVIDIA GPU driver:
- Go to NVIDIA Driver Download.
- Select Product Type: Tesla.
- Select Product Series: P-Series (for Tesla P100) or V-Series (for Tesla V100).
- Select Product: Tesla P100 or Tesla V100.
- Select Operating System: Linux POWER LE RHEL 7. Click Show all Operating Systems if your version is not available.
- Select CUDA Toolkit: 10.1.
- Click SEARCH to go to the download link.
- Click Download to download the driver.
- Install the GPU driver repository and cuda-drivers:
sudo rpm -ivh nvidia-driver-local-repo-rhel7-418.*.rpm
sudo yum install nvidia-driver-latest-dkms
- Set nvidia-persistenced to start at boot:
sudo systemctl enable nvidia-persistenced
- Reboot the system.
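After the reboot, a quick check can confirm that the driver is loaded and that the persistence service is enabled. This is a minimal sketch, not part of the official procedure. The helper name require_cmd is hypothetical. nvidia-smi is the status utility installed with the NVIDIA driver, and systemctl is-enabled reports whether a unit starts at boot.

```shell
#!/bin/sh
# Succeed when the named command is on PATH; otherwise report and fail.
require_cmd() {
    command -v "$1" >/dev/null 2>&1 || { echo "$1 not found" >&2; return 1; }
}

# Show driver and GPU status if the driver utilities are installed.
if require_cmd nvidia-smi; then
    nvidia-smi || true
fi

# Show whether nvidia-persistenced is set to start at boot.
if require_cmd systemctl; then
    systemctl is-enabled nvidia-persistenced || true
fi
```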
Installing Mellanox drivers
To use InfiniBand with IBM Distributed Deep Learning and SnapML, install the latest Mellanox driver from the Mellanox IBM Systems and Storage page.
Installing Perl
sudo yum install perl