Set up your system
Before installing IBM Watson Machine Learning Accelerator, perform the following setup on your system.
- Prerequisites
- Install the operating system
- Open necessary ports
- Ensure user access of client machines to cluster hosts
- Set the appropriate heap size
- Log in with root permission
- Create cluster administration accounts
- Mount a shared file system
- Install utilities and packages
- Red Hat Enterprise Linux operating system and repository setup
- IBM POWER9 specific udev rules
- Install the kernel development packages
- Remove previously installed CUDA and NVIDIA drivers
- Install the GPU driver
- Configure GPU mode to exclusive across all nodes
- Configure the required limits for the maximum number of processes and the maximum number of open files
- Set the vm.max_map_count kernel value
Prerequisites
- Ensure that you have root access to all hosts running deep learning workloads.
- Ensure that all hardware and software requirements are met: Hardware and software requirements.
- Use fully qualified domain names (FQDN) for all hosts in your cluster. Host names must be registered with a valid domain name server (DNS), so that the IP address can be resolved from the domain name and the domain name can be found from the IP address. To confirm the host names in your cluster, run hostname -f and getent hosts [ip_address]. The host names that are returned by these commands must match your cluster configuration.
- All hosts in the cluster must use the same clock setting.
- Python 2.7 must be installed on all hosts.
- OpenSSL 1.0.1 or later must be installed on all hosts.
- All hosts require the gettext library to provide globalization support for translated product messages. Without this library, you might encounter a gettext.sh: file not found or gettext: command not found error during installation. Typically, this library is installed with the operating system; however, if it was removed or is not installed, install the gettext package.
- If you will enable SSL communication, install cURL 7.28 or later (required by Elastic Stack) on all management hosts and all hosts that will be used to run notebooks.
- Remote shell (rsh) must be available on each host in the cluster.
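To confirm these prerequisites, you can run a quick check on each host. This is a sketch only; the host name and IP address shown in the comments are example values:
hostname -f                # expect the host's FQDN, for example host1.example.com
getent hosts 10.0.0.1      # replace with a real host IP; expect the matching FQDN
python --version           # expect Python 2.7.x
openssl version            # expect OpenSSL 1.0.1 or later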
Install the operating system
The Deep Learning packages require one of the following operating systems:
- RHEL 7.7 little endian or RHEL 7.6 little endian for POWER9™
- Red Hat Enterprise Linux® (RHEL) 7.6 (Linux 64-bit) for x86
Note the following considerations:
- WML CE can be installed and run directly on a bare-metal RHEL system.
- WML CE can be run from a container on a RHEL system. For more information about setting up a host to run WML CE Docker containers, see the appropriate topic:
- For Power®: Using nvidia-docker 2.0 with RHEL 7
- For x86: nvidia-docker for RHEL
- The RHEL installation image and license must be acquired from Red Hat.
For more information about installing operating systems on IBM® Power Systems servers, see Quick start guides for Linux on IBM Power System servers.
Open necessary ports
If a firewall is enabled, the following default ports must be open on all management hosts for IBM Spectrum Conductor Deep Learning Impact: 9243, 9280, 5000, 5001, 27017, and 6379. If you change these ports after installation, make sure to update your firewall rules accordingly.
Review this topic to determine which ports need to be opened for IBM Spectrum Conductor™: Summary of ports used by IBM Spectrum Conductor.
Review this topic to determine which ports need to be opened for IBM Spectrum Conductor Deep Learning Impact: Summary of ports used by IBM Spectrum Conductor Deep Learning Impact.
Review this topic to determine which ports need to be opened for elastic distributed inference: Summary of ports used by elastic distributed inference.
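If your hosts use firewalld, the following is a minimal sketch of opening the default IBM Spectrum Conductor Deep Learning Impact ports listed above. Run it on each management host, and adjust the port list if you changed the defaults:
# Open the default Deep Learning Impact ports (TCP) and reload firewalld:
sudo firewall-cmd --permanent --add-port={9243,9280,5000,5001,27017,6379}/tcp
sudo firewall-cmd --reload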
Ensure user access of client machines to cluster hosts
Spark workloads run on non-management hosts in your cluster. Therefore, the Apache Spark UI and RESTful APIs that are available from Spark applications and the Spark history server must be accessible to your end users. This access is also required for any notebooks that you configure for use with IBM Spectrum Conductor.
If the hosts and the ports used are not accessible from your client machines, you can encounter errors when you access notebooks and IBM Spectrum Conductor user interfaces. The management hosts also must be able to access these hosts and the ports used.
Set the appropriate heap size
The default Elasticsearch installation uses a 2-4 GB heap for the Elasticsearch services. Elasticsearch recommends that you assign 50 percent of available memory to the Elasticsearch client service, but not exceed 30.5 GB. Based on these recommendations, configure the Elasticsearch client and data services heap in IBM Spectrum Conductor to use 6 to 8 GB. Further, the default garbage collector for Elasticsearch is Concurrent Mark Sweep (CMS). To prevent long stop-the-world pauses, do not configure the heap size to be larger than what the CMS garbage collector is designed for (approximately 6 to 8 GB).
For instructions to change the heap size, see How do I change the heap size for Elasticsearch?.
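For reference, in a stock Elasticsearch installation the heap size is set with the -Xms and -Xmx JVM options in the jvm.options file. The following is only a sketch of the target values, not the Conductor-specific procedure; IBM Spectrum Conductor manages the Elasticsearch configuration itself, so apply the change by following the linked FAQ:
# Generic Elasticsearch jvm.options sketch; keep the initial and
# maximum heap equal, in the 6 to 8 GB range:
-Xms6g
-Xmx6g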
Log in with root permission
The following tasks all require that you log in as the root user or as a user with sudo access to root.
Create cluster administration accounts
If you set up users on your hosts (both management and compute hosts), the execution user must use the same user ID (UID) and group ID (GID) on all of the hosts. For example, the UID and GID for the CLUSTERADMIN account must be the same on all hosts. By default, CLUSTERADMIN is set to egoadmin.
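For example, a minimal sketch of creating the default egoadmin account with a matching UID and GID on every host. The value 1001 is an arbitrary example, not a requirement; use any value that is free on all hosts:
# Create the group and user with the same fixed IDs on each host:
sudo groupadd -g 1001 egoadmin
sudo useradd -u 1001 -g 1001 -m egoadmin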
Mount a shared file system
If you are using multiple nodes, you must mount a shared file system. The shared file system is used for user data, such as datasets, tuning data, validation results, trained models, and more. In this step, the default cluster administrator account (egoadmin) is used and the mount points are /dli_shared_fs and /dli_result_fs. Optionally, /dli_data_fs can be used for additional user data. The shared file system must meet these requirements:
- The shared file system must be mounted to a clean directory. If you are reinstalling IBM Spectrum Conductor Deep Learning Impact, make sure that the directory specified is empty.
- The shared file system must have a minimum of 2 GB of free disk space.
- The cluster administrator account (the account that was specified by the CLUSTERADMIN variable during IBM Spectrum Conductor installation) must have read and write permissions to the shared file system.
To verify that you mounted the shared file system correctly, assuming that the cluster administrator account is egoadmin and the mount points are /dli_shared_fs and /dli_result_fs, follow these steps:
- Export the environment variables. Note: The directory specified as the shared file system must exist. Before exporting the shared file system environment variables, make sure that each specified directory exists; if it does not, manually create it.
export CLUSTERADMIN=egoadmin
export ADMINGROUP=egoadmin
export DLI_SHARED_FS=/dli_shared_fs
export DLI_RESULT_FS=/dli_result_fs
- Change the ownership of DLI_SHARED_FS to CLUSTERADMIN:
chown -Rh $CLUSTERADMIN:$ADMINGROUP $DLI_SHARED_FS
- Make sure DLI_SHARED_FS is owned by CLUSTERADMIN and remove write access for group and other users:
chmod -R 755 $DLI_SHARED_FS
- Set the correct ownership for DLI_RESULT_FS, which is the mount point for shared result data storage:
chown $CLUSTERADMIN:$ADMINGROUP $DLI_RESULT_FS
chmod 733 $DLI_RESULT_FS
chmod o+t $DLI_RESULT_FS
- Export DLI_DATA_FS. You must set the permissions on this shared storage so that the deep learning workload submission user can read the files in this directory. If you are using Caffe models, the directory structure also needs to be writable. For example:
export DLI_DATA_FS=/dli_data_fs
chmod -R 755 $DLI_DATA_FS
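To confirm the setup, you can run a quick write test as the cluster administrator. This is a sketch that assumes the environment variables exported in the first step are still set in your shell:
# Verify that the cluster administrator can write to the shared file system:
sudo -u $CLUSTERADMIN touch $DLI_SHARED_FS/.write_test && echo "DLI_SHARED_FS is writable"
sudo -u $CLUSTERADMIN rm $DLI_SHARED_FS/.write_test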
Install utilities and packages
- bind-utils: provides the nslookup tool.
- iproute: provides the ss utility that the built-in Zeppelin notebook uses.
- net-tools: network tools package for RHEL.
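A minimal sketch of installing these packages on RHEL with yum:
sudo yum install -y bind-utils iproute net-tools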
Red Hat Enterprise Linux operating system and repository setup
- Enable the common, optional, and extras repo channels.
IBM POWER8®:
sudo subscription-manager repos --enable=rhel-7-for-power-le-optional-rpms
sudo subscription-manager repos --enable=rhel-7-for-power-le-extras-rpms
sudo subscription-manager repos --enable=rhel-7-for-power-le-rpms
IBM POWER9:
sudo subscription-manager repos --enable=rhel-7-for-power-9-optional-rpms
sudo subscription-manager repos --enable=rhel-7-for-power-9-extras-rpms
sudo subscription-manager repos --enable=rhel-7-for-power-9-rpms
x86:
sudo subscription-manager repos --enable=rhel-7-server-optional-rpms
sudo subscription-manager repos --enable=rhel-7-server-extras-rpms
sudo subscription-manager repos --enable=rhel-7-server-rpms
- Install packages needed for the installation:
sudo yum -y install wget nano bzip2
- Enable the Fedora Project Extra Packages for Enterprise Linux (EPEL) repository:
wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo rpm -ihv epel-release-latest-7.noarch.rpm
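Optionally, confirm that the repositories are enabled. The repo IDs shown here follow the POWER9 example above; adjust the pattern for POWER8 or x86:
yum repolist enabled | egrep 'rhel-7-for-power-9|epel'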
IBM POWER9 specific udev rules
On POWER9 systems, the default udev memory auto-onlining rule interferes with the CUDA GPU driver. To disable it, follow these steps:
- Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user overridden rules:
sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
- Edit the /etc/udev/rules.d/40-redhat.rules file:
sudo nano /etc/udev/rules.d/40-redhat.rules
- Comment out the entire "Memory hotadd request" section and save the change:
# Memory hotadd request
#SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
#PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
#ENV{.state}="online"
#PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
#ATTR{state}=="offline", ATTR{state}="$env{.state}"
#LABEL="memory_hotplug_end"
- Optionally, delete the first line of the file, since the file was copied to a directory where it cannot be overwritten:
# do not edit this file, it will be overwritten on update
- Restart the system for the changes to take effect:
sudo reboot
Install the kernel development packages
- On Red Hat, either load the latest kernel or do a full update:
- Load the latest kernel:
- For x86:
sudo yum install kernel-devel
sudo yum update kernel kernel-devel kernel-tools kernel-tools-libs
reboot
- For POWER:
sudo yum install kernel-devel
sudo yum update kernel kernel-devel kernel-tools kernel-tools-libs kernel-bootwrapper
reboot
- Do a full update:
sudo yum install kernel-devel
sudo yum update
sudo reboot
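After the reboot, you can confirm that the running kernel matches the installed kernel-devel package; the versions must match for the GPU driver build to succeed:
uname -r            # version of the running kernel
rpm -q kernel-devel # must include a package matching the running kernel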
Remove previously installed CUDA and NVIDIA drivers
The CUDA Toolkit, cuDNN, and NCCL are provided as Conda packages and do not require separate installation. The GPU driver must still be installed separately.
Before installing the updated GPU driver, uninstall any previously installed CUDA and NVIDIA drivers. Follow these steps:
- Remove all CUDA Toolkit and GPU driver packages.
You can display installed CUDA and driver packages by running these commands:
rpm -qa | egrep 'cuda.*(9-2|10-0|10-1)'
rpm -qa | egrep '(cuda|nvidia).*(396|410|418)\.'
Verify the list, then remove the packages with yum remove (see the example after these steps).
- Remove any CUDA Toolkit and GPU driver repository packages.
These should have been included in step 1, but you can confirm with this command:
rpm -qa | egrep '(cuda|nvidia).*repo'
Use yum remove to remove any that remain.
- Clean the yum repository:
sudo yum clean all
- Remove cuDNN and NCCL, which are typically installed under the CUDA Toolkit directories:
sudo rm -rf /usr/local/cuda /usr/local/cuda-9.2 /usr/local/cuda-10.0 /usr/local/cuda-10.1
- Reboot the system to unload the GPU driver:
sudo shutdown -r now
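As referenced in step 1, the following is a sketch of removing the packages matched by the rpm queries. Review each query's output first, and run a command only if its query returned results:
# Remove the CUDA Toolkit packages and driver packages found above:
sudo yum remove $(rpm -qa | egrep 'cuda.*(9-2|10-0|10-1)')
sudo yum remove $(rpm -qa | egrep '(cuda|nvidia).*(396|410|418)\.')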
Install the GPU driver
Many of the deep learning packages require the GPU driver packages to be downloaded from NVIDIA.
Install the GPU driver by following these steps:
- Download the NVIDIA GPU driver:
- Go to NVIDIA Driver Download.
- Select Product Type: Tesla.
- Select Product Series: P-Series (for Tesla P100) or V-Series (for Tesla V100).
- Select Product: Tesla P100 or Tesla V100.
- Select Operating System: click Show all Operating Systems, then choose the appropriate value:
- Linux POWER LE RHEL 7 for Power
- Linux 64-bit RHEL7 for x86
- Select CUDA Toolkit: 10.2.
- Click SEARCH to go to the download link.
- Click Download to download the driver. Important: An rpm file should be downloaded. If a different type of file is downloaded, verify that you chose the correct options and try again.
- Install the GPU driver repository and the driver packages:
sudo rpm -ivh nvidia-driver-local-repo-rhel7-440.*.rpm
sudo yum install nvidia-driver-latest-dkms
- Set nvidia-persistenced to start at boot (required for ppc64le, recommended for x86):
sudo systemctl enable nvidia-persistenced
- Reboot the system.
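After the reboot, verify that the driver is loaded and the GPUs are visible:
nvidia-smi   # lists driver version and each GPU if the install succeeded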
Configure GPU mode to exclusive across all nodes
- Set the GPU mode to exclusive process mode:
nvidia-smi -c 1
- Verify that the GPU mode is set to exclusive process mode:
nvidia-smi
Configure the required limits for the maximum number of processes and the maximum number of open files
For both the root user and the cluster administrator (egoadmin), you must configure the required limits for the maximum number of processes (nproc) and the maximum number of open files (nofile) on your hosts. The limits for the root user must be 65536 or more, and the limit for the cluster administrator must be 65536. Without these limits, services can hang or enter the Error state on cluster startup.
- In the /etc/security/limits.conf file, set nproc and nofile to 65536 for root and the cluster administrator. In the following example, the cluster administrator is named egoadmin:
root soft nproc 65536
root hard nproc 65536
root soft nofile 65536
root hard nofile 65536
egoadmin soft nproc 65536
egoadmin hard nproc 65536
egoadmin soft nofile 65536
egoadmin hard nofile 65536
- Log out and then log back in to the server for the changes to take effect.
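After logging back in, you can verify the new limits:
ulimit -u   # maximum number of processes; expect 65536
ulimit -n   # maximum number of open files; expect 65536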
Set the vm.max_map_count kernel value
Set the vm.max_map_count kernel value to 262144 or more:
- Set the kernel value dynamically to ensure that the change takes effect immediately:
sysctl -w vm.max_map_count=262144
- Set the kernel value in the /etc/sysctl.conf file to ensure that the change is still in effect when you restart your host:
vm.max_map_count=262144
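To verify the setting after either change:
sysctl vm.max_map_count   # expect 262144 or more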