Set up your system

Before installing IBM Watson Machine Learning Accelerator, perform the following setup on your system.

Prerequisites

  • Ensure that you have root access to all hosts running deep learning workloads.
  • Ensure that all hardware and software requirements are met: Hardware and software requirements.
  • Use fully qualified domain names (FQDN) for all hosts in your cluster. The host names must resolve through a valid domain name server (DNS), so that each host name resolves to its IP address and each IP address resolves back to its host name. Use the following commands to confirm the host names in your cluster: hostname -f and getent hosts [ip_address] (see the example after this list). The host names that are returned by these commands must match your cluster configuration.
  • All hosts in the cluster must use the same clock setting.
  • Python 2.7 must be installed on all hosts.
  • OpenSSL 1.0.1 or later must be installed on all hosts.
  • All hosts require the gettext library to provide globalization support for translated product messages. Without this library, you might encounter a gettext.sh: file not found or gettext: command not found error during installation. Typically, this library is installed with the operating system; however, if it was removed or is not installed, install the gettext package.
  • If you will enable SSL communication, install cURL 7.28 or later (required by Elastic Stack) on all management hosts and all hosts that will be used to run notebooks.
  • Remote shell (rsh) must be available on each host in the cluster.
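
For example, on a host with a hypothetical FQDN of host1.example.com and address 192.0.2.10, the forward and reverse lookups should agree:

    hostname -f
    # host1.example.com
    getent hosts 192.0.2.10
    # 192.0.2.10      host1.example.com
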
Virus scanning: It is recommended that you disable real-time anti-virus and defragmentation software. These tools degrade performance and cause instability, especially on management hosts, and can create problems if they lock files while scanning them. If virus scanning is required, schedule it during cluster downtime.

Install the operating system

The Deep Learning packages require one of the following operating systems:

  • RHEL 7.7 little endian and RHEL 7.6 little endian for POWER9™
  • Red Hat Enterprise Linux® (RHEL) 7.6 (Linux 64-bit)

Note the following:
  • WML CE can be installed and run directly on a bare-metal RHEL system.
  • WML CE can be run from a container on a RHEL system. For more information about setting up a host to run WML CE Docker containers, see the appropriate topic.
  • The RHEL installation image and license must be acquired from Red Hat.

For more information about installing operating systems on IBM® Power Systems servers, see Quick start guides for Linux on IBM® Power System servers.

Open necessary ports

If a firewall is enabled, the following default ports must be open on all management hosts for IBM Spectrum Conductor Deep Learning Impact: 9243, 9280, 5000, 5001, 27017, and 6379. If you change these ports after installation, make sure to update the firewall rules accordingly.
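
For example, on RHEL with firewalld, the default ports can be opened as follows. This is a sketch; adjust the ports (and zone) to match your configuration:

    sudo firewall-cmd --permanent --add-port={9243,9280,5000,5001,27017,6379}/tcp
    sudo firewall-cmd --reload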

Review this topic to determine which ports need to be opened for IBM Spectrum Conductor™: Summary of ports used by IBM Spectrum Conductor.

Review this topic to determine which ports need to be opened for IBM Spectrum Conductor Deep Learning Impact: Summary of ports used by IBM Spectrum Conductor Deep Learning Impact.

Review this topic to determine which ports need to be opened for elastic distributed inference: Summary of ports used by elastic distributed inference.

Ensure user access of client machines to cluster hosts

Spark workloads run on non-management hosts in your cluster. Therefore, the Apache Spark UI and the RESTful APIs that are available from Spark applications and the Spark history server must be accessible to your end users. This access is also required for any notebooks that you configure for use with IBM Spectrum Conductor.

If the hosts and the ports used are not accessible from your client machines, you might encounter errors when you access notebooks and the IBM Spectrum Conductor user interfaces. The management hosts must also be able to access these hosts and ports.

Set the appropriate heap size

The default Elasticsearch installation uses a 2-4 GB heap for the Elasticsearch services. Elasticsearch recommends assigning no more than 50 percent of available memory to the Elasticsearch heap, but not exceeding 30.5 GB. Based on these recommendations, configure the Elasticsearch client and data services heap in IBM Spectrum Conductor to use 6-8 GB. Further, the default garbage collector for Elasticsearch is Concurrent Mark Sweep (CMS). To prevent long stop-the-world pauses, do not configure the heap size to be larger than what the CMS garbage collector was designed for (approximately 6-8 GB).

For instructions to change the heap size, see How do I change the heap size for Elasticsearch?.

Log in with root permission

The following tasks all require that you log in as a user with root permissions or with sudo access to root.

Create cluster administration accounts

If you set up users on your hosts (both management and compute hosts), the execution user must have the same user ID (UID) and group ID (GID) on all of the hosts. For example, the UID and GID for the CLUSTERADMIN account must be the same on all hosts. By default, CLUSTERADMIN is set to egoadmin.
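
For example, the account can be created with a fixed UID and GID on each host. The values shown here are hypothetical; whatever values you choose must be identical on every host:

    # Hypothetical UID and GID; use the same values on every host in the cluster.
    sudo groupadd -g 1001 egoadmin
    sudo useradd -u 1001 -g 1001 -m egoadmin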

Mount a shared file system

If you are using multiple nodes, you must mount a shared file system. The shared file system is used for user data, such as datasets, tuning data, validation results, training models and more. In this step, the default cluster administrator account (egoadmin) is used and the mount points are /dli_shared_fs and /dli_result_fs. Optionally, /dli_data_fs can be used for additional user data. The shared file system must meet these requirements:

  • The shared file system must be mounted to a clean directory. If you are reinstalling IBM Spectrum Conductor Deep Learning Impact, make sure that the directory specified is empty.
  • The shared file system must have a minimum of 2 GB of free disk space.
  • The cluster administrator account (the account that was specified by the CLUSTERADMIN variable during IBM Spectrum Conductor installation) must have read and write permissions to the shared file system.
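
If the shared storage is exported over NFS, the mounts might look like the following /etc/fstab entries. This is a minimal sketch with a hypothetical NFS server and export paths; substitute your own:

    # Hypothetical NFS server and exports; adjust to your environment.
    nfsserver.example.com:/export/dli_shared_fs   /dli_shared_fs   nfs   defaults   0 0
    nfsserver.example.com:/export/dli_result_fs   /dli_result_fs   nfs   defaults   0 0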

To verify that you mounted the shared file system correctly, assuming that cluster administrator account is egoadmin and the mount points are /dli_shared_fs and /dli_result_fs, follow these steps:

  1. Export the environment variables:
    Note: The directory specified as the shared file system must exist. Before exporting the shared file system environment variable, make sure that the directory specified exists; if it does not, manually create it.
    
    export CLUSTERADMIN=egoadmin
    export ADMINGROUP=egoadmin
    export DLI_SHARED_FS=/dli_shared_fs
    export DLI_RESULT_FS=/dli_result_fs
  2. Change the ownership of DLI_SHARED_FS to CLUSTERADMIN:
    chown -Rh $CLUSTERADMIN:$ADMINGROUP $DLI_SHARED_FS
  3. Make sure DLI_SHARED_FS is owned by CLUSTERADMIN and remove all other access from DLI_SHARED_FS:
    chmod -R 755 $DLI_SHARED_FS
  4. Set the correct ownership for DLI_RESULT_FS, which is the mount point for shared result data storage:
    chown $CLUSTERADMIN:$ADMINGROUP $DLI_RESULT_FS
    chmod 733 $DLI_RESULT_FS
    chmod o+t $DLI_RESULT_FS
  5. Export DLI_DATA_FS:
    export DLI_DATA_FS=/dli_data_fs
    You must set the permissions on this shared storage such that the deep learning workload submission user can read the files from this directory. If you are using Caffe models, the directory structure also needs to be writable. For example:
    chmod -R 755 $DLI_DATA_FS
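
To spot-check the result, list the directories and confirm the ownership and modes that were set above (illustrative output; sizes and dates will differ):

    ls -ld $DLI_SHARED_FS $DLI_RESULT_FS
    # drwxr-xr-x ... egoadmin egoadmin ... /dli_shared_fs
    # drwx-wx-wt ... egoadmin egoadmin ... /dli_result_fs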

Install utilities and packages

All hosts require the following utilities and packages:
  • bind-utils - provides the nslookup tool.
  • iproute - provides the ss utility that the built-in Zeppelin notebook uses.
  • net-tools package for RHEL.
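
On RHEL, these can be installed in one step:

    sudo yum install -y bind-utils iproute net-tools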

Red Hat Enterprise Linux operating system and repository setup

  1. Enable common, optional, and extra repo channels.
    IBM POWER8®:
    sudo subscription-manager repos --enable=rhel-7-for-power-le-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-le-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-le-rpms
    IBM POWER9:
    sudo subscription-manager repos --enable=rhel-7-for-power-9-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-9-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-for-power-9-rpms
    x86:
    sudo subscription-manager repos --enable=rhel-7-server-optional-rpms
    sudo subscription-manager repos --enable=rhel-7-server-extras-rpms
    sudo subscription-manager repos --enable=rhel-7-server-rpms
  2. Install packages needed for the installation.
    sudo yum -y install wget nano bzip2
  3. Enable the Fedora Project Extra Packages for Enterprise Linux (EPEL) repository:
    wget https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
    sudo rpm -ihv epel-release-latest-7.noarch.rpm
  4. Load the latest kernel or do a full update:
    • Load the latest kernel:
      • For x86:
        sudo yum install kernel-devel
        sudo yum update kernel kernel-devel kernel-tools kernel-tools-libs
        reboot
      • For POWER:
        sudo yum install kernel-devel
        sudo yum update kernel kernel-devel kernel-tools kernel-tools-libs kernel-bootwrapper
        reboot
    • Do a full update:
      sudo yum install kernel-devel
      sudo yum update
      sudo reboot
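
After these steps, you can confirm that the expected repositories are enabled; the exact repository IDs depend on your platform:

    sudo subscription-manager repos --list-enabled
    yum repolist enabled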

IBM POWER9 specific udev rules

Before you install the NVIDIA components, the udev Memory Auto-Onlining Rule must be disabled for the CUDA driver to function properly.
Note: If you upgraded from a previous release, repeat this step with RHEL 7.6.

To disable it, follow these steps:

  1. Copy the /lib/udev/rules.d/40-redhat.rules file to the directory for user overridden rules:
    sudo cp /lib/udev/rules.d/40-redhat.rules /etc/udev/rules.d/
  2. Edit the /etc/udev/rules.d/40-redhat.rules file:
    sudo nano /etc/udev/rules.d/40-redhat.rules
  3. Comment out the entire "Memory hotadd request" section and save the change:
    # Memory hotadd request
    #SUBSYSTEM!="memory", ACTION!="add", GOTO="memory_hotplug_end"
    #PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
    
    #ENV{.state}="online"
    #PROGRAM="/bin/systemd-detect-virt", RESULT=="none", ENV{.state}="online_movable"
    #ATTR{state}=="offline", ATTR{state}="$env{.state}"
    
    #LABEL="memory_hotplug_end"
  4. Optionally, delete the first line of the file, since the file was copied to a directory where it cannot be overwritten:
    # do not edit this file, it will be overwritten on update
  5. Restart the system for the changes to take effect:
    sudo reboot
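
After editing the file, you can confirm that every line of the section is commented out (the reboot in step 5 is still required for the change to take effect):

    grep -A 7 "Memory hotadd request" /etc/udev/rules.d/40-redhat.rules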

Install the kernel development packages

If you have not already loaded the latest kernel or done a full update (and rebooted) as part of the repository setup above, do so before continuing.

Remove previously installed CUDA and NVIDIA drivers

The CUDA Toolkit, cuDNN and NCCL are provided as Conda packages and do not require separate installations. The GPU driver must still be installed separately.

Note: If you require the CUDA Toolkit on the host for uses beyond WML CE, consult NVIDIA's CUDA documentation for help upgrading the GPU driver without disturbing your existing Toolkit installation.

Before installing the updated GPU driver, uninstall any previously-installed CUDA and NVIDIA drivers. Follow these steps:

  1. Remove all CUDA Toolkit and GPU driver packages.

    You can display installed CUDA and driver packages by running these commands:

    rpm -qa | egrep 'cuda.*(9-2|10-0|10-1)'
    rpm -qa | egrep '(cuda|nvidia).*(396|410|418)\.'

    Verify the list and remove with yum remove.

  2. Remove any CUDA Toolkit and GPU driver repository packages.

    These should have been included in step 1, but you can confirm with this command:

    rpm -qa | egrep '(cuda|nvidia).*repo'

    Use yum remove to remove any that remain.

  3. Clean the yum repository:
    sudo yum clean all
  4. Remove cuDNN, NCCL, and any remaining CUDA Toolkit directories:
    sudo rm -rf /usr/local/cuda /usr/local/cuda-9.2 /usr/local/cuda-10.0 /usr/local/cuda-10.1
  5. Reboot the system to unload the GPU driver:
    sudo shutdown -r now
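
After the reboot, you can confirm that no CUDA Toolkit or GPU driver packages remain; this command should produce no output:

    rpm -qa | egrep '(cuda|nvidia)'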

Install the GPU driver

Many of the deep learning packages require the NVIDIA GPU driver, which must be downloaded from NVIDIA.

Install the GPU driver by following these steps:

  1. Download the NVIDIA GPU driver:
    • Go to NVIDIA Driver Download.
    • Select Product Type: Tesla.
    • Select Product Series: P-Series (for Tesla P100) or V-Series (for Tesla V100).
    • Select Product: Tesla P100 or Tesla V100.
    • Select Operating System, click Show all Operating Systems, then choose the appropriate value:
      • Linux POWER LE RHEL 7 for Power
      • Linux 64-bit RHEL7 for x86
    • Select CUDA Toolkit: 10.2.
    • Click SEARCH to go to the download link.
    • Click Download to download the driver.
      Important: An rpm file should be downloaded. If a different type of file is downloaded, verify that you chose the correct options and try again.
  2. Install the GPU driver repository and the GPU driver:
    sudo rpm -ivh nvidia-driver-local-repo-rhel7-440.*.rpm
    sudo yum install nvidia-driver-latest-dkms
  3. Set nvidia-persistenced to start at boot (required for ppc64le, recommended for x86):
    sudo systemctl enable nvidia-persistenced
  4. Reboot the system.
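
After the reboot, verify that the driver loaded and that all GPUs are visible:

    nvidia-smi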

Configure GPU mode to exclusive across all nodes

From the command line interface, do the following:
  1. Set GPU mode to exclusive process mode.
    nvidia-smi -c EXCLUSIVE_PROCESS
  2. Ensure GPU mode is set to exclusive process mode.
    nvidia-smi
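
To check only the compute mode for each GPU, you can query it directly (illustrative output):

    nvidia-smi --query-gpu=index,compute_mode --format=csv
    # index, compute_mode
    # 0, Exclusive_Process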

Configure the required limits for the maximum number of processes and the maximum number of open files

For both the root user and the cluster administrator (egoadmin), you must configure the required limits for the maximum number of processes (nproc) and the maximum number of open files (nofile) on your hosts. The limits for the root user must be 65536 or more, and the limits for the cluster administrator must be 65536. Without these limits, services hang or enter the Error state on cluster startup.

  1. In the /etc/security/limits.conf file, set nproc and nofile to 65536 for root and the cluster administrator. In the following example, the cluster administrator is named egoadmin:
    root   soft    nproc     65536
    root   hard    nproc     65536
    root   soft    nofile    65536
    root   hard    nofile    65536
    egoadmin   soft    nproc     65536
    egoadmin   hard    nproc     65536
    egoadmin   soft    nofile    65536
    egoadmin   hard    nofile    65536
  2. Log out and then log back in to the server for the changes to take effect.
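
After logging back in, confirm the new limits as both root and the cluster administrator:

    ulimit -u    # maximum number of processes (nproc)
    ulimit -n    # maximum number of open files (nofile)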

Set the vm.max_map_count kernel value

Set the vm.max_map_count kernel value to 262144 or more:
  1. Set the kernel value dynamically to ensure that the change takes effect immediately:

    sysctl -w vm.max_map_count=262144

  2. Set the kernel value in the /etc/sysctl.conf file to ensure that the change is still in effect when you restart your host:

    vm.max_map_count=262144
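
After editing /etc/sysctl.conf, you can reload it and confirm the value without rebooting:

    sudo sysctl -p
    sysctl vm.max_map_count
    # vm.max_map_count = 262144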