Setting up the system and running the Automated installer

Before installing IBM Watson® Machine Learning Accelerator, perform the following setup on your system.

Note: If you do not want to use the Automated installer, use the information in this topic instead: Set up your system (Manual install).

Prerequisites

  • Ensure that you have root access to all hosts running deep learning workloads.
  • Ensure that all hardware and software requirements are met: Hardware and software requirements.
  • Use fully qualified domain names (FQDN) for all hosts in your cluster. Host names must be resolvable through a valid domain name server (DNS) in both directions: the IP address from the host name (forward lookup) and the host name from the IP address (reverse lookup). Use the following commands to confirm the host names in your cluster: hostname -f and getent hosts [ip_address]. The host names that are returned by these commands must match your cluster configuration.
  • All hosts in the cluster must use the same clock setting.
  • Python 2.7 must be installed on all hosts.
  • OpenSSL 1.0.1 or later must be installed on all hosts.
  • All hosts require the gettext library to provide globalization support for translated product messages. Without this library, you might encounter a gettext.sh: file not found or gettext: command not found error during installation. Typically, this library is installed with the operating system; however, if it was removed or is not installed, install the gettext package.
  • If you will enable SSL communication, install cURL 7.28 or later (required for Elastic Stack) on all management hosts and on all hosts that will be used to run notebooks.
  • Remote shell (rsh) must be available on each host in the cluster.
Virus scanning: Disable real-time anti-virus software and any defragmentation software. These tools can degrade performance and cause instability, especially on management hosts, and can create problems if they lock files while scanning them. Schedule any virus scanning during cluster downtime.
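The prerequisite checks above can be scripted. The following is a minimal sketch for a RHEL 7.x host; package and interpreter names may differ on your system:

```shell
# Quick prerequisite check; run on every host in the cluster.
hostname -f                                              # must print the FQDN
command -v python  >/dev/null && python --version 2>&1   # expect Python 2.7.x
command -v openssl >/dev/null && openssl version         # expect 1.0.1 or later
for cmd in gettext rsh; do
  command -v "$cmd" >/dev/null 2>&1 && echo "$cmd: present" || echo "$cmd: MISSING"
done
```

Run the script on each host and resolve any MISSING entries before you start the installation.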

Install the operating system

The Deep Learning packages require one of the following operating systems:

Red Hat Enterprise Linux (RHEL) 7.6 little endian for POWER8® and POWER9™
  • WML CE can be installed and run directly on a bare-metal RHEL 7.6 system
  • The RHEL installation image and license must be acquired from Red Hat

For more information about installing operating systems on IBM Power Systems servers, see Quick start guides for Linux on IBM® Power System servers.

Red Hat Enterprise Linux (RHEL) 7.6 (Linux 64-bit)

Open necessary ports

If a firewall is enabled, the following default ports must be open on all management hosts for IBM Spectrum Conductor Deep Learning Impact: 9243, 9280, 5000, 5001, 27017, and 6379. If you change these ports after installation, make sure to update your firewall rules accordingly.
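On RHEL 7 with firewalld, the default ports can be opened as follows (a sketch; run as root on each management host, and adjust the port list if you changed any defaults):

```shell
# Open the default IBM Spectrum Conductor Deep Learning Impact ports.
PORTS="9243 9280 5000 5001 27017 6379"
for port in $PORTS; do
  firewall-cmd --permanent --add-port="${port}/tcp"
done
firewall-cmd --reload   # apply the permanent rules to the running firewall
```

If you use iptables directly instead of firewalld, add equivalent ACCEPT rules for the same TCP ports.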

Review this topic to determine which ports need to be opened for IBM Spectrum Conductor™: Summary of ports used by IBM Spectrum Conductor.

Review this topic to determine which ports need to be opened for IBM Spectrum Conductor Deep Learning Impact: Summary of ports used by IBM Spectrum Conductor Deep Learning Impact.

Ensure user access of client machines to cluster hosts

Spark workload runs on non-management hosts in your cluster. Therefore, the Apache Spark UI and RESTful APIs that are available from Spark applications and the Spark history server must be accessible to your end users. This access is also required for any notebooks that you configure for use with IBM Spectrum Conductor.

If the hosts and the ports that they use are not accessible from your client machines, you can encounter errors when you access notebooks and the IBM Spectrum Conductor user interfaces. The management hosts must also be able to access these hosts and ports.
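A simple TCP check can confirm reachability from a client machine. This is a sketch: mgmt-host.example.com is a placeholder for one of your hosts, and 9243 is the default SSL web server port.

```shell
# Test whether a host and port are reachable from this machine.
host=mgmt-host.example.com   # placeholder; substitute a real host name
port=9243
if timeout 3 bash -c "cat < /dev/null > /dev/tcp/$host/$port" 2>/dev/null; then
  echo "$host:$port reachable"
else
  echo "$host:$port NOT reachable"
fi
```

Repeat the check for each port that your client machines must reach, from both client machines and management hosts.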

Set the appropriate heap size

The default Elasticsearch installation uses a 2-4 GB heap for the Elasticsearch services. Elasticsearch recommends assigning 50 percent of available memory to the Elasticsearch client service, but not exceeding 30.5 GB. Based on these recommendations, configure the Elasticsearch client and data services heap in IBM Spectrum Conductor to use 6-8 GB. Further, the default garbage collector for Elasticsearch is Concurrent Mark Sweep (CMS). To prevent long stop-the-world pauses, do not configure a heap size larger than what the CMS garbage collector is designed to handle (approximately 6-8 GB).

For instructions to change the heap size, see How do I change the heap size for Elasticsearch?.
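As an illustration only, in a standalone Elasticsearch deployment a fixed heap would be set through JVM options such as the following; in IBM Spectrum Conductor, use its own configuration mechanism as described in the linked topic:

```shell
# Illustrative only: pin the Elasticsearch heap to 6 GB via ES_JAVA_OPTS.
# Xms (initial) and Xmx (maximum) must match to avoid heap resize pauses.
export ES_JAVA_OPTS="-Xms6g -Xmx6g"
echo "$ES_JAVA_OPTS"
```

Setting the initial and maximum heap to the same value keeps the heap size stable under load.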

Log in with root permission

The following tasks all require that you log in as a user with root permission or with sudo access to root.

Mount a shared file system

If you are using multiple nodes, you must mount a shared file system. The shared file system is used for user data, such as datasets, tuning data, validation results, training models, and more. In the following steps, the default cluster administrator account (egoadmin) is used and the mount points are /dli_shared_fs and /dli_result_fs. Optionally, /dli_data_fs can be used for additional user data. The shared file system must meet these requirements:

  • The shared file system must be mounted to a clean directory. If you are reinstalling IBM Spectrum Conductor Deep Learning Impact, make sure that the directory specified is empty.
  • The shared file system must have a minimum of 2 GB of free disk space.
  • The cluster administrator account (the account that was specified by the CLUSTERADMIN variable during IBM Spectrum Conductor installation) must have read and write permissions to the shared file system.

To verify that you mounted the shared file system correctly, assuming that cluster administrator account is egoadmin and the mount points are /dli_shared_fs and /dli_result_fs, follow these steps:

  1. Export the environment variables:
    Note: The directory specified as the shared file system must exist. Before exporting the shared file system environment variables, make sure that each specified directory exists; if it does not, create it manually.
    export CLUSTERADMIN=egoadmin
    export ADMINGROUP=egoadmin
    export DLI_SHARED_FS=/dli_shared_fs
    export DLI_RESULT_FS=/dli_result_fs
  2. Change the ownership of DLI_SHARED_FS to CLUSTERADMIN:
    chown -Rh $CLUSTERADMIN:$ADMINGROUP $DLI_SHARED_FS
  3. Make sure that DLI_SHARED_FS is owned by CLUSTERADMIN and that only the owner has write access (group and others retain read and execute access):
    chmod -R 755 $DLI_SHARED_FS
  4. Set the correct ownership for DLI_RESULT_FS, which is the mount point for shared result data storage:
    chown $CLUSTERADMIN:$ADMINGROUP $DLI_RESULT_FS
    chmod 733 $DLI_RESULT_FS
    chmod o+t $DLI_RESULT_FS
  5. Export DLI_DATA_FS:
    export DLI_DATA_FS=/dli_data_fs
    You must set the permissions on this shared file system storage so that the deep learning workload submission user can read the files in this directory. If you are using Caffe models, the directory structure must also be writable. For example:
    chmod -R 755 $DLI_DATA_FS
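After completing the steps above, the resulting ownership and modes can be verified with stat (a sketch; assumes the variables exported in step 1 are set in the current shell):

```shell
# Show owner, group, and octal mode for each mount point.
DLI_SHARED_FS="${DLI_SHARED_FS:-/dli_shared_fs}"
DLI_RESULT_FS="${DLI_RESULT_FS:-/dli_result_fs}"
stat -c 'owner=%U group=%G mode=%a %n' "$DLI_SHARED_FS" "$DLI_RESULT_FS"
# Expect mode=755 on the shared file system and mode=1733 (733 plus the
# sticky bit set by chmod o+t) on the result file system.
```

Both directories should report the cluster administrator account as owner; if they do not, repeat the chown commands above.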

Complete further setup steps and start installing the product

Follow the instructions in Running the Watson Machine Learning (WML) Accelerator Software Install Module to complete the system setup and start installing IBM Watson Machine Learning Accelerator.