Installing Execution Engine for Apache Hadoop on Spectrum Conductor clusters

Before a project administrator can install Execution Engine for Apache Hadoop on the Spectrum Conductor cluster, the service must first be installed on Cloud Pak for Data.

Review and confirm that you meet the following requirements, and that you are using a supported Spectrum Conductor version and platform, before you install the service:

Requirements

System requirements for installing Execution Engine for Apache Hadoop

  • Spectrum Conductor

    • Version 2.5.0 or higher
    • Platform: x86 with RHEL 7.x
    • Anaconda instances set up with Miniconda version 4.6.16
  • Edge node hardware requirements

    • 8 GB memory
    • 2 CPU cores
    • 100 GB disk, mounted and available on /var in the local Linux file system. The installation creates the following directories and these locations are not configurable:
      • To store the logs: /var/log/dsxhi.
      • To store the process IDs: /var/run/dsxhi.
    • A 10 Gb network interface card is recommended for multi-tenant environments.
  • Edge node software requirements

    • Python 2.7 or higher
    • Java JRE 1.8.x
  • Ports

    • An external port for the gateway service.
    • An internal port for the Hadoop Integration service.
    • Internal ports for the Jupyter Enterprise Gateway service.
  • Service user requirements

    • The Execution Engine for Apache Hadoop service runs as a service user. This user must be a valid Linux user on the node where the Execution Engine for Apache Hadoop service is installed.

    • If you install the service on multiple edge nodes for high availability, use the same service user on each node.

    • This user must have the Cluster administrator role, or the Consumer administrator role with the context of the root consumer (/), to be able to create Spectrum Conductor Anaconda environments.

    • Spectrum Conductor cluster setup

      The Execution Engine for Apache Hadoop service adds environments to existing Anaconda distribution instances that are defined in Spectrum Conductor. As part of the setup, an Anaconda distribution instance that uses Conda version 4.6.14 must exist. If you plan to use custom images, provide an Anaconda distribution instance that has the same Conda version as the custom image.

      You must also provide the UUID that is associated with the Anaconda distribution instance in the dsxhi_install.conf file as part of the installation configuration.
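Before you install, you can confirm the edge node minimums with a short pre-flight check. The following script is an illustrative sketch only (it is not part of the service installer); the thresholds mirror the minimums that are listed above.

```shell
#!/bin/sh
# Illustrative pre-flight check for the DSXHI edge node minimums
# (8 GB memory, 2 CPU cores, 100 GB available on /var, python and java on PATH).
# Not part of the product installer.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
cores=$(nproc)
var_gb=$(df -BG --output=avail /var | tail -n 1 | tr -dc '0-9')

[ "$mem_kb" -ge $((8 * 1024 * 1024)) ] || echo "WARN: less than 8 GB of memory"
[ "$cores" -ge 2 ]                     || echo "WARN: fewer than 2 CPU cores"
[ "$var_gb" -ge 100 ]                  || echo "WARN: less than 100 GB available on /var"
command -v python >/dev/null 2>&1      || echo "WARN: python not found on PATH"
command -v java   >/dev/null 2>&1      || echo "WARN: java (JRE 1.8.x) not found on PATH"
echo "pre-flight check complete"
```

Run the check as the intended service user on each edge node so that PATH lookups reflect that user's environment.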

Installation steps for non-root users

If you plan to install the Execution Engine for Apache Hadoop service as a non-root user, grant the following permissions by using the visudo command:

Steps for DSXHI non-root installation:

  1. Apply the visudo rules for the non-root user.
  2. su <non-root_user>
  3. sudo yum install <rpm>
  4. sudo chown <non-root_user>:<non-root_user> -R /opt/ibm/dsxhi/
  5. Edit or generate /opt/ibm/dsxhi/conf/dsxhi_install.conf.
  6. cd /opt/ibm/dsxhi/bin
  7. sudo python /opt/ibm/dsxhi/bin/install.py

VISUDO template:

## DSXHI
<non-root_user> ALL=(root) NOPASSWD: /usr/bin/yum install <path-to-rpm/rpm>, /usr/bin/yum erase dsxhi*, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/python /opt/ibm/dsxhi/*
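Before you activate the template, you can stage the rules in a file and syntax-check them. The following sketch is illustrative; the user name (dsxuser) and rpm path are placeholders for your own values.

```shell
# Sketch: stage the DSXHI sudoers rules in a temporary file and validate the
# syntax with `visudo -cf` before installing them. The user name (dsxuser)
# and rpm path are placeholders, not values from the product.
rules=$(mktemp)
cat > "$rules" <<'EOF'
## DSXHI
dsxuser ALL=(root) NOPASSWD: /usr/bin/yum install /tmp/dsxhi/dsxhi.rpm, /usr/bin/yum erase dsxhi*, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/python /opt/ibm/dsxhi/*
EOF
# visudo -cf parses the file without activating it; exit code 0 means valid syntax
if command -v visudo >/dev/null 2>&1; then
    visudo -cf "$rules"
fi
```

Checking with visudo -cf avoids locking yourself out of sudo with a malformed rules file.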

Watson Studio interacts with a Spectrum Conductor cluster through the following services:

  • Spectrum Conductor REST services: retrieves Anaconda instances, environment names, and instance group information.
  • Jupyter Enterprise Gateway: submits jobs through Jupyter Enterprise Gateway to Spark on Spectrum Conductor.

Watson Studio user

Every user that is connecting from Watson Studio must be a valid user on the Spectrum Conductor cluster. The recommended way to achieve this is by integrating Watson Studio and the Spectrum Conductor cluster with the same LDAP.

Installing the service

  1. Run the RPM installer. The RPM is installed in /opt/ibm/dsxhi.

  2. If you're running the install as the service user, run sudo chown <serviceuser> -R /opt/ibm/dsxhi.

  3. Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file using /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.SPECTRUM file as a reference.

  4. Fill in dsxhi_install.conf based on your Spectrum Conductor configuration. The template describes what is needed for each field. If you need to use your own custom certificates, see Configuring custom certificates.

  5. Optional: If you need to control the location of Java, use a shared truststore, or pass additional Java options, update the /opt/ibm/dsxhi/conf/dsxhi_env.sh script to include the appropriate values for the environment variables:

    export JAVA="/usr/jdk64/jdk1.8.0_112/bin/java"
    export JAVA_CACERTS=/etc/pki/java/cacerts
    export DSXHI_JAVA_OPTS="-Djavax.net.ssl.trustStore=$JAVA_CACERTS"
    
  6. In /opt/ibm/dsxhi/bin, run the ./install.py script to install the service. The script prompts for input on the following options (alternatively, you can specify the options as flags):

    • Accept the license terms (Hadoop registration uses the same license as Watson Studio). You can also accept the license through the dsxhi_license_acceptance property in dsxhi_install.conf.

    • You are prompted for the password for the Spectrum Conductor REST endpoints. The value can also be passed through the --password flag or -p flag.

    • For the master secret for the gateway service, the value can also be passed through the --dsxhi_gateway_master_password flag or -g flag.

    • You are prompted for the Java cacerts truststore password. The value can also be passed through the --dsxhi_gateway_cacerts_password flag or -c flag.

    • Optional: If the custom_jks property in dsxhi_install.conf is used, provide the password associated with this file. The value can also be passed through the --custom_jks_password flag or -d flag.
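If you script the installation, the prompts in step 6 can be supplied up front through the flags described above. This is an illustrative invocation; the placeholder values are yours to supply:

```
cd /opt/ibm/dsxhi/bin
sudo python ./install.py -p <conductor_rest_password> \
                         -g <gateway_master_secret> \
                         -c <cacerts_password> \
                         -d <custom_jks_password>   # only when custom_jks is set
```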

After the service is installed, the necessary components, such as the gateway service, DSXHI integration services, and Jupyter Enterprise Gateway services, are started.

Configuring custom certificates

You can use your existing certificates without modifying the system truststore. The following configuration properties change the default DSXHI certificate handling:

custom_jks
DSXHI typically generates a keystore, converts it to a .crt file, and adds the .crt file to the Java truststore. With this property, you can provide a custom keystore that is used to generate the required .crt file.
dsxhi_cacert
By default, DSXHI detects the appropriate truststore to use as part of the installation. With the dsxhi_cacert property, you can provide a custom truststore (cacerts) to which the DSXHI certificates are added.
add_certs_to_truststore
This property controls whether DSXHI adds the host certificate to the truststore. If you set it to False, DSXHI makes no changes to the truststore, and users must add the host certificate themselves. If you set it to True, DSXHI retains its default behavior and adds the host certificate to the Java truststore for the gateway and web services, including on detected data nodes.
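Taken together, a custom-certificate setup in dsxhi_install.conf might look like the following sketch. Only the property names come from this section; the file paths and values are examples, not defaults.

```
# excerpt from /opt/ibm/dsxhi/conf/dsxhi_install.conf -- example values only
custom_jks=/opt/certs/dsxhi-keystore.jks
dsxhi_cacert=/opt/certs/truststore.jks
add_certs_to_truststore=True
```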

Learn more

See Uninstalling the service on a Spectrum Conductor cluster for information on uninstalling the service.

Parent topic: Installing Execution Engine for Apache Hadoop