Installing Execution Engine for Apache Hadoop on Spectrum Conductor clusters
Before a project administrator can install Execution Engine for Apache Hadoop on the Spectrum Conductor cluster, the service must first be installed on Cloud Pak for Data.
Review and confirm that you meet the following requirements and are aware of the supported Spectrum Conductor versions and platforms before you install the service:
- System requirements
- Installation steps for non-root users
- Installing the service
- Configuring custom certificates
Requirements
System requirements for installing Execution Engine for Apache Hadoop
- Spectrum Conductor
  - Version 2.5.0 or higher
  - Platform: x86 with RHEL 7.x
  - Anaconda instances set up with Miniconda version 4.6.16
- Edge node hardware requirements
  - 8 GB memory
  - 2 CPU cores
  - 100 GB disk, mounted and available on /var in the local Linux file system. The installation creates the following directories, and these locations are not configurable:
    - To store the logs: `/var/log/dsxhi`
    - To store the process IDs: `/var/run/dsxhi`
  - 10 GB network interface card recommended for multi-tenant environments.
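As a quick sanity check, you can compare an edge node against the stated minimums (8 GB memory, 2 CPU cores, 100 GB available under /var) before installing. This is a minimal sketch using standard Linux tools, not part of the product installer:

```shell
# Quick check of this edge node against the stated minimums:
# 8 GB memory, 2 CPU cores, 100 GB available under /var.
mem_kb=$(awk '/MemTotal/ {print $2}' /proc/meminfo)
cores=$(nproc)
var_avail_gb=$(df -BG --output=avail /var | tail -n 1 | tr -dc '0-9')
echo "memory: $((mem_kb / 1024 / 1024)) GB"
echo "cpu cores: $cores"
echo "available on /var: ${var_avail_gb} GB"
[ "$mem_kb" -ge $((8 * 1024 * 1024)) ] || echo "WARNING: less than 8 GB memory"
[ "$cores" -ge 2 ] || echo "WARNING: fewer than 2 CPU cores"
[ "$var_avail_gb" -ge 100 ] || echo "WARNING: less than 100 GB available on /var"
```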
- Edge node software requirements
  - Python 2.7 or higher
  - Java JRE 1.8.x
- Ports
  - An external port for the gateway service.
  - An internal port for the Hadoop Integration service.
  - Internal ports for the Jupyter Enterprise Gateway service.
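Before assigning these ports, you can check that your candidates are not already bound on the edge node. A minimal sketch using bash's `/dev/tcp` redirection; the port numbers are illustrative placeholders, not product defaults:

```shell
# Check whether candidate ports are already in use on the edge node.
# The port numbers are illustrative placeholders, not product defaults.
for port in 8443 8082 8888; do
  if (exec 3<>"/dev/tcp/127.0.0.1/$port") 2>/dev/null; then
    echo "port $port is already in use"
  else
    echo "port $port appears free"
  fi
done
```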
- Service user requirements
  - The Execution Engine for Apache Hadoop service runs as a service user. This user must be a valid Linux user on the node where the Execution Engine for Apache Hadoop service is installed.
  - If you install the service on multiple edge nodes for high availability, use the same service user on each node.
  - This user must have the Cluster administrator role, or the Consumer administrator role in the context of the root consumer (`/`), to be able to create Spectrum Conductor Anaconda environments.
- Spectrum Conductor cluster setup
  - The Execution Engine for Apache Hadoop service adds environments to existing Anaconda distribution instances that are defined in Spectrum Conductor. As part of the setup, there must be an Anaconda distribution instance that uses Conda version 4.6.14. If you plan to use custom images, you must provide an Anaconda distribution instance that has the same Conda version as the custom image.
  - You must also provide the UUID that is associated with the Anaconda distribution instance in the `dsxhi_install.conf` file, as part of the installation configuration.
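As a rough sketch, the relevant portion of `dsxhi_install.conf` might look like the following. Only `dsxhi_license_acceptance` and `custom_jks` are property names confirmed elsewhere in this document; the key that holds the Anaconda instance UUID is a placeholder, so rely on the shipped `dsxhi_install.conf.template.SPECTRUM` file for the authoritative field names:

```
# Sketch of /opt/ibm/dsxhi/conf/dsxhi_install.conf -- consult the shipped
# template.SPECTRUM file for the authoritative property names.
dsxhi_license_acceptance=Y
# UUID of the Anaconda distribution instance defined in Spectrum Conductor
# (placeholder key name; see the template for the real one):
<anaconda_instance_uuid_property>=<uuid-from-spectrum-conductor>
# Optional custom keystore; see Configuring custom certificates:
custom_jks=/path/to/custom_keystore.jks
```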
Installation steps for non-root users
If you plan to install the Execution Engine for Apache Hadoop service as a non-root user, the following permissions must be granted by using the visudo command.
Steps for DSXHI non-root installation:
- Apply the visudo rules for the non-root user (see the template below).
- `su <non-root_user>`
- `sudo yum install <rpm>`
- `sudo chown -R <non-root_user>:<non-root_user> /opt/ibm/dsxhi/`
- Edit or generate `/opt/ibm/dsxhi/conf/dsxhi_install.conf`.
- `cd /opt/ibm/dsxhi/bin`
- `sudo python /opt/ibm/dsxhi/bin/install.py`
visudo template:

```
## DSXHI
<non-root_user> ALL=(root) NOPASSWD: /usr/bin/yum install <path-to-rpm/rpm>, /usr/bin/yum erase dsxhi*, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/python /opt/ibm/dsxhi/*
```
Watson Studio interacts with a Spectrum Conductor cluster through the following services:
| Service | Purpose |
|---|---|
| Spectrum Conductor REST services | Retrieve Anaconda instances, environment names, and instance group information. |
| Jupyter Enterprise Gateway | Submit jobs through Jupyter Enterprise Gateway to Spark on Spectrum Conductor. |
Watson Studio user
Every user that is connecting from Watson Studio must be a valid user on the Spectrum Conductor cluster. The recommended way to achieve this is by integrating Watson Studio and the Spectrum Conductor cluster with the same LDAP.
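One way to verify the mapping is to resolve each account with `id` on the edge node. A minimal sketch; the user names are examples, not accounts the product creates:

```shell
# Each Watson Studio user must map to a valid Linux user on the
# Spectrum Conductor cluster. The user names here are examples.
for user in alice bob; do
  if id -u "$user" >/dev/null 2>&1; then
    echo "$user: valid Linux user on this node"
  else
    echo "$user: NOT found on this node"
  fi
done
```

With LDAP integration in place, the same directory backs both Watson Studio and the cluster, so each account should resolve on both sides.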
Installing the service
- Run the RPM installer. The RPM is installed in `/opt/ibm/dsxhi`.
- If you're running the installation as the service user, run `sudo chown -R <serviceuser> /opt/ibm/dsxhi`.
- Create a `/opt/ibm/dsxhi/conf/dsxhi_install.conf` file, using the `/opt/ibm/dsxhi/conf/dsxhi_install.conf.template.SPECTRUM` file as a reference.
- Fill in `dsxhi_install.conf` based on your Spectrum Conductor configuration. Use the template as a guide; it describes what is needed for each field. If you need to use your own custom certificates, see Configuring custom certificates.
- Optional: If you need to set additional properties to control the location of Java, use a shared truststore, or pass additional Java options, update the `/opt/ibm/dsxhi/conf/dsxhi_env.sh` script to include the appropriate values for the environment variables:

```
export JAVA="/usr/jdk64/jdk1.8.0_112/bin/java"
export JAVA_CACERTS=/etc/pki/java/cacerts
export DSXHI_JAVA_OPTS="-Djavax.net.ssl.trustStore=$JAVA_CACERTS"
```
- In `/opt/ibm/dsxhi/bin`, run the `./install.py` script to install the service. The script prompts for input on the following options (alternatively, you can specify the options as flags):
  - Accept the license terms (Hadoop registration uses the same license as Watson Studio). You can also accept the license through the `dsxhi_license_acceptance` property in `dsxhi_install.conf`.
  - The password for the Spectrum Conductor REST endpoints. The value can also be passed through the `--password` or `-p` flag.
  - The master secret for the gateway service. The value can also be passed through the `--dsxhi_gateway_master_password` or `-g` flag.
  - The Java cacerts truststore password. The value can also be passed through the `--dsxhi_gateway_cacerts_password` or `-c` flag.
  - Optional: If the `custom_jks` property in `dsxhi_install.conf` is used, the password that is associated with this file. The value can be passed through the `--custom_jks_password` or `-d` flag.
- After the service is installed, the necessary components, such as the gateway service, the DSXHI integration services, and the Jupyter Enterprise Gateway services, are started.
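The prompted values can also be supplied on the command line for a non-interactive run. The following sketch uses only the flags listed above; the secret values are placeholders you must replace, license acceptance is assumed to be handled through `dsxhi_license_acceptance` in `dsxhi_install.conf`, and the script is guarded so it does nothing on a machine where the DSXHI RPM is not present:

```shell
# Non-interactive invocation sketch. Flag names come from the options
# above; the secret values are placeholders you must replace.
DSXHI_HOME=/opt/ibm/dsxhi
if [ -f "$DSXHI_HOME/bin/install.py" ]; then
  cd "$DSXHI_HOME/bin"
  sudo python ./install.py \
    --password 'conductor-rest-password' \
    --dsxhi_gateway_master_password 'gateway-master-secret' \
    --dsxhi_gateway_cacerts_password 'cacerts-password'
  status="attempted"
else
  # Guard: nothing to do on machines without the DSXHI RPM installed.
  echo "DSXHI is not installed under $DSXHI_HOME; nothing to run."
  status="skipped"
fi
```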
Configuring custom certificates
You can use your existing certificates without modifying the system truststore. The following configuration properties change the DSXHI behavior in these ways:
- `custom_jks`: By default, DSXHI generates a keystore, converts it to a `.crt`, and adds the `.crt` to the Java truststore. With this property, you can instead provide a custom keystore that is used to generate the required `.crt`.
- `dsxhi_cacert`: By default, DSXHI detects the appropriate truststore to use as part of the installation. With the `dsxhi_cacert` property, you can provide any custom truststore (cacerts) to which the DSXHI certificates are added.
- `add_certs_to_truststore`: Controls whether DSXHI adds the host certificate to the truststore. If set to False, you must add the host certificate to the truststore yourself; DSXHI doesn't make any changes to the truststore. If set to True, DSXHI retains its default behavior and adds the host certificate to the Java truststore and on detected data nodes for the gateway and web services.
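Putting these together, the certificate-related portion of `dsxhi_install.conf` might look like the following sketch. The property names are the three described above; the paths and the True/False value syntax are placeholders, so confirm the exact format against the shipped template:

```
# Custom certificate settings in dsxhi_install.conf (paths are placeholders).
# Provide your own keystore instead of letting DSXHI generate one:
custom_jks=/etc/security/dsxhi/custom_keystore.jks
# Truststore (cacerts) to which the DSXHI certificates are added:
dsxhi_cacert=/etc/pki/java/cacerts
# True: DSXHI adds host certificates to the truststore (default behavior);
# False: you manage the truststore yourself.
add_certs_to_truststore=True
```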
Learn more
See Uninstalling the service on a Spectrum Conductor cluster for information on uninstalling the service.
Parent topic: Installing Execution Engine for Apache Hadoop