Setting up Execution Engine for Apache Hadoop to work with Watson Studio Local
You can integrate Watson Studio Local with a Hadoop cluster by using the Execution Engine for Apache Hadoop add-on. The add-on can be configured for high availability and allows data scientists to use Data Refinery, Jupyter notebooks, RStudio, and the Jobs framework in Watson Studio to explore data and to train and deploy models at scale.
Data scientists can leverage the distributed compute on Hadoop with secure access to the data, without needing to move the data out of the Hadoop cluster. The add-on also enables data scientists to use the Python packages and custom libraries available in Watson Studio when executing on Hadoop, without requiring additional packages to be installed on the Hadoop cluster.
- Architecture
- Edge node, port, and service user requirements
- Requirements for a service user installing the Execution Engine for Apache Hadoop add-on
- Hadoop ecosystem services
- Installation and configuration
- Uninstalling the add-on
- Working with Alluxio
- HDP 3.x and Hive
- CDH 6.x and Hive
Architecture

The Execution Engine for Apache Hadoop add-on should be installed on the edge nodes of the Hadoop cluster. The add-on includes services that establish the integration between Watson Studio Local and Hadoop, authenticate requests, and provide remote access to Spark. The add-on requires a service user that has the necessary privileges to submit requests on behalf of the Watson Studio users to WebHDFS, WebHCAT, Spark, and YARN. It also generates a secure URL for each Watson Studio Local cluster that needs to be integrated with the Hadoop cluster.
Edge node, port, and service user requirements
- Edge node hardware requirements
- 8 GB memory
- 2 CPU cores
- 100 GB disk, mounted and available on /var in the local Linux file system. The installation creates the following directories; these locations are not configurable:
  - To store the logs: /var/log/dsxhi, /var/log/livy, and /var/log/livy2.
  - To store the process IDs: /var/run/dsxhi, /var/run/livy, and /var/run/livy2.
- 10 Gb network interface card recommended for multi-tenant environments (1 Gb network interface card if WebHDFS will not be heavily used)
- Edge node software requirements (a verification sketch follows this list)
- Python 2.7
- CDH only: Java Development Kit Version 1.8 installed.
- curl 7.19.7-53 or later.
- HDP only: HDFS Client, YARN Client, Hive Client, Spark/Spark2 Client.
- CDH only: HDFS Gateway Role, YARN Gateway Role, Hive Gateway Role, Spark/Spark2 Gateway Role.
- For clusters with Kerberos security enabled, a SPNEGO keytab.
- For clusters without Kerberos security enabled, write permissions for the yarn user for all directories that a YARN job will write to.
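You can verify most of these prerequisites from the edge node before installing. A minimal check sketch, assuming the Hadoop clients have already been deployed to the node; the expected versions follow the list above:

# Verify Python, curl, and (CDH only) JDK versions
python --version        # expect Python 2.7.x
curl --version          # expect 7.19.7-53 or later
java -version           # CDH only: expect 1.8.x
# Verify the Hadoop clients are usable from this node
hdfs dfs -ls /          # HDFS client
yarn version            # YARN client
spark-submit --version  # Spark/Spark2 client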
- Ports
- An external port for the gateway service.
- An internal port for the Hadoop Integration service.
- Internal ports for Livy for Spark and Livy for Spark 2. These are required only if you want the Execution Engine for Apache Hadoop add-on to install Livy.
- Service user requirements
The Execution Engine for Apache Hadoop add-on runs as a service user. If you install the add-on on multiple edge nodes for high availability, the same service user should be used. This user must meet the following requirements (a configuration sketch follows this list):
- This user should be a valid Linux user on the node where the Execution Engine for Apache Hadoop add-on is installed.
- This user should have a home directory created in HDFS. The directory should have both owner and group assigned as the service user.
- The service user should have the necessary proxyuser privileges in Hadoop.
- The service user should have the necessary proxyuser privileges in WebHCAT.
- If you're using an existing Livy service running on the Hadoop cluster, the service user should have the necessary superuser privileges in the Livy services.
- HDP only: If Hadoop KMS or Ranger KMS is enabled, the service user should have the necessary proxyuser privileges in kms-site.xml.
- For a cluster with Kerberos security enabled, the service user should have a keytab file. This eliminates the need for every Watson Studio Local user to have a valid keytab.
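A minimal sketch of satisfying the HDFS home directory and Hadoop proxyuser requirements, assuming a service user named dsxhi and an edge node host name of edge-node.example.com (both placeholders; in practice, set the proxyuser properties through Ambari or Cloudera Manager rather than editing core-site.xml by hand):

# Create the service user's HDFS home directory (run as the hdfs superuser)
hdfs dfs -mkdir -p /user/dsxhi
hdfs dfs -chown dsxhi:dsxhi /user/dsxhi

# Standard Hadoop proxyuser properties in core-site.xml
<property>
  <name>hadoop.proxyuser.dsxhi.hosts</name>
  <value>edge-node.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.dsxhi.groups</name>
  <value>*</value>
</property>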
Requirements for a service user installing the Execution Engine for Apache Hadoop add-on
## DSXHI - General Installation (replace <service_user>)
<service_user> ALL=(root) NOPASSWD: /usr/bin/yum install dsxhi*, /usr/bin/yum install wshi*, /usr/bin/yum erase dsxhi*, /usr/bin/mkdir -p /etc/dsxhi, /usr/bin/mkdir -p /var/log/dsxhi, /usr/bin/mkdir -p /var/run/dsxhi, /usr/bin/mkdir -p /var/log/livy, /usr/bin/mkdir -p /var/run/livy, /usr/bin/mkdir -p /var/log/livy2, /usr/bin/mkdir -p /var/run/livy2, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/chown * /etc/dsxhi/conf, /usr/bin/chown * /var/log/dsxhi, /usr/bin/chown * /var/run/dsxhi, /usr/bin/chown * /var/log/livy, /usr/bin/chown * /var/run/livy, /usr/bin/chown * /var/log/livy2, /usr/bin/chown * /var/run/livy2, /usr/bin/chmod 400 -R /opt/ibm/dsxhi/security/*, /usr/bin/chmod 755 /var/log/dsxhi, /usr/bin/chmod 755 /var/run/dsxhi, /usr/bin/ln -sf /opt/ibm/dsxhi/gateway/logs /var/log/dsxhi/gateway, /usr/bin/ln -sf /opt/ibm/dsxhi/conf /etc/dsxhi/conf, /usr/bin/ln -sf /var/log/livy /var/log/dsxhi/livy, /usr/bin/ln -sf /var/log/livy2 /var/log/dsxhi/livy2
## DSXHI - Service User Specific Commands (replace <service_user>)
<service_user> ALL=(root) NOPASSWD: /usr/bin/su <service_user> -c hdfs dfs -test -e /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -test -d /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -mkdir /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -chmod 755 /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -chmod 644 /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -put -f /opt/ibm/<service_user>/* /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -rm -r /user/<service_user>/*, /usr/bin/su <service_user> -c sh /opt/ibm/dsxhi/bin/util/gateway_config.sh *
## DSXHI - Security (only needed if Kerberos security is enabled; replace <service_user>, and replace <service_keytab> with the path to the service user keytab)
<service_user> ALL=(root) NOPASSWD: /usr/bin/chown * /opt/ibm/dsxhi/security/*, /usr/bin/su <service_user> -c kinit -kt /opt/ibm/dsxhi/security/* *, /usr/bin/cp /etc/security/keytabs/spnego.service.keytab /opt/ibm/dsxhi/security/*, /usr/bin/su <service_user> -c /usr/bin/kdestroy, /usr/bin/cp <service_keytab> /opt/ibm/dsxhi/security/*
## DSXHI - HDP
<service_user> ALL=(root) NOPASSWD: /usr/sbin/ambari-agent --version, /usr/jdk64/jdk1.8.0_112/bin/keytool -delete -keystore /usr/jdk64/jdk1.8.0_112/jre/lib/security/cacerts *, /usr/jdk64/jdk1.8.0_112/bin/keytool -exportcert -file dsxhi_rest.crt -keystore dsxhi_rest.jks *, /usr/jdk64/jdk1.8.0_112/bin/keytool -import -file dsxhi_rest.crt -keystore /usr/jdk64/jdk1.8.0_112/jre/lib/security/cacerts *
## DSXHI - CDH
<service_user> ALL=(root) NOPASSWD: /usr/java/jdk1.7.0_67-cloudera/bin/keytool -delete -keystore /usr/java/jdk1.7.0_67-cloudera/jre/lib/security/cacerts *, /usr/java/jdk1.7.0_67-cloudera/bin/keytool -exportcert -file dsxhi_rest.crt -keystore dsxhi_rest.jks *, /usr/java/jdk1.7.0_67-cloudera/bin/keytool -import -file dsxhi_rest.crt -keystore /usr/java/jdk1.7.0_67-cloudera/jre/lib/security/cacerts *
Example of the /etc/sudoers file for installation on an HDP cluster with Kerberos security enabled:
## DSXHI - General Installation
dsxhi ALL=(root) NOPASSWD: /usr/bin/yum install dsxhi*, /usr/bin/yum erase dsxhi*, /usr/bin/mkdir -p /etc/dsxhi, /usr/bin/mkdir -p /var/log/dsxhi, /usr/bin/mkdir -p /var/run/dsxhi, /usr/bin/mkdir -p /var/log/livy, /usr/bin/mkdir -p /var/run/livy, /usr/bin/mkdir -p /var/log/livy2, /usr/bin/mkdir -p /var/run/livy2, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/chown * /etc/dsxhi/conf, /usr/bin/chown * /var/log/dsxhi, /usr/bin/chown * /var/run/dsxhi, /usr/bin/chown * /var/log/livy, /usr/bin/chown * /var/run/livy, /usr/bin/chown * /var/log/livy2, /usr/bin/chown * /var/run/livy2, /usr/bin/chmod 400 -R /opt/ibm/dsxhi/security/*, /usr/bin/chmod 755 /var/log/dsxhi, /usr/bin/chmod 755 /var/run/dsxhi, /usr/bin/ln -sf /opt/ibm/dsxhi/gateway/logs /var/log/dsxhi/gateway, /usr/bin/ln -sf /opt/ibm/dsxhi/conf /etc/dsxhi/conf, /usr/bin/ln -sf /var/log/livy /var/log/dsxhi/livy, /usr/bin/ln -sf /var/log/livy2 /var/log/dsxhi/livy2
## DSXHI - Service User Specific Commands
dsxhi ALL=(root) NOPASSWD: /usr/bin/su dsxhi -c hdfs dfs -test -e /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -test -d /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -mkdir /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -chmod 755 /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -chmod 644 /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -put -f /opt/ibm/dsxhi/* /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -rm -r /user/dsxhi/*, /usr/bin/su dsxhi -c sh /opt/ibm/dsxhi/bin/util/gateway_config.sh *
## DSXHI - Security
dsxhi ALL=(root) NOPASSWD: /usr/bin/chown * /opt/ibm/dsxhi/security/*, /usr/bin/su dsxhi -c kinit -kt /opt/ibm/dsxhi/security/* *, /usr/bin/cp /etc/security/keytabs/spnego.service.keytab /opt/ibm/dsxhi/security/*, /usr/bin/su dsxhi -c /usr/bin/kdestroy, /usr/bin/cp /etc/security/svckeytabs/dsxhi.keytab /opt/ibm/dsxhi/security/*
## DSXHI - HDP
dsxhi ALL=(root) NOPASSWD: /usr/sbin/ambari-agent --version, /usr/jdk64/jdk1.8.0_112/bin/keytool -delete -keystore /usr/jdk64/jdk1.8.0_112/jre/lib/security/cacerts *, /usr/jdk64/jdk1.8.0_112/bin/keytool -exportcert -file dsxhi_rest.crt -keystore dsxhi_rest.jks *, /usr/jdk64/jdk1.8.0_112/bin/keytool -import -file dsxhi_rest.crt -keystore /usr/jdk64/jdk1.8.0_112/jre/lib/security/cacerts *
## DSXHI - CDH
dsxhi ALL=(root) NOPASSWD: /usr/java/jdk1.7.0_67-cloudera/bin/keytool -delete -keystore /usr/java/jdk1.7.0_67-cloudera/jre/lib/security/cacerts *, /usr/java/jdk1.7.0_67-cloudera/bin/keytool -exportcert -file dsxhi_rest.crt -keystore dsxhi_rest.jks *, /usr/java/jdk1.7.0_67-cloudera/bin/keytool -import -file dsxhi_rest.crt -keystore /usr/java/jdk1.7.0_67-cloudera/jre/lib/security/cacerts *
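After updating /etc/sudoers, you can confirm the grants took effect. A quick sanity check, assuming the service user is dsxhi:

# Run as root: list the NOPASSWD commands granted to the service user
sudo -l -U dsxhi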
Hadoop ecosystem services
| Service | Purpose |
|---|---|
| WebHDFS | Browse and preview HDFS data. |
| WebHCAT | Browse and preview Hive data (Watson Studio Local 1.2.x only). |
| Livy for Spark | Submit jobs to Spark on the Hadoop cluster. |
| Livy for Spark2 | Submit jobs to Spark2 on the Hadoop cluster. |
Watson Studio Local user
Every user connecting from Watson Studio Local should be a valid user on the Hadoop cluster. The recommended way to achieve this is by integrating Watson Studio Local and the Hadoop cluster with the same LDAP.
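A quick way to confirm that a Watson Studio Local user resolves as a valid user on the Hadoop cluster; jdoe is a hypothetical user name:

id jdoe            # confirms the user exists as a Linux/LDAP user on the node
hdfs groups jdoe   # confirms HDFS resolves the user's group membership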
Supported Hadoop versions
HDP
- HDP version 2.6.2 and later fixpacks
- HDP version 3.0.1, 3.1
CDH
- CDH version 5.12 and later fixpacks
- CDH version 6.0.1
Platforms supported
The Execution Engine for Apache Hadoop add-on is supported on all x86 and Power platforms supported by the HDP and CDH versions listed above.
Installation and configuration
- Download the Execution Engine for Apache Hadoop add-on
- The Execution Engine for Apache Hadoop add-on is located in Passport Advantage. Find the eAssembly part number CJ59WEN in Passport Advantage to locate this add-on. The Execution Engine for Apache Hadoop add-on rpm is wshi-<version>-noarch.rpm. If you install the add-on on multiple edge nodes for high availability, the same version of the add-on should be used for all installations.
- Install the Execution Engine for Apache Hadoop add-on
Note: For Execution Engine for Apache Hadoop add-on version 2.0.1 and later: If Watson Studio Local 2.0 or 2.0.0.1 clusters need to integrate with the Hadoop cluster for Data Refinery jobs, the Hadoop admin must copy additional jars available under /opt/ibm/dsxhi/dist/connectors to the following HDFS location: /user/<service user>/lib/spark/connectors.
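A sketch of that copy, run as the service user; <service_user> is a placeholder, and the source and HDFS paths follow the note above:

hdfs dfs -mkdir -p /user/<service_user>/lib/spark/connectors
hdfs dfs -put -f /opt/ibm/dsxhi/dist/connectors/* /user/<service_user>/lib/spark/connectors/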
- Run the RPM installer. The rpm is installed in /opt/ibm/dsxhi.
- If you're running the install as the service user, run sudo chown <serviceuser> -R /opt/ibm/dsxhi.
- Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file using the /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.HDP or /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.CDH file as a reference.
- When you install on a Power platform, set package_installer_tool=yum and packages=lapack for the installer to install the necessary packages needed for Python environments.
- Optional: If you need to set additional properties to control the location of Java, use a shared truststore, or pass additional Java options, create a /opt/ibm/dsxhi/conf/dsxhi_env.sh script to export the environment variables:
  export JAVA="/usr/jdk64/jdk1.8.0_112/bin/java"
  export JAVA_CACERTS=/etc/pki/java/cacerts
  export DSXHI_JAVA_OPTS="-Djavax.net.ssl.trustStore=$JAVA_CACERTS"
- In /opt/ibm/dsxhi/bin, run the ./install.py script to install the add-on. The script prompts for inputs on the following options (alternatively, you can specify the options as flags; see the example after these steps):
  - Accept the license terms (Hadoop registration uses the same license as Watson Studio Local). You can also accept the license through the dsxhi_license_acceptance property in dsxhi_install.conf.
  - If the Ambari URL is specified in dsxhi_install.conf, you will be prompted for the password for the cluster administrator. The value can also be passed through the --password flag.
  - For the master secret for the gateway service, the value can also be passed through the --dsxhi_gateway_master_password flag.
  - If the default password for the Java cacerts truststore was changed, the password can be passed through the --dsxhi_java_cacerts_password flag.
- The installation runs pre-checks to validate the prerequisites. If cluster_manager_url is not specified in the dsxhi_install.conf file, the pre-checks on the proxyuser settings are not performed. After a successful installation, the necessary components, such as the gateway service and the Hadoop Integration service, and the optional components (Livy for Spark and Livy for Spark 2), will be started.
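For example, a non-interactive installation can pass the values described above as flags; the values shown are placeholders:

cd /opt/ibm/dsxhi/bin
./install.py --password <ambari_admin_password> \
             --dsxhi_gateway_master_password <master_secret> \
             --dsxhi_java_cacerts_password <cacerts_password>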
- Adding certificates for SSL-enabled services
If WebHDFS SSL is enabled after the add-on is installed, the Hadoop admin should add the certificates for each namenode and datanode to the trust store of the gateway service and update the topology files.
- In /opt/ibm/dsxhi/bin/util, run ./add_cert.sh https://host:port for each of the namenodes and datanodes (see the example after this list).
- Manually update /opt/ibm/dsxhi/gateway/conf/topologies/*.xml to use the HTTPS URL and port, following this example:
  <service>
    <role>WEBHDFS</role>
    <url>https://NamenodeHOST:PORT</url>
  </service>
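For example, assuming the Hadoop 2.x default HTTPS ports (50470 for the namenode, 50475 for datanodes) and hypothetical host names; substitute your own hosts and ports:

cd /opt/ibm/dsxhi/bin/util
./add_cert.sh https://namenode1.example.com:50470
./add_cert.sh https://datanode1.example.com:50475
./add_cert.sh https://datanode2.example.com:50475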
- Managing services
Periodically, the Hadoop admin must manage the Execution Engine for Apache Hadoop add-on service.
- Check the status of the add-on services
- In /opt/ibm/dsxhi/bin, run ./status.py to check the status of services.
- Start the add-on services
- In /opt/ibm/dsxhi/bin, run ./start.py to start the services.
- Stop the add-on services
- In /opt/ibm/dsxhi/bin, run ./stop.py to stop the services.
- Managing access for Watson Studio Local
To maintain control over the access to an Execution Engine for Apache Hadoop add-on service, the Hadoop admin should maintain a list of known Watson Studio Local clusters that can access the add-on. A Watson Studio Local cluster is known by its URL, which should be passed in when adding to, refreshing, or deleting from the known list. A comma-separated list of Watson Studio Local clusters can be passed for each of the operations. Regardless of the order in which the arguments for add and delete are specified, the deletes are applied first and then the adds (see the example after this list).
- Add Watson Studio Local clusters to the known list
- In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --add "url1,url2...urlN". Once a Watson Studio Local cluster is added to the known list, the necessary authentication will be set up and a secure URL will be generated for the Watson Studio cluster.
- Refresh Watson Studio Local clusters in the known list
- If the Watson Studio cluster was re-installed, you can refresh the information. In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --refresh "url1,url2...urlN".
- Delete Watson Studio Local clusters from the known list
- In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --delete "url1,url2...urlN".
- View the known list of Watson Studio clusters
- In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --list to view the Watson Studio clusters and the associated URLs.
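For example, with a hypothetical Watson Studio Local cluster URL:

cd /opt/ibm/dsxhi/bin
./manage_known_dsx.py --add "https://wsl-cluster1.example.com"
./manage_known_dsx.py --list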
After the Hadoop administrator adds a Watson Studio cluster to the known list maintained by the Execution Engine for Apache Hadoop add-on, the Watson Studio Local admin can register the add-on on the Watson Studio cluster using the secure URL and the add-on service user. Learn how to register Hadoop clusters.
Uninstalling the add-on
To uninstall the add-on and remove the files from /opt/ibm/dsxhi, run
the ./uninstall.py script in /opt/ibm/dsxhi/bin. The
uninstallation logs are stored in /var/log/dsxhi,
/var/log/livy, and /var/log/livy2.
If you're uninstalling the add-on from the last edge node on the Hadoop cluster, pass the
--removeHDFSArtifacts switch to remove the libraries and archives on HDFS that are
shared by all the installations.
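For example, when uninstalling from the last edge node:

cd /opt/ibm/dsxhi/bin
./uninstall.py --removeHDFSArtifacts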
Logs
- The component logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2.
- The component PIDs are stored in /var/run/dsxhi, /var/run/livy, and /var/run/livy2.
- The gateway service logs are stored in /opt/ibm/dsxhi/gateway/logs/ (/var/log/dsxhi/gateway is a symlink to this directory).
- The log level for the gateway service can be set by editing /opt/ibm/dsxhi/gateway/conf/gateway-log4j.properties and setting the appropriate level in the log4j.logger.org.apache.knox.gateway property, as shown in the example below.
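For example, to enable debug logging for the gateway service (DEBUG is a standard log4j level):

log4j.logger.org.apache.knox.gateway=DEBUG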
Working with Alluxio
To connect to Alluxio using remote Spark with Livy, set the fs.alluxio.impl configuration property for the remote Spark in the Ambari web client. See Running Spark on Alluxio for details.
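A sketch of that property, assuming the standard Alluxio client file system class; consult the Alluxio documentation for the class name that matches your Alluxio version:

fs.alluxio.impl=alluxio.hadoop.FileSystem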
HDP 3.x and Hive
The Browse and Preview feature for Hive tables is not supported.
CDH 6.x and Hive
The Browse and Preview feature for Hive tables is not supported.