Setting up Execution Engine for Apache Hadoop to work with Watson Studio Local
You can integrate Watson Studio Local with a Hadoop cluster by using the Execution Engine for Apache Hadoop add-on. The add-on can be configured for high availability and allows data scientists to use Data Refinery, Jupyter notebooks, RStudio, and the Jobs framework in Watson Studio to explore data and to train and deploy models at scale.
Data scientists can leverage the distributed compute on Hadoop with secure access to the data, without needing to move the data out of the Hadoop cluster. The add-on also enables data scientists to use the Python packages and custom libraries available in Watson Studio when executing on Hadoop, without requiring additional packages to be installed on the Hadoop cluster.
- Architecture
- Edge node, port, and service user requirements
- Requirements for a service user installing the Execution Engine for Apache Hadoop add-on
- Hadoop ecosystem services
- Installation and configuration
- Uninstalling the add-on
- Working with Alluxio
- HDP 3.x and Hive
- CDH 6.x and Hive
Architecture

The Execution Engine for Apache Hadoop add-on should be installed on the edge nodes of the Hadoop cluster. The add-on includes services that establish the integration between Watson Studio Local and Hadoop, authenticate requests, and provide remote access to Spark. The add-on requires a service user that has the necessary privileges to submit requests on behalf of the Watson Studio users to WebHDFS, WebHCAT, Spark, and YARN. It also generates a secure URL for each Watson Studio Local cluster that needs to be integrated with the Hadoop cluster.
Edge node, port, and service user requirements
- Edge node hardware requirements
- 8 GB memory
- 2 CPU cores
- 100 GB disk, mounted and available on /var in the local Linux file system. The installation creates the following directories; these locations are not configurable:
  - To store the logs: /var/log/dsxhi, /var/log/livy, and /var/log/livy2.
  - To store the process IDs: /var/run/dsxhi, /var/run/livy, and /var/run/livy2.
- 10 Gb network interface card recommended for multi-tenant environments (1 Gb network interface card if WebHDFS will not be heavily used)
- Edge node software requirements (a verification sketch follows this list)
- Python 2.7
- CDH only: Java Development Kit Version 1.8 installed.
- curl 7.19.7-53 or later.
- HDP only: HDFS Client, YARN Client, Hive Client, Spark/Spark2 Client.
- CDH only: HDFS Gateway Role, YARN Gateway Role, Hive Gateway Role, Spark/Spark2 Gateway Role.
- For clusters with Kerberos security enabled, a SPNEGO keytab.
- For clusters without Kerberos security enabled, write permissions for the yarn user for all directories that a YARN job will write to.
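You can verify most of these prerequisites from the edge node before installing. A minimal check sketch, assuming the Hadoop clients have already been deployed to the node; the expected versions follow the list above:

# Verify Python, curl, and (CDH only) JDK versions
python --version        # expect Python 2.7.x
curl --version          # expect 7.19.7-53 or later
java -version           # CDH only: expect 1.8.x
# Verify the Hadoop clients are usable from this node
hdfs dfs -ls /          # HDFS client
yarn version            # YARN client
spark-submit --version  # Spark/Spark2 client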
- Ports
- An external port for the gateway service.
- An internal port for the Hadoop Integration service.
- Internal ports for Livy for Spark and Livy for Spark 2. These are required only if you want the Execution Engine for Apache Hadoop add-on to install Livy.
- Service user requirements
The Execution Engine for Apache Hadoop add-on runs as a service user. If you install the add-on on multiple edge nodes for high availability, the same service user should be used. This user must meet the following requirements (a configuration sketch follows this list):
- This user should be a valid Linux user on the node where the Execution Engine for Apache Hadoop add-on is installed.
- This user should have a home directory created in HDFS. The directory should have both owner and group assigned as the service user.
- The service user should have the necessary proxyuser privileges in Hadoop.
- The service user should have the necessary proxyuser privileges in WebHCAT.
- If you're using an existing Livy service running on the Hadoop cluster, the service user should have the necessary superuser privileges in the Livy services.
- HDP only: If Hadoop KMS or Ranger KMS is enabled, the service user should have the necessary proxyuser privileges in kms-site.xml.
- For a cluster with Kerberos security enabled, the service user should have a keytab file. This eliminates the need for every Watson Studio Local user to have a valid keytab.
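A minimal sketch of satisfying the HDFS home directory and Hadoop proxyuser requirements, assuming a service user named dsxhi and an edge node host name of edge-node.example.com (both placeholders; in practice, set the proxyuser properties through Ambari or Cloudera Manager rather than editing core-site.xml by hand):

# Create the service user's HDFS home directory (run as the hdfs superuser)
hdfs dfs -mkdir -p /user/dsxhi
hdfs dfs -chown dsxhi:dsxhi /user/dsxhi

# Standard Hadoop proxyuser properties in core-site.xml
<property>
  <name>hadoop.proxyuser.dsxhi.hosts</name>
  <value>edge-node.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.dsxhi.groups</name>
  <value>*</value>
</property>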
Requirements for a service user installing the Execution Engine for Apache Hadoop add-on
## DSXHI - General Installation (replace <service_user>)
<service_user> ALL=(root) NOPASSWD: /usr/bin/yum install dsxhi*, /usr/bin/yum install wshi*, /usr/bin/yum erase dsxhi*, /usr/bin/mkdir -p /etc/dsxhi, /usr/bin/mkdir -p /var/log/dsxhi, /usr/bin/mkdir -p /var/run/dsxhi, /usr/bin/mkdir -p /var/log/livy, /usr/bin/mkdir -p /var/run/livy, /usr/bin/mkdir -p /var/log/livy2, /usr/bin/mkdir -p /var/run/livy2, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/chown * /etc/dsxhi/conf, /usr/bin/chown * /var/log/dsxhi, /usr/bin/chown * /var/run/dsxhi, /usr/bin/chown * /var/log/livy, /usr/bin/chown * /var/run/livy, /usr/bin/chown * /var/log/livy2, /usr/bin/chown * /var/run/livy2, /usr/bin/chmod 400 -R /opt/ibm/dsxhi/security/*, /usr/bin/chmod 755 /var/log/dsxhi, /usr/bin/chmod 755 /var/run/dsxhi, /usr/bin/ln -sf /opt/ibm/dsxhi/gateway/logs /var/log/dsxhi/gateway, /usr/bin/ln -sf /opt/ibm/dsxhi/conf /etc/dsxhi/conf, /usr/bin/ln -sf /var/log/livy /var/log/dsxhi/livy, /usr/bin/ln -sf /var/log/livy2 /var/log/dsxhi/livy2
## DSXHI - Service User Specific Commands (replace <service_user>)
<service_user> ALL=(root) NOPASSWD: /usr/bin/su <service_user> -c hdfs dfs -test -e /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -test -d /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -mkdir /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -chmod 755 /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -chmod 644 /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -put -f /opt/ibm/<service_user>/* /user/<service_user>/*, /usr/bin/su <service_user> -c hdfs dfs -rm -r /user/<service_user>/*, /usr/bin/su <service_user> -c sh /opt/ibm/dsxhi/bin/util/gateway_config.sh *
## DSXHI - Security (only needed if Kerberos security is enabled; replace <service_user>, and replace <service_keytab> with the path to the service user keytab)
<service_user> ALL=(root) NOPASSWD: /usr/bin/chown * /opt/ibm/dsxhi/security/*, /usr/bin/su <service_user> -c kinit -kt /opt/ibm/dsxhi/security/* *, /usr/bin/cp /etc/security/keytabs/spnego.service.keytab /opt/ibm/dsxhi/security/*, /usr/bin/su <service_user> -c /usr/bin/kdestroy, /usr/bin/cp <service_keytab> /opt/ibm/dsxhi/security/*
## DSXHI - HDP
<service_user> ALL=(root) NOPASSWD: /usr/sbin/ambari-agent --version, /usr/jdk64/jdk1.8.0_112/bin/keytool -delete -keystore /usr/jdk64/jdk1.8.0_112/jre/lib/security/cacerts *, /usr/jdk64/jdk1.8.0_112/bin/keytool -exportcert -file dsxhi_rest.crt -keystore dsxhi_rest.jks *, /usr/jdk64/jdk1.8.0_112/bin/keytool -import -file dsxhi_rest.crt -keystore /usr/jdk64/jdk1.8.0_112/jre/lib/security/cacerts *
## DSXHI - CDH
<service_user> ALL=(root) NOPASSWD: /usr/java/jdk1.7.0_67-cloudera/bin/keytool -delete -keystore /usr/java/jdk1.7.0_67-cloudera/jre/lib/security/cacerts *, /usr/java/jdk1.7.0_67-cloudera/bin/keytool -exportcert -file dsxhi_rest.crt -keystore dsxhi_rest.jks *, /usr/java/jdk1.7.0_67-cloudera/bin/keytool -import -file dsxhi_rest.crt -keystore /usr/java/jdk1.7.0_67-cloudera/jre/lib/security/cacerts *
Example of the /etc/sudoers file for installation on an HDP cluster with Kerberos security enabled:
## DSXHI - General Installation
dsxhi ALL=(root) NOPASSWD: /usr/bin/yum install dsxhi*, /usr/bin/yum erase dsxhi*, /usr/bin/mkdir -p /etc/dsxhi, /usr/bin/mkdir -p /var/log/dsxhi, /usr/bin/mkdir -p /var/run/dsxhi, /usr/bin/mkdir -p /var/log/livy, /usr/bin/mkdir -p /var/run/livy, /usr/bin/mkdir -p /var/log/livy2, /usr/bin/mkdir -p /var/run/livy2, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/chown * /etc/dsxhi/conf, /usr/bin/chown * /var/log/dsxhi, /usr/bin/chown * /var/run/dsxhi, /usr/bin/chown * /var/log/livy, /usr/bin/chown * /var/run/livy, /usr/bin/chown * /var/log/livy2, /usr/bin/chown * /var/run/livy2, /usr/bin/chmod 400 -R /opt/ibm/dsxhi/security/*, /usr/bin/chmod 755 /var/log/dsxhi, /usr/bin/chmod 755 /var/run/dsxhi, /usr/bin/ln -sf /opt/ibm/dsxhi/gateway/logs /var/log/dsxhi/gateway, /usr/bin/ln -sf /opt/ibm/dsxhi/conf /etc/dsxhi/conf, /usr/bin/ln -sf /var/log/livy /var/log/dsxhi/livy, /usr/bin/ln -sf /var/log/livy2 /var/log/dsxhi/livy2
## DSXHI - Service User Specific Commands
dsxhi ALL=(root) NOPASSWD: /usr/bin/su dsxhi -c hdfs dfs -test -e /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -test -d /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -mkdir /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -chmod 755 /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -chmod 644 /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -put -f /opt/ibm/dsxhi/* /user/dsxhi/*, /usr/bin/su dsxhi -c hdfs dfs -rm -r /user/dsxhi/*, /usr/bin/su dsxhi -c sh /opt/ibm/dsxhi/bin/util/gateway_config.sh *
## DSXHI - Security
dsxhi ALL=(root) NOPASSWD: /usr/bin/chown * /opt/ibm/dsxhi/security/*, /usr/bin/su dsxhi -c kinit -kt /opt/ibm/dsxhi/security/* *, /usr/bin/cp /etc/security/keytabs/spnego.service.keytab /opt/ibm/dsxhi/security/*, /usr/bin/su dsxhi -c /usr/bin/kdestroy, /usr/bin/cp /etc/security/svckeytabs/dsxhi.keytab /opt/ibm/dsxhi/security/*
## DSXHI - HDP
dsxhi ALL=(root) NOPASSWD: /usr/sbin/ambari-agent --version, /usr/jdk64/jdk1.8.0_112/bin/keytool -delete -keystore /usr/jdk64/jdk1.8.0_112/jre/lib/security/cacerts *, /usr/jdk64/jdk1.8.0_112/bin/keytool -exportcert -file dsxhi_rest.crt -keystore dsxhi_rest.jks *, /usr/jdk64/jdk1.8.0_112/bin/keytool -import -file dsxhi_rest.crt -keystore /usr/jdk64/jdk1.8.0_112/jre/lib/security/cacerts *
## DSXHI - CDH
dsxhi ALL=(root) NOPASSWD: /usr/java/jdk1.7.0_67-cloudera/bin/keytool -delete -keystore /usr/java/jdk1.7.0_67-cloudera/jre/lib/security/cacerts *, /usr/java/jdk1.7.0_67-cloudera/bin/keytool -exportcert -file dsxhi_rest.crt -keystore dsxhi_rest.jks *, /usr/java/jdk1.7.0_67-cloudera/bin/keytool -import -file dsxhi_rest.crt -keystore /usr/java/jdk1.7.0_67-cloudera/jre/lib/security/cacerts *
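After updating /etc/sudoers, you can confirm the grants took effect. A quick sanity check, assuming the service user is dsxhi:

# Run as root: list the NOPASSWD commands granted to the service user
sudo -l -U dsxhi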
Hadoop ecosystem services
| Service | Purpose |
|---|---|
| WebHDFS | Browse and preview HDFS data. |
| WebHCAT | Browse and preview Hive data (Watson Studio Local 1.2.x only). |
| Livy for Spark | Submit jobs to Spark on the Hadoop cluster. |
| Livy for Spark2 | Submit jobs to Spark2 on the Hadoop cluster. |
Watson Studio Local user
Every user connecting from Watson Studio Local should be a valid user on the Hadoop cluster. The recommended way to achieve this is by integrating Watson Studio Local and the Hadoop cluster with the same LDAP.
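A quick way to confirm that a Watson Studio Local user resolves as a valid user on the Hadoop cluster; jdoe is a hypothetical user name:

id jdoe            # confirms the user exists as a Linux/LDAP user on the node
hdfs groups jdoe   # confirms HDFS resolves the user's group membership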
Supported Hadoop versions
HDP
- HDP version 2.6.2 and later fixpacks
- HDP version 3.0.1, 3.1
CDH
- CDH version 5.12 and later fixpacks
- CDH version 6.0.1
Platforms supported
The Execution Engine for Apache Hadoop add-on is supported on all x86 and Power platforms supported by the HDP and CDH versions listed above.
Installation and configuration
- Download the Execution Engine for Apache Hadoop add-on
- The Execution Engine for Apache Hadoop add-on is located in Passport Advantage. Find the eAssembly part number CJ59WEN in Passport Advantage to locate this add-on. The Execution Engine for Apache Hadoop add-on rpm is wshi-<version>-noarch.rpm. If you install the add-on on multiple edge nodes for high availability, the same version of the add-on should be used for all installations.
- Install the Execution Engine for Apache Hadoop add-on
Note: For Execution Engine for Apache Hadoop add-on version 2.0.1 and later: If Watson Studio Local 2.0 or 2.0.0.1 clusters need to integrate with the Hadoop cluster for Data Refinery jobs, the Hadoop admin must copy additional jars available under /opt/ibm/dsxhi/dist/connectors to the following HDFS location: /user/<service user>/lib/spark/connectors.
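A sketch of that copy, run as the service user; <service_user> is a placeholder, and the source and HDFS paths follow the note above:

hdfs dfs -mkdir -p /user/<service_user>/lib/spark/connectors
hdfs dfs -put -f /opt/ibm/dsxhi/dist/connectors/* /user/<service_user>/lib/spark/connectors/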
- Run the RPM installer. The rpm is installed in /opt/ibm/dsxhi.
- If you're running the install as the service user, run sudo chown <serviceuser> -R /opt/ibm/dsxhi.
- Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file using the /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.HDP or /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.CDH file as a reference.
- When you install on a Power platform, set package_installer_tool=yum and packages=lapack for the installer to install the necessary packages needed for Python environments.
- Optional: If you need to set additional properties to control the location of Java, use a shared truststore, or pass additional Java options, create a /opt/ibm/dsxhi/conf/dsxhi_env.sh script to export the environment variables:
  export JAVA="/usr/jdk64/jdk1.8.0_112/bin/java"
  export JAVA_CACERTS=/etc/pki/java/cacerts
  export DSXHI_JAVA_OPTS="-Djavax.net.ssl.trustStore=$JAVA_CACERTS"
- In /opt/ibm/dsxhi/bin, run the ./install.py script to install the add-on. The script prompts for inputs on the following options (alternatively, you can specify the options as flags; see the example after these steps):
  - Accept the license terms (Hadoop registration uses the same license as Watson Studio Local). You can also accept the license through the dsxhi_license_acceptance property in dsxhi_install.conf.
  - If the Ambari URL is specified in dsxhi_install.conf, you will be prompted for the password for the cluster administrator. The value can also be passed through the --password flag.
  - For the master secret for the gateway service, the value can also be passed through the --dsxhi_gateway_master_password flag.
  - If the default password for the Java cacerts truststore was changed, the password can be passed through the --dsxhi_java_cacerts_password flag.
- The installation runs pre-checks to validate the prerequisites. If cluster_manager_url is not specified in the dsxhi_install.conf file, the pre-checks on the proxyuser settings are not performed. After a successful installation, the necessary components, such as the gateway service and the Hadoop Integration service, and the optional components (Livy for Spark and Livy for Spark 2), will be started.
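For example, a non-interactive installation can pass the values described above as flags; the values shown are placeholders:

cd /opt/ibm/dsxhi/bin
./install.py --password <ambari_admin_password> \
             --dsxhi_gateway_master_password <master_secret> \
             --dsxhi_java_cacerts_password <cacerts_password>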
- Adding certificates for SSL-enabled services
If WebHDFS SSL is enabled after the add-on is installed, the Hadoop admin should add the certificates for each namenode and datanode to the trust store of the gateway service and update the topology files.
- In /opt/ibm/dsxhi/bin/util, run ./add_cert.sh https://host:port for each of the namenodes and datanodes (see the example after this list).
- Manually update /opt/ibm/dsxhi/gateway/conf/topologies/*.xml to use the HTTPS URL and port, following this example:
  <service>
    <role>WEBHDFS</role>
    <url>https://NamenodeHOST:PORT</url>
  </service>
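For example, assuming the Hadoop 2.x default HTTPS ports (50470 for the namenode, 50475 for datanodes) and hypothetical host names; substitute your own hosts and ports:

cd /opt/ibm/dsxhi/bin/util
./add_cert.sh https://namenode1.example.com:50470
./add_cert.sh https://datanode1.example.com:50475
./add_cert.sh https://datanode2.example.com:50475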
- Managing services
Periodically, the Hadoop admin must manage the Execution Engine for Apache Hadoop add-on service.
- Check the status of the add-on services
- In /opt/ibm/dsxhi/bin, run ./status.py to check the status of services.
- Start the add-on services
- In /opt/ibm/dsxhi/bin, run ./start.py to start the services.
- Stop the add-on services
- In /opt/ibm/dsxhi/bin, run ./stop.py to stop the services.
- Managing access for Watson Studio Local
To maintain control over the access to an Execution Engine for Apache Hadoop add-on service, the Hadoop admin should maintain a list of known Watson Studio Local clusters that can access the add-on. A Watson Studio Local cluster is known by its URL, which should be passed in when adding to, refreshing, or deleting from the known list. A comma-separated list of Watson Studio Local clusters can be passed for each of the operations. Regardless of the order in which the arguments for add and delete are specified, the deletes are applied first and then the adds (see the example after this list).
- Add Watson Studio Local clusters to the known list
- In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --add "url1,url2...urlN". Once a Watson Studio Local cluster is added to the known list, the necessary authentication will be set up and a secure URL will be generated for the Watson Studio cluster.
- Refresh Watson Studio Local clusters in the known list
- If the Watson Studio cluster was re-installed, you can refresh the information. In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --refresh "url1,url2...urlN".
- Delete Watson Studio Local clusters from the known list
- In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --delete "url1,url2...urlN".
- View the known list of Watson Studio clusters
- In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --list to view the Watson Studio clusters and the associated URLs.
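For example, with a hypothetical Watson Studio Local cluster URL:

cd /opt/ibm/dsxhi/bin
./manage_known_dsx.py --add "https://wsl-cluster1.example.com"
./manage_known_dsx.py --list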
After the Hadoop administrator adds a Watson Studio cluster to the known list maintained by the Execution Engine for Apache Hadoop add-on, the Watson Studio Local admin can register the add-on on the Watson Studio cluster using the secure URL and the add-on service user. Learn how to register Hadoop clusters.
Uninstalling the add-on
To uninstall the add-on and remove the files from /opt/ibm/dsxhi, run
the ./uninstall.py script in /opt/ibm/dsxhi/bin. The
uninstallation logs are stored in /var/log/dsxhi,
/var/log/livy, and /var/log/livy2.
If you're uninstalling the add-on from the last edge node on the Hadoop cluster, pass the
--removeHDFSArtifacts switch to remove the libraries and archives on HDFS that are
shared by all the installations.
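For example, when uninstalling from the last edge node:

cd /opt/ibm/dsxhi/bin
./uninstall.py --removeHDFSArtifacts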
Logs
- The component logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2.
- The component PIDs are stored in /var/run/dsxhi, /var/run/livy, and /var/run/livy2.
- The gateway service logs are stored in /opt/ibm/dsxhi/gateway/logs/ (/var/log/dsxhi/gateway is a symlink to this directory).
- The log level for the gateway service can be set by editing /opt/ibm/dsxhi/gateway/conf/gateway-log4j.properties and setting the appropriate level in the log4j.logger.org.apache.knox.gateway property, as shown in the example below.
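For example, to enable debug logging for the gateway service (DEBUG is a standard log4j level):

log4j.logger.org.apache.knox.gateway=DEBUG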
Working with Alluxio
To connect to Alluxio using remote Spark with Livy, set the fs.alluxio.impl configuration property for the remote Spark in the Ambari web client. See Running Spark on Alluxio for details.
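A sketch of that property, assuming the standard Alluxio client file system class; consult the Alluxio documentation for the class name that matches your Alluxio version:

fs.alluxio.impl=alluxio.hadoop.FileSystem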
HDP 3.x and Hive
The Browse and Preview feature for Hive tables is not supported.
CDH 6.x and Hive
The Browse and Preview feature for Hive tables is not supported.