Installing the service on Apache Hadoop clusters

Before a project administrator can install Execution Engine for Apache Hadoop on the Hadoop cluster, the service must first be installed on Cloud Pak for Data. Before you install the service, review the following requirements and make sure that you are aware of the ecosystem services and the supported Hadoop versions and platforms:

Requirements

System requirements for installing Execution Engine for Apache Hadoop

Requirements for a service user installing the Execution Engine for Apache Hadoop service

If you plan to install the Execution Engine for Apache Hadoop service as the service user, the keytab file must be owned by the service user.
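
For example, you can set the ownership with chown and restrict read access, which is a common practice for keytabs. The keytab path in this sketch is hypothetical; substitute your own location:

# Make the service user the owner of its keytab (hypothetical path)
sudo chown svc_dsxhi /etc/security/keytabs/svc_dsxhi.service.keytab
# Restrict the keytab so that only the service user can read it
sudo chmod 600 /etc/security/keytabs/svc_dsxhi.service.keytab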

Service user example

Use the following example to set up proxyuser settings in core-site.xml for a service user (in this example, svc_dsxhi):

<property>
   <name>hadoop.proxyuser.svc_dsxhi.hosts</name>
   <value>node1.mycompany.com,node2.mycompany.com</value>
</property>
<property>
   <name>hadoop.proxyuser.svc_dsxhi.groups</name>
   <value>groupa,groupb</value>
</property>
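
After you update core-site.xml, the new proxyuser settings must be reloaded on the NameNode. Pushing the change through Ambari or Cloudera Manager typically handles this; on a manually managed cluster you can refresh the settings without a restart by using the standard HDFS admin command:

# Reload the proxyuser (superuser group) mappings on the NameNode
hdfs dfsadmin -refreshSuperUserGroupsConfiguration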

Steps for DSXHI non-root installation:

If you plan to install Execution Engine for Apache Hadoop as a non-root user, you must grant the non-root user sudo permissions by using the visudo command (see the template and example that follow these steps):

  1. Apply the visudo rules for the non-root user.
  2. su <non-root_user>
  3. sudo yum install <rpm>
  4. sudo chown -R <non-root_user>:<non-root_user> /opt/ibm/dsxhi/
  5. Edit or generate /opt/ibm/dsxhi/conf/dsxhi_install.conf.
  6. cd /opt/ibm/dsxhi/bin
  7. sudo python /opt/ibm/dsxhi/bin/install.py

visudo template:

## DSXHI
<non-root_user> ALL=(root) NOPASSWD: /usr/bin/yum install <path-to-rpm/rpm>, /usr/bin/yum erase dsxhi*, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/python /opt/ibm/dsxhi/*
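
For illustration, with a hypothetical non-root user named dsxhiadmin and the RPM downloaded to /tmp/dsxhi.rpm (both values are placeholders), the rule would look like this:

## DSXHI (hypothetical user and RPM path)
dsxhiadmin ALL=(root) NOPASSWD: /usr/bin/yum install /tmp/dsxhi.rpm, /usr/bin/yum erase dsxhi*, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/python /opt/ibm/dsxhi/*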

Hadoop ecosystem services

Watson Studio interacts with a Hadoop cluster through the following four services:

Service                      Purpose
WebHDFS                      Browse and preview HDFS data.
WebHCAT                      Browse and preview Hive data (Watson Studio Local 1.2.x only).
Jupyter Enterprise Gateway   Submit jobs to JEG on the Hadoop cluster.
Livy for Spark2              Submit jobs to Spark2 on the Hadoop cluster.
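
As a quick check that an endpoint is reachable, you can call the WebHDFS REST API directly. The host name is hypothetical, and the port is an assumption (50070 is the default NameNode HTTP port on Hadoop 2.x; Hadoop 3.x clusters use 9870):

# List /tmp over WebHDFS to confirm that the endpoint responds
curl -i "http://namenode.mycompany.com:50070/webhdfs/v1/tmp?op=LISTSTATUS"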

Watson Studio user

Every user who connects from Watson Studio must be a valid user on the Hadoop cluster. The recommended way to achieve this is to integrate Watson Studio and the Hadoop cluster with the same LDAP server.
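
To confirm that a Watson Studio user resolves to a valid account on the cluster, you can check from any cluster node. The user name jdoe is hypothetical:

# Verify that the user resolves to a local or LDAP account
id jdoe
# Verify the groups that Hadoop resolves for the user
hdfs groups jdoe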

Installing the service

  1. Locate the RPM installer for the service on Passport Advantage.
  2. Run the RPM installer. The RPM is installed in /opt/ibm/dsxhi.
  3. If you’re running the installation as the service user, run sudo chown -R <serviceuser> /opt/ibm/dsxhi.
  4. Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file by using one of the template files in /opt/ibm/dsxhi/conf as a reference: dsxhi_install.conf.template.HDP, dsxhi_install.conf.template.CDH, or dsxhi_install.conf.template.SPECTRUM. See Template parameters for installing the service on Apache Hadoop clusters for more information on the parameters that you can use from these templates.

    Optional: If you need to set additional properties to control the location of Java, use a shared truststore, or pass additional Java options, create a /opt/ibm/dsxhi/conf/dsxhi_env.sh script that exports the environment variables:

      # Path to the Java binary that DSXHI uses
      export JAVA="/usr/jdk64/jdk1.8.0_112/bin/java"
      # Shared truststore that contains the CA certificates
      export JAVA_CACERTS=/etc/pki/java/cacerts
      # Additional Java options that are passed to the DSXHI services
      export DSXHI_JAVA_OPTS="-Djavax.net.ssl.trustStore=$JAVA_CACERTS"

  5. In /opt/ibm/dsxhi/bin, run the ./install.py script to install the service. The script prompts for input on the following options (alternatively, you can specify the options as flags; see the example after step 6):
    • Accept the license terms (Hadoop registration uses the same license as Watson Studio). You can also accept the license through the dsxhi_license_acceptance property in dsxhi_install.conf.
    • Enter the password for the Ambari or Cloudera Manager cluster administrator. The value can also be passed through the --password flag.
    • Enter the master secret for the gateway service. The value can also be passed through the --dsxhi_gateway_master_password flag.
    • If the default password for the Java cacerts truststore was changed, pass the password through the --dsxhi_java_cacerts_password flag.
  6. The installation runs pre-checks to validate the prerequisites. If the cluster_manager_url is not specified in the dsxhi_install.conf file, the pre-checks on the proxyuser settings are not performed.
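
For example, to run the installation non-interactively, you can combine the flags that are described in step 5. The values are placeholders:

cd /opt/ibm/dsxhi/bin
# Pass the administrator password and gateway master secret as flags
# instead of answering the interactive prompts
sudo python ./install.py --password <cluster_admin_password> --dsxhi_gateway_master_password <gateway_master_secret>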

After the service is installed, the necessary components, such as the gateway service and the Hadoop Integration service, and any optional components (such as Livy for Spark 2) are started.

Configuring custom certificates

You can use your existing certificates without having to modify the system truststore. The following configuration properties change how DSXHI handles certificates:

custom_jks
By default, DSXHI generates a keystore, converts it to a .crt, and adds the .crt to the Java truststore. With this property, you can instead provide a custom keystore that is used to generate the required .crt.
dsxhi_cacert
By default, DSXHI detects the appropriate truststore to use as part of the installation. With the dsxhi_cacert property, you can provide a custom truststore (cacerts) to which the DSXHI certificates are added.
add_certs_to_truststore
This property controls whether DSXHI adds the host certificate to the truststore. If you set it to False, DSXHI makes no changes to the truststore, and users must add the host certificate to the truststore themselves. If you set it to True, DSXHI retains its default behavior and adds the host certificate to the Java truststore, and on detected datanodes, for the gateway and web services.
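
The following sketch shows how these properties might appear in dsxhi_install.conf. The paths are hypothetical, and you should verify the exact property names and value formats against the template file for your distribution:

# Provide an existing keystore instead of a DSXHI-generated one (hypothetical path)
custom_jks=/etc/pki/java/mycompany.jks
# Add the DSXHI certificates to this custom truststore (hypothetical path)
dsxhi_cacert=/etc/pki/java/cacerts
# True keeps the default behavior of adding the host certificate to the truststore
add_certs_to_truststore=True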

Learn more

See Uninstalling the service on a Hadoop cluster for information on uninstalling the service.