Installing Execution Engine for Apache Hadoop on Apache Hadoop clusters

Before a project administrator can install Execution Engine for Apache Hadoop on an Apache Hadoop cluster, the service must first be installed on Cloud Pak for Data.

Review and confirm that you meet the following requirements and are aware of the supported Hadoop versions and platforms before you install the service:

System requirements

Edge node hardware requirements

  • 8 GB memory
  • 2 CPU cores
  • 100 GB disk, mounted and available on /var in the local Linux file system. The installation creates the following directories and these locations are not configurable:
    • To store the logs: /var/log/dsxhi.
    • To store the process IDs: /var/run/dsxhi and /var/run/livy.
  • 10 Gb network interface card recommended for multi-tenant environments (1 Gb network interface card if WebHDFS will not be heavily utilized)

Edge node software requirements

  • Python 3.8 or higher
  • curl 7.19.7-53 or later
  • For clusters with Kerberos security enabled, a user keytab and SPNEGO (HTTP) keytab.
  • For clusters without Kerberos security enabled, write permissions for the yarn user for all directories that a YARN job will write to.
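
To confirm the software prerequisites on the edge node, a quick check might look like the following (the keytab paths are placeholders):

# Check the Python and curl levels on the edge node
python3 --version      # expect 3.8 or higher
rpm -q curl            # expect curl-7.19.7-53 or later

# For Kerberos-enabled clusters, confirm the keytabs are present and readable
klist -kt /path/to/dsxhi.keytab
klist -kt /path/to/spnego.keytab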

Ports

  • An external port for the gateway service.
  • An internal port for the Hadoop Integration service.
  • Internal ports for Livy and the Jupyter Enterprise Gateway service for Spark 3.0. This is required only if you want the Execution Engine for Apache Hadoop service to install Livy.
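
Before you assign the ports, you can confirm that the values you plan to use are not already in use on the edge node; for example (the port numbers below are placeholders, not defaults):

# List listening TCP ports and look for the candidate values
ss -ltn | grep -E ':(8443|8082)\b' || echo "candidate ports are free"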

Configuration requirements

  • AUTO TLS is required to run the Jupyter Enterprise Gateway (JEG) service and connect to Spark.

Supported Hadoop versions

  • CDP version 7.0, 7.1 - 7.1.8

Support for the preceding versions of Cloudera Data Platform (CDP) is contingent on public support provided by Cloudera. If a version of CDP reaches its end of service (EoS) date, that version is no longer supported in Execution Engine for Apache Hadoop.

Platforms supported

The Execution Engine for Apache Hadoop service is supported on all x86 platforms supported by CDP versions listed above.

Service user requirements

The Execution Engine for Apache Hadoop service runs as a service user. If you install the service on multiple edge nodes for high availability, the same service user should be used. This user must meet the following requirements:

  • A valid Linux user on the node where the Execution Engine for Apache Hadoop service is installed.

  • Home directory created in HDFS. The directory should have both owner and group assigned as the service user (a sketch of creating this directory follows the keytab example below).

  • Necessary proxy user privileges in Hadoop. For example, to configure the service user as a proxy user for CDP, you'll need to set the following values in the core-site.xml properties for the service user:

    hadoop.proxyuser.<proxy_user>.hosts
    hadoop.proxyuser.<proxy_user>.groups
    hadoop.proxyuser.<proxy_user>.users
    

    See the service user example later in this topic.

Note: For more information on configuring proxy users for HDFS, see the Cloudera documentation.

  • If you're using an existing Livy service running on the Hadoop cluster, the service user should have the necessary super user privileges in Livy services.
  • For a cluster with Kerberos security enabled, you must generate two keytabs to install Execution Engine for Apache Hadoop. The keytab files must be owned by the service user, which eliminates the need for every Watson Studio user to have a valid keytab.

Use the following example as a guide to help you generate a keytab after the Key Distribution Center is configured:

# assumptions for following example
service_user: dsxhi
host_FQDN: ak-cdh716-edge-1.fyre.ibm.com

# service_user keytab
kadmin.local -q "addprinc dsxhi"
kadmin.local -q "xst -norandkey -k dsxhi.keytab dsxhi"

# SPNEGO keytab
kadmin.local -q "addprinc -randkey HTTP/ak-cdh716-edge-1.fyre.ibm.com"
kadmin.local -q "xst -norandkey -k spnego.ak-cdh716-edge-1.keytab HTTP/ak-cdh716-edge-1.fyre.ibm.com"

Installing the Execution Engine for Apache Hadoop service as a service user or non-root user

Installing Execution Engine for Apache Hadoop as a service user

To install the Execution Engine for Apache Hadoop service as a service user, the keytab file must be owned by the service user.

Service user example

Use the following example to set up proxyuser settings in core-site.xml for a service user:

<property>
   <name>hadoop.proxyuser.svc_dsxhi.hosts</name>
   <value>node1.mycompany.com,node2.mycompany.com</value>
</property>
<property>
   <name>hadoop.proxyuser.svc_dsxhi.groups</name>
   <value>groupa,groupb</value>
</property>
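
After core-site.xml is updated, HDFS and YARN must pick up the new proxyuser settings. Depending on how the change was deployed, this means restarting the affected services through the cluster manager or refreshing the configuration; for example:

# Refresh proxyuser settings without a full service restart
# (run as the HDFS/YARN administrative user)
hdfs dfsadmin -refreshSuperUserGroupsConfiguration
yarn rmadmin -refreshSuperUserGroupsConfiguration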

Installing Execution Engine for Apache Hadoop as a non-root user

To install Execution Engine for Apache Hadoop as a non-root user, you must grant the non-root user permissions by using the visudo command (a worked example follows the VISUDO template below):

  1. Apply the visudo rules for the non-root user (see the VISUDO template below).
  2. su <non-root_user>
  3. sudo yum install <rpm>
  4. sudo chown <non-root_user>:<non-root_user> -R /opt/ibm/dsxhi/
  5. Edit or generate /opt/ibm/dsxhi/conf/dsxhi_install.conf.
  6. cd /opt/ibm/dsxhi/bin
  7. sudo python /opt/ibm/dsxhi/bin/install.py

VISUDO Template:

## DSXHI
<non-root_user> ALL=(root) NOPASSWD: /usr/bin/yum install <path-to-rpm/rpm>, /usr/bin/yum erase dsxhi*, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/python /opt/ibm/dsxhi/*
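
As a worked example of the preceding steps, assuming a non-root user named hadoopadmin and a placeholder RPM file name:

# Example sequence for a non-root install
# ("hadoopadmin" and the RPM file name are placeholder values)
su hadoopadmin
sudo yum install /tmp/dsxhi-<version>.rpm
sudo chown hadoopadmin:hadoopadmin -R /opt/ibm/dsxhi/
vi /opt/ibm/dsxhi/conf/dsxhi_install.conf
cd /opt/ibm/dsxhi/bin
sudo python /opt/ibm/dsxhi/bin/install.py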

Installing the service

Important: If your Execution Engine for Apache Hadoop version is 4.0.8 or later, you must have the Execution Engine for Apache Hadoop RPM for 4.0.8 or later installed on the Hadoop cluster.

  1. Locate the RPM installer at Passport Advantage. You'll see the image for the service.
  2. Run the RPM installer. The rpm is installed in /opt/ibm/dsxhi.
  3. If you're running the install as the service user, run sudo chown <serviceuser> -R /opt/ibm/dsxhi.
  4. Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file by using the /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.CDH or /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.SPECTRUM file as a reference. See Template parameters for installing the service on Apache Hadoop clusters for more information on the parameters that you can use from these templates.

Optional: If you need to set additional properties to control the location of Java, use a shared truststore, or pass additional Java options, create a /opt/ibm/dsxhi/conf/dsxhi_env.sh script that exports the environment variables:

export JAVA="/usr/jdk64/jdk1.8.0_112/bin/java"
export JAVA_CACERTS=/etc/pki/java/cacerts
export DSXHI_JAVA_OPTS="-Djavax.net.ssl.trustStore=$JAVA_CACERTS"
  5. In /opt/ibm/dsxhi/bin, run the ./install.py script to install the service. The script prompts for input on the following options (alternatively, you can specify the options as flags; a non-interactive sketch appears after these steps):

    1. Accept the license terms (Hadoop registration uses the same license as Watson Studio). You can also accept the license through the dsxhi_license_acceptance property in dsxhi_install.conf.
    2. You will be prompted for the password of the Ambari or Cloudera Manager cluster administrator or operator. The value can also be passed through the --password flag.
      1. If you are using a non-admin role, such as an operator, complete the following steps:
        1. Create a user with an operator role on a Cloudera cluster, if not already created.
        2. Set the value for the cluster_admin property in dsxhi_install.conf to the Cloudera Manager operator.
    3. You will be prompted for the master secret for the gateway service. The value can also be passed through the --dsxhi_gateway_master_password flag.
    4. If the default password for Java cacerts truststore was changed, the password can be passed through the --dsxhi_java_cacerts_password flag.
  6. The installation runs pre-checks to validate the prerequisites. If the cluster_manager_url is not specified in the dsxhi_install.conf file, the pre-checks on the proxyuser settings are not performed.

After the service is installed, the necessary components, such as the gateway service and the Hadoop Integration service, and the optional components (such as Livy for Spark) are started.
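
For a non-interactive run, the prompts described in step 5 can be supplied as flags instead; a sketch with placeholder values:

# Non-interactive example; replace the placeholder values
# (license acceptance can be set through dsxhi_license_acceptance in dsxhi_install.conf)
cd /opt/ibm/dsxhi/bin
./install.py \
  --password <cluster_admin_password> \
  --dsxhi_gateway_master_password <gateway_master_secret> \
  --dsxhi_java_cacerts_password <cacerts_password>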

  • Adding certificates for SSL-enabled services

    If WebHDFS SSL is enabled after the service is installed, the Hadoop admin should add the certificates for each namenode and datanode to the trust store of the gateway service and update the topology files.

    • In /opt/ibm/dsxhi/bin/util, run ./add_cert.sh https://host:port <cacert_password> for each of the namenodes and datanodes.

    • Manually update /opt/ibm/dsxhi/gateway/conf/topologies*.xml to use the HTTPS URL and port, using the following example:

      <service>
        <role>WEBHDFS</role>
        <url>https://NamenodeHOST:PORT</url>
      </service>
      
  • Adding conda channels for Cloud Pak for Data image push operations and dynamic package installation

    When a Cloud Pak for Data administrator pushes a runtime image to Hadoop using the Hadoop push type, package resolution occurs using conda rc files that are installed with the HI rpm. The conda rc files exist at the following locations. These files are also used by the "hi_core_utils.install_packages()" utility method.

    • /user/<dsxhi-svc-user>/environments/conda/conda_rc_x86_64.yaml
    • /user/<dsxhi-svc-user>/environments/conda/conda_rc_ppc64le.yaml

    A Hadoop admin can edit the conda rc files to add more channels, which is useful when:

    • The pushable runtime images from Cloud Pak for Data require additional channels for package resolution. The Hadoop administrator might need to do this if the Hadoop image push operations are failing with package resolution errors.

    • One or more packages that a user would like to install with "hi_core_utils.install_packages()" requires additional channels. If the Hadoop administrator wants to expose those channels for all users of the system, the administrator can add the necessary channels to the conda rc files indicated above.

    When editing the files, the Hadoop administrator should ensure that each file continues to be owned by the DSXHI service user and retains 644 permissions (a sketch of this workflow follows the note below).

    Note: In general, adding channels to these conda rc files increases the time it takes to resolve packages and can increase the time it takes for the relevant operations to complete.
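
    A sketch of how a Hadoop admin might edit one of these files while preserving ownership and permissions ("dsxhi" is an example service user name):

      # Pull the conda rc file out of HDFS, edit it, and put it back
      hdfs dfs -get /user/dsxhi/environments/conda/conda_rc_x86_64.yaml .
      vi conda_rc_x86_64.yaml   # add the extra channels under the channels: list
      hdfs dfs -put -f conda_rc_x86_64.yaml /user/dsxhi/environments/conda/
      hdfs dfs -chown dsxhi:dsxhi /user/dsxhi/environments/conda/conda_rc_x86_64.yaml
      hdfs dfs -chmod 644 /user/dsxhi/environments/conda/conda_rc_x86_64.yaml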

Restoring the installation from a previous release

Use the following procedure if installing the new release of the service fails. The procedure guides you through restoring the previous release of the service.

  1. Run the uninstall script on the newer version:

    cd /opt/ibm/dsxhi/bin
    ./uninstall.py
    
  2. Create a backup of the install.conf file:

    cp /opt/ibm/dsxhi/conf/dsxhi_install.conf /tmp/dsxhi_install.conf
    
  3. Remove the installation package:

    yum erase -y dsxhi
    
  4. Install the old HEE package:

    yum install -y <old_dsxhi_rpm_file>
    
  5. Replace the dsxhi_install.conf in the conf directory:

    cp /tmp/dsxhi_install.conf /opt/ibm/dsxhi/conf/
    
  6. Run the install:

    cd /opt/ibm/dsxhi/bin
    ./install.py
    
  7. Add the Watson Studio cluster URL:

    cd /opt/ibm/dsxhi/bin
    ./manage_known_dsx -a <wsl_cluster_URL_to_add>
    
  8. From Watson Studio, register the service from the Systems Integration tab in the Platform Configuration section.

Configuring custom certificates

You can use your existing certificates without having to modify the system truststore. The following configuration properties let you customize how DSXHI handles keystores and truststores:

custom_jks
DSXHI typically generates a keystore, converts it to a .crt, and adds the .crt to the Java truststore. With the custom_jks property, you can instead provide a custom keystore that is used to generate the required .crt.
dsxhi_cacert
By default, DSXHI detects the appropriate truststore to use as part of the installation. With the dsxhi_cacert property, you can provide a custom truststore (cacerts) to which the DSXHI certificates are added.
add_certs_to_truststore
This property determines whether you add the host certificate to the truststore yourself or DSXHI adds it. If you set the property to False, users must add the host certificate to the truststore themselves; DSXHI doesn't make any changes to the truststore. If you set the property to True, DSXHI retains its default behavior of adding the host certificates for the gateway and web services, and for detected datanodes, to the Java truststore.
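
As an illustration, these properties might be set in dsxhi_install.conf along the following lines; the key=value layout and paths are assumptions, so check the installation template for the exact syntax:

# Illustrative values only; verify against the dsxhi_install.conf template
custom_jks=/path/to/custom-keystore.jks
dsxhi_cacert=/etc/pki/java/cacerts
add_certs_to_truststore=True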

Parent topic: Installing Execution Engine for Apache Hadoop