Installing Execution Engine for Apache Hadoop on Apache Hadoop clusters
To integrate Watson Studio with your remote Apache Hadoop cluster, install and configure the Execution Engine for Apache Hadoop service on your Hadoop cluster.
- System requirements
- Installing the service as a root user or non-root user
- Installing the service
- Restoring the installation from a previous release
- Configuring custom certificates
Before you begin
You must complete the following prerequisites:
- Install Execution Engine on both Cloud Pak for Data and the Hadoop cluster. For more information about installing Execution Engine for Apache Hadoop on Cloud Pak for Data, see Installing Execution Engine for Apache Hadoop.
- Review and confirm that you meet the system requirements and are aware of the supported Hadoop versions and platforms before you install the service.
System requirements
You must ensure that the following system requirements are met before you install Execution Engine for Apache Hadoop on your Hadoop cluster.
- Edge node hardware requirements
  - 8 GB memory
  - 2 CPU cores
  - 100 GB disk, mounted and available on /var in the local Linux file system. The installation creates the following directories, and these locations are not configurable:
    - To store the logs: /var/log/dsxhi
    - To store the process IDs: /var/run/dsxhi and /var/run/livy
  - 10 GB network interface card recommended for multi-tenant environments (a 1 GB network interface card is sufficient if WebHDFS will not be heavily utilized)
- Edge node software requirements
  - Java 1.8
  - Python 3.8 or higher
  - curl 7.19.7-53 or later
  - For clusters with Kerberos security enabled, a user keytab and a SPNEGO (HTTP) keytab.
  - For clusters without Kerberos security enabled, write permissions for the yarn user for all directories that a YARN job will write to.
  - To ensure compatibility with Runtime 25.1, Jupyter Enterprise Gateway (JEG) requires a Cloudera Hadoop cluster that runs on Red Hat Enterprise Linux 9. Before you install JEG, verify that the following RPM packages are installed on all nodes in the cluster: python3.12, freetype, libgomp, and libpq. (A quick verification sketch follows at the end of this section.)
Support for the preceding versions of Red Hat Enterprise Linux (RHEL) is contingent on public support provided by Red Hat. If a version of RHEL reaches its end of service (EoS) date, that version will no longer be supported in Execution Engine for Apache Hadoop.
- Ports
  - An external port for the gateway service.
  - An internal port for the Hadoop Integration service.
  - Internal ports for the Livy and Jupyter Enterprise Gateway services for Spark 3.0. These ports are required only if you want the Execution Engine for Apache Hadoop service to install Livy.
  If you use an existing Livy service on the Hadoop cluster, you can expect some issues; consider using the Livy service that is shipped with Execution Engine for Apache Hadoop instead. For more information, see Apache Issue - Livy.
- Configuration requirements
  - Auto-TLS is required to run the Jupyter Enterprise Gateway (JEG) service and connect to Spark.
- Supported Hadoop versions
  - Cloudera Data Platform (CDP) versions 7.1.7-SP1, 7.1.7-SP2, 7.1.7-SP3, 7.1.9, 7.1.9-SP1, and 7.3.1
Support for the preceding versions of Cloudera Data Platform (CDP) is contingent on public support provided by Cloudera. If a version of CDP reaches its end of life (EoL) date, that version will no longer be supported in Execution Engine for Apache Hadoop.
- Platforms supported
- The Execution Engine for Apache Hadoop service is supported on all x86 platforms supported by Cloudera Data Platform (CDP) versions listed above.
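To help confirm that an edge node meets these requirements before you run the installer, you can use a quick spot check such as the following shell sketch. It is only a convenience check, not part of the product, and the gateway port number used in the last step is an assumption; adjust the values to your environment.
#!/bin/bash
# Non-authoritative spot check of the edge node prerequisites described above.
echo "== Memory (expect >= 8 GB) and CPU (expect >= 2 cores) =="
free -g | awk '/^Mem:/ {print "Memory (GB): " $2}'
echo "CPU cores: $(nproc)"
echo "== Disk available on /var (expect >= 100 GB) =="
df -h /var
echo "== Software versions =="
java -version 2>&1 | head -n 1          # expect Java 1.8
python3 --version                        # expect Python 3.8 or higher
curl --version | head -n 1               # expect curl 7.19.7-53 or later
echo "== RPM packages required for Runtime 25.1 JEG on RHEL 9 =="
for pkg in python3.12 freetype libgomp libpq; do
  rpm -q "$pkg" || echo "MISSING: $pkg"
done
echo "== Example: confirm a candidate gateway port is free (8443 is an assumption) =="
ss -ltn | grep -w 8443 || echo "Port 8443 appears to be free"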
Root user requirements
The Execution Engine for Apache Hadoop service runs as a root user. If you install the service on multiple edge nodes for high availability, the same root user should be used. This user must meet the following requirements:
- A valid Linux user on the node where the Execution Engine for Apache Hadoop service is installed.
- A home directory created in HDFS. The directory should have both owner and group assigned as the root user.
- The necessary proxy user privileges in Hadoop. For example, to configure the root user as a proxy user for CDP, you need to set the following values in the core-site.xml properties for the root user (see the core-site.xml sketch after this list):
  - hadoop.proxyuser.<proxy_user>.hosts
  - hadoop.proxyuser.<proxy_user>.groups
  - hadoop.proxyuser.<proxy_user>.users
- If you're using an existing Livy service running on the Hadoop cluster, the root user should have the necessary superuser privileges in the Livy service.
- For a cluster with Kerberos security enabled, you must generate two keytabs to install Execution Engine for Apache Hadoop. The keytab files must be owned by the root user, which eliminates the need for every Watson Studio user to have a valid keytab.
  Use the following example as a guide to help you generate the keytabs after the Key Distribution Center is configured:
# assumptions for following example
service_user: dsxhi
host_FQDN: *.examplehost.example.com
# service_user keytab
kadmin.local -q "addprinc dsxhi"
kadmin.local -q "xst -norandkey -k dsxhi.keytab dsxhi"
# SPNEGO keytab
kadmin.local -q "addprinc -randkey HTTP/*.examplehost.example.com"
kadmin.local -q "xst -norandkey -k spnego.*.examplehost.keytab HTTP/*.examplehost.example.com"
Setting up the installation user (root user or non-root user)
To install the Execution Engine for Apache Hadoop service as a root user, the keytab file must be owned by the root user:
- yum install <rpm>
- chown <service_user> -R /opt/ibm/dsxhi/
- Edit or generate /opt/ibm/dsxhi/conf/dsxhi_install.conf
- cd /opt/ibm/dsxhi/bin
- python /opt/ibm/dsxhi/bin/install.py
To install Execution Engine for Apache Hadoop as a non-root user, you must grant the non-root user permissions using the visudo command:
- Apply the visudo rules for the non-root user (see the template below)
- su <non-root_user>
- sudo yum install <rpm>
- sudo chown <non-root_user>:<non-root_user> -R /opt/ibm/dsxhi/
- Edit or generate /opt/ibm/dsxhi/conf/dsxhi_install.conf
- cd /opt/ibm/dsxhi/bin
- sudo python /opt/ibm/dsxhi/bin/install.py
VISUDO Template:
## DSXHI
<non-root_user> ALL=(root) NOPASSWD: /usr/bin/yum install <path-to-rpm/rpm>, /usr/bin/yum erase dsxhi*, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/python /opt/ibm/dsxhi/*
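As an illustration only, the following sketch shows the preceding steps filled in for a hypothetical non-root user hdpadmin and an assumed RPM file name; substitute the actual user and the RPM that you downloaded.
# Hypothetical non-root install as user "hdpadmin"; adjust names and paths to your environment.
## /etc/sudoers entry added with visudo (DSXHI section)
# hdpadmin ALL=(root) NOPASSWD: /usr/bin/yum install /tmp/dsxhi*.rpm, /usr/bin/yum erase dsxhi*, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/python /opt/ibm/dsxhi/*
su hdpadmin
sudo yum install /tmp/dsxhi_<version>.rpm        # assumed download location and file name
sudo chown hdpadmin:hdpadmin -R /opt/ibm/dsxhi/
vi /opt/ibm/dsxhi/conf/dsxhi_install.conf        # edit or generate the install configuration
cd /opt/ibm/dsxhi/bin
sudo python /opt/ibm/dsxhi/bin/install.py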
Installing the service
You must have administrator permissions to complete these steps.
To install the Execution Engine for Apache Hadoop service on the Hadoop cluster:
- Locate the RPM installer at Passport Advantage. You'll see the image for the service.
- Run the RPM installer. The RPM is installed in /opt/ibm/dsxhi.
- If you're running the installation as the root user, run sudo chown <serviceuser> -R /opt/ibm/dsxhi.
- Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file by using the /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.CDH file as a reference. See Template parameters for installing the service on Apache Hadoop clusters for more information on the parameters that you can use from the dsxhi_install.conf.template.CDH template.
- Optional: If you need to set additional properties to control the location of Java, to use a shared truststore, or to pass additional Java options, create a /opt/ibm/dsxhi/conf/dsxhi_env.sh script that exports the environment variables:
  export JAVA="<JAVA_PATH>/bin/java"
  export JAVA_CACERTS=/etc/pki/java/cacerts
  export DSXHI_JAVA_OPTS="-Djavax.net.ssl.trustStore=$JAVA_CACERTS"
- The installation runs pre-checks to validate the prerequisites. If cluster_manager_url is not specified in the dsxhi_install.conf file, the pre-checks on the proxy user settings are not performed. After the service is installed, the necessary components, such as the gateway service and the Hadoop Integration service, and the optional components (Livy for Spark) are started.
- If WebHDFS SSL is enabled after the service is installed, the Hadoop admin should add the certificates for each namenode and datanode to the truststore of the gateway service and update the topology files:
  - In /opt/ibm/dsxhi/bin/util, run ./add_cert.sh https://host:port <cacert_password> for each of the namenodes and datanodes.
  - Manually update /opt/ibm/dsxhi/gateway/conf/topologies*.xml to use the HTTPS URL and port, following this example:
    <service>
      <role>WEBHDFS</role>
      <url>https://NamenodeHOST:PORT</url>
    </service>
- Optional: Add conda channels for Cloud Pak for Data image push operations and dynamic package installation.
When a Cloud Pak for Data administrator pushes a runtime image to Hadoop using the Hadoop push type, package resolution occurs using conda rc files that are installed with the HI RPM. These files are also used by the hi_core_utils.install_packages() utility method. The conda rc files exist at the following locations:
  /user/<dsxhi-svc-user>/environments/conda/conda_rc_x86_64.yaml
  /user/<dsxhi-svc-user>/environments/conda/conda_rc_ppc64le.yaml
A Hadoop admin can edit the conda rc files to add more channels, which is useful when:
- The pushable runtime images from Cloud Pak for Data require additional channels for package resolution. The Hadoop administrator might need to do this if Hadoop image push operations are failing with package resolution errors.
- One or more packages that a user wants to install with hi_core_utils.install_packages() require additional channels. If the Hadoop administrator wants to expose those channels for all users of the system, the administrator can add the necessary channels to the conda rc files indicated above.
When editing the files, the Hadoop administrator should ensure that each file continues to be owned by the DSXHI root user and retains 644 permissions. A sketch of this workflow follows.
Note: In general, adding channels to these conda rc files increases the time it takes to resolve packages and can increase the time it takes for the relevant operations to complete.
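The following is a minimal sketch of that editing workflow, assuming the service user is dsxhi and the extra channel is conda-forge; the exact YAML keys depend on the conda rc file that ships with your HI RPM, so inspect the file before editing it.
# Hypothetical workflow for adding a conda channel to the x86_64 conda rc file.
# Assumes the DSXHI service user is "dsxhi" and the channel is "conda-forge".
# 1. Copy the file out of HDFS.
hdfs dfs -get /user/dsxhi/environments/conda/conda_rc_x86_64.yaml /tmp/conda_rc_x86_64.yaml
# 2. Edit the file and add the channel under the existing "channels:" list, for example:
#      channels:
#        - defaults
#        - conda-forge
vi /tmp/conda_rc_x86_64.yaml
# 3. Put the edited file back, keeping ownership and 644 permissions.
hdfs dfs -put -f /tmp/conda_rc_x86_64.yaml /user/dsxhi/environments/conda/conda_rc_x86_64.yaml
hdfs dfs -chown dsxhi:dsxhi /user/dsxhi/environments/conda/conda_rc_x86_64.yaml
hdfs dfs -chmod 644 /user/dsxhi/environments/conda/conda_rc_x86_64.yaml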
Restoring the installation from a previous release
Use the following procedure if installing a new release of the service fails. The procedure guides you through restoring the previous release of the service.
- Run the uninstall script on the newer version:
  cd /opt/ibm/dsxhi/bin
  ./uninstall.py
- Create a backup of the dsxhi_install.conf file:
  cp /opt/ibm/dsxhi/conf/dsxhi_install.conf /tmp/dsxhi_install.conf
- Remove the installation package:
  yum erase -y dsxhi
- Install the old Execution Engine for Apache Hadoop (HEE) package:
  yum install -y <old_dsxhi_rpm_file>
- Replace the dsxhi_install.conf file in the conf directory:
  cp /tmp/dsxhi_install.conf /opt/ibm/dsxhi/conf/
- Run the installation:
  cd /opt/ibm/dsxhi/bin
  ./install.py
- Add the Watson Studio cluster URL:
  cd /opt/ibm/dsxhi/bin
  ./manage_known_dsx -a <wsl_cluster_URL_to_add>
- From Watson Studio, register the service from the Hadoop Execution Engine tab in the Configuring and settings section.
Configuring custom certificates
You can use your existing certificates without modifying the system truststore. The following configuration properties customize how DSXHI handles certificates (a sketch showing these properties in dsxhi_install.conf follows the descriptions):
- custom_jks
  By default, DSXHI generates a keystore, converts it to a .crt, and adds the .crt to the Java truststore. With this property, DSXHI allows you to provide a custom keystore that is used to generate the required .crt.
- dsxhi_cacert
  Previously, DSXHI detected the appropriate truststore to use as part of the installation. With the dsxhi_cacert property, you can provide any custom truststore (cacerts) to which the DSXHI certificates are added.
- add_certs_to_truststore
  This property determines whether you add the host certificate to the truststore yourself or DSXHI adds it. If you set the property to False, users must add the host certificate to the truststore themselves; DSXHI doesn't make any changes to the truststore. If you set the property to True, DSXHI retains its default behavior of adding the host certificate to the Java truststore and to detected datanodes for the gateway and web services.
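As a hypothetical illustration, these properties might appear in /opt/ibm/dsxhi/conf/dsxhi_install.conf as shown below; the exact property syntax and defaults are defined by the dsxhi_install.conf.template.CDH template shipped with the RPM, and the paths here are assumptions.
# Hypothetical sketch of the custom-certificate properties in dsxhi_install.conf.
# Check dsxhi_install.conf.template.CDH for the exact syntax and defaults.
# Path to a custom keystore used to generate the required .crt (assumed path).
custom_jks=/opt/ibm/dsxhi/certs/mykeystore.jks
# Custom truststore (cacerts) to which the DSXHI certificates are added (assumed path).
dsxhi_cacert=/etc/pki/java/cacerts
# Set to False if you will add host certificates to the truststore yourself.
add_certs_to_truststore=False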
Learn more
- Administering Cloud Pak for Data clusters
- Administering Apache Hadoop clusters
- Uninstalling the service on a Hadoop cluster