Installing the service on Apache Hadoop clusters
Before a project administrator can install Execution Engine for Apache Hadoop on the Hadoop cluster, the service must first be installed on Cloud Pak for Data. Before you install the service, confirm that you meet the following requirements and review the ecosystem services and the supported Hadoop versions and platforms:
- System requirements
- Requirements for a service user installing the service
- Hadoop ecosystem services
- Supported Hadoop versions and platforms
- Installing the service
- Configuring custom certificates
Requirements
System requirements for installing Execution Engine for Apache Hadoop
Edge node hardware requirements
- 8 GB memory
- 2 CPU cores
- 100 GB disk, mounted and available on /var in the local Linux file system. The installation creates the following directories, and these locations are not configurable:
  - To store the logs: /var/log/dsxhi and /var/log/livy2.
  - To store the process IDs: /var/run/dsxhi and /var/run/livy2.
- 10 Gb network interface card recommended for multi-tenant environments (a 1 Gb network interface card is sufficient if WebHDFS will not be heavily used)
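A quick pre-flight check can confirm that an edge node meets these minimums before you run the installer. The following is a minimal sketch, not part of the product: the function names `meets_minimum` and `check_edge_node_hardware` are hypothetical, and the script assumes a Linux host where /proc/meminfo, nproc, and df are available. It compares total capacity rather than free capacity, and it rounds memory to the nearest GB because the kernel reports slightly less than the installed amount.

```shell
# Hypothetical pre-flight helper; not shipped with the service.

# Print OK or FAIL for one requirement; return non-zero on failure.
# usage: meets_minimum <label> <actual> <required>
meets_minimum() {
  if [ "$2" -ge "$3" ]; then
    echo "OK   $1: $2 (required: $3)"
  else
    echo "FAIL $1: $2 (required: $3)"
    return 1
  fi
}

# Measure this host and compare it against the documented minimums.
check_edge_node_hardware() {
  local mem_gb cores var_gb status=0
  # MemTotal is reported in kB; round to the nearest GB.
  mem_gb=$(awk '/^MemTotal:/ {printf "%d", $2 / 1048576 + 0.5}' /proc/meminfo)
  cores=$(nproc)
  # Total size, in GB, of the file system that holds /var.
  var_gb=$(df -Pk /var | awk 'NR == 2 {printf "%d", $2 / 1048576}')

  meets_minimum "memory (GB)" "$mem_gb" 8 || status=1
  meets_minimum "CPU cores" "$cores" 2 || status=1
  meets_minimum "/var file system (GB)" "$var_gb" 100 || status=1
  return "$status"
}
```

Run `check_edge_node_hardware` on each edge node. A non-zero return status means that at least one minimum is not met. The network interface speed is not checked here.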
Edge node software requirements
- Python 2.7
- CDH only: Have Java Development Kit Version 1.8 installed.
- curl 7.19.7-53
- CDH only: HDFS Gateway Role, YARN Gateway Role, Hive Gateway Role, Spark2 Gateway Role.
- For clusters with Kerberos security enabled, a SPNEGO keytab.
- For clusters without Kerberos security enabled, write permissions for the yarn user for all directories that a YARN job will write to.
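Before installing, it can help to confirm that the command-line tools named in this list are on the PATH of the edge node. The following is a minimal sketch; `missing_commands` is a hypothetical helper, and it only checks that each command exists. Version numbers still need to be compared by hand.

```shell
# Hypothetical helper; not shipped with the service.

# Print each named command that is not on the PATH.
# Returns non-zero if any command is missing.
# usage: missing_commands <command>...
missing_commands() {
  local cmd status=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "missing: $cmd"
      status=1
    fi
  done
  return "$status"
}

# Example (not run here): check the tools, then print their versions
# for manual comparison with the list above. Java applies to CDH only.
#   missing_commands python curl java && {
#     python --version
#     curl --version | head -n 1
#     java -version
#   }
```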
Ports
- An external port for the gateway service.
- An internal port for the Hadoop Integration service.
- Internal ports for Livy for Spark2 and the Jupyter Enterprise Gateway service. These ports are required only if you want the Execution Engine for Apache Hadoop service to install Livy.
Supported Hadoop versions
CDH
- CDH version 6.2
- Spark 2.2.0 is required to use the Livy that is included with the Execution Engine for Apache Hadoop service.
Platforms supported
The Execution Engine for Apache Hadoop service is supported on all x86 platforms supported by the HDP and CDH versions listed above.
Service user requirements
The Execution Engine for Apache Hadoop service runs as a service user. If you install the service on multiple edge nodes for high availability, the same service user should be used on every node. This user needs to meet the following requirements:
- A valid Linux user on the node where the Execution Engine for Apache Hadoop service is installed.
- A home directory created in HDFS. The directory should have both owner and group assigned as the service user.
- Necessary proxy user privileges in Hadoop. For example, to configure a service user as a proxy user for HDFS on CDH, you’ll need to set the following properties in core-site.xml for the service user:
  - hadoop.proxyuser.<proxy_user>.hosts
  - hadoop.proxyuser.<proxy_user>.groups
  - hadoop.proxyuser.<proxy_user>.users
  See the Service user example section for sample values.
  Note: For more information on configuring proxy users for HDFS, see the HDP and CDH documentation.
- Necessary proxy user privileges in WebHCAT if you’re enabling WebHCAT. For example, to configure the service user as a proxy user for WebHCAT on CDH, you’ll need to set the following properties in core-site.xml:
  - hadoop.proxyuser.HTTP.hosts
  - hadoop.proxyuser.HTTP.groups
  Note: For more information on configuring proxy users for WebHCAT, see the HDP and CDH documentation.
- If you’re using an existing Livy service running on the Hadoop cluster, the service user should have the necessary superuser privileges in the Livy services.
- HDP only: If Hadoop or Ranger KMS is enabled, the service user should have the necessary proxyuser privileges in kms-site.xml.
- For a cluster with Kerberos security enabled, the keytab file must be owned by the service user. This eliminates the need for every Watson Studio user to have a valid keytab.
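The HDFS home directory requirement in the list above can be met with the standard `hdfs dfs` commands. The following is a minimal sketch under stated assumptions: the function name `create_hdfs_home` is hypothetical, the home directory follows the usual /user/<name> convention, a group with the same name as the service user exists, and the commands run as a user with HDFS superuser rights.

```shell
# Hypothetical helper; not shipped with the service.

# Create /user/<service_user> in HDFS and assign the service user as
# both owner and group. Run as a user with HDFS superuser rights.
# usage: create_hdfs_home <service_user>
create_hdfs_home() {
  local user="$1"
  local home="/user/${user}"
  hdfs dfs -mkdir -p "$home" &&
    hdfs dfs -chown "${user}:${user}" "$home" &&
    hdfs dfs -ls -d "$home"   # show the resulting owner and group
}

# Example (not run here; svc_dsxhi is the sample service user name):
#   create_hdfs_home svc_dsxhi
```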
Requirements for a service user installing the Execution Engine for Apache Hadoop service
If you plan to install the Execution Engine for Apache Hadoop service as the service user, the keytab file must be owned by the service user.
Service user example
Use the following example to set up proxyuser settings in core-site.xml for a service user:
<property>
  <name>hadoop.proxyuser.svc_dsxhi.hosts</name>
  <value>node1.mycompany.com,node2.mycompany.com</value>
</property>
<property>
  <name>hadoop.proxyuser.svc_dsxhi.groups</name>
  <value>groupa,groupb</value>
</property>
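After the proxyuser properties are set and the client configuration is redeployed, the values can be read back with `hdfs getconf -confKey`. The following is a minimal sketch; `show_proxyuser_settings` is a hypothetical helper name. Note that `hdfs getconf` reads the configuration that is visible on the node where it runs, so this confirms the client configuration on the edge node and does not prove that the NameNode has loaded the same values.

```shell
# Hypothetical helper; not shipped with the service.

# Print the proxyuser hosts and groups settings for a service user,
# as seen by the Hadoop client configuration on this node.
# usage: show_proxyuser_settings <service_user>
show_proxyuser_settings() {
  local user="$1" suffix key value
  for suffix in hosts groups; do
    key="hadoop.proxyuser.${user}.${suffix}"
    if value=$(hdfs getconf -confKey "$key" 2>/dev/null); then
      echo "${key}=${value}"
    else
      echo "${key} is not set"
    fi
  done
}

# Example (not run here):
#   show_proxyuser_settings svc_dsxhi
```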
Steps for DSXHI non-root installation:
If you plan to install Execution Engine for Apache Hadoop as a non-root user, you’ll need to grant the non-root user permissions using the visudo command:
- Apply the visudo rules for the non-root user.
- su <non-root_user>
- sudo yum install <rpm>
- sudo chown <non-root_user:non-root_user> -R /opt/ibm/dsxhi/
- Edit or generate /opt/ibm/dsxhi/conf/dsxhi_install.conf.
- cd /opt/ibm/dsxhi/bin
- sudo python /opt/ibm/dsxhi/bin/install.py
VISUDO Template:
## DSXHI
<non-root_user> ALL=(root) NOPASSWD: /usr/bin/yum install <path-to-rpm/rpm>, /usr/bin/yum erase dsxhi*, /usr/bin/chown * /opt/ibm/dsxhi/, /usr/bin/python /opt/ibm/dsxhi/*
Hadoop ecosystem services
Watson Studio interacts with a Hadoop cluster through the following four services:
Service | Purpose |
---|---|
WebHDFS | Browse and preview HDFS data |
WebHCAT | Browse and preview Hive data (Watson Studio Local 1.2.x only) |
Jupyter Enterprise Gateway | Submit jobs to JEG on the Hadoop cluster. |
Livy for Spark2 | Submit jobs to Spark2 on the Hadoop cluster. |
Watson Studio user
Every user who connects from Watson Studio must be a valid user on the Hadoop cluster. The recommended way to achieve this is to integrate Watson Studio and the Hadoop cluster with the same LDAP.
Installing the service
- Locate the RPM installer at Passport Advantage. You’ll see the image for the service.
- Run the RPM installer. The rpm is installed in /opt/ibm/dsxhi.
- If you’re running the install as the service user, run sudo chown <serviceuser> -R /opt/ibm/dsxhi.
- Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file by using the /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.HDP, /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.CDH, or /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.SPECTRUM files as a reference. See Template parameters for installing the service on Apache Hadoop clusters for more information on the parameters that you can use from the dsxhi_install.conf.template.HDP, dsxhi_install.conf.template.CDH, and dsxhi_install.conf.template.SPECTRUM templates.
- Optional: If you need to set additional properties to control the location of Java, use a shared truststore, or pass additional Java options, create a /opt/ibm/dsxhi/conf/dsxhi_env.sh script to export the environment variables:
    export JAVA="/usr/jdk64/jdk1.8.0_112/bin/java"
    export JAVA_CACERTS=/etc/pki/java/cacerts
    export DSXHI_JAVA_OPTS="-Djavax.net.ssl.trustStore=$JAVA_CACERTS"
- In /opt/ibm/dsxhi/bin, run the ./install.py script to install the service. The script prompts for inputs on the following options (alternatively, you can specify the options as flags):
  - Accept the license terms (Hadoop registration uses the same license as Watson Studio). You can also accept the license through the dsxhi_license_acceptance property in dsxhi_install.conf.
  - You will be prompted for the password for the cluster administrator of the Ambari or Cloudera Manager. The value can also be passed through the --password flag.
  - For the master secret for the gateway service, the value can also be passed through the --dsxhi_gateway_master_password flag.
  - If the default password for Java cacerts truststore was changed, the password can be passed through the --dsxhi_java_cacerts_password flag.
- The installation will run pre-checks to validate the prerequisites. If the cluster_manager_url is not specified in the dsxhi_install.conf file, then the pre-checks on the proxyuser settings will not be performed.
After the service is installed, the necessary components, such as the gateway service and the Hadoop Integration service and optional components (Livy for Spark 2) will be started.
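For unattended installs, the password prompts described above can be answered with the documented flags. The following is a minimal sketch under stated assumptions: the function name `run_install`, the environment variable names, and the `DSXHI_INSTALLER` override are all hypothetical, and it assumes that install.py accepts each flag with its value as the next argument. Passing secrets on the command line makes them visible to other local users through the process list, so treat this as an illustration rather than a recommended practice.

```shell
# Hypothetical wrapper; variable names are assumptions, not product settings.

# Run the installer with the documented password flags.
# Call it from /opt/ibm/dsxhi/bin. Set DSXHI_INSTALLER=echo for a dry
# run that prints the arguments instead of starting the installer.
run_install() {
  set -- \
    --password "${CLUSTER_ADMIN_PASSWORD:?set CLUSTER_ADMIN_PASSWORD}" \
    --dsxhi_gateway_master_password "${GATEWAY_MASTER_SECRET:?set GATEWAY_MASTER_SECRET}"
  # Only needed if the default Java cacerts password was changed.
  if [ -n "${JAVA_CACERTS_PASSWORD:-}" ]; then
    set -- "$@" --dsxhi_java_cacerts_password "$JAVA_CACERTS_PASSWORD"
  fi
  "${DSXHI_INSTALLER:-./install.py}" "$@"
}

# Example (not run here):
#   cd /opt/ibm/dsxhi/bin && run_install
```

License acceptance is not covered by this sketch. The source describes it as an interactive prompt or as the dsxhi_license_acceptance property in dsxhi_install.conf.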
Adding certificates for SSL-enabled services
If WebHDFS SSL is enabled after the service is installed, the Hadoop admin should add the certificates for each namenode and datanode to the trust store of the gateway service and update the topology files.
- In /opt/ibm/dsxhi/bin/util, run ./add_cert.sh https://host:port <cacert_password> for each of the namenodes and datanodes.
- Manually update /opt/ibm/dsxhi/gateway/conf/topologies*.xml to use the HTTPS URL and port, as in the following example:
    <service>
      <role>WEBHDFS</role>
      <url>https://NamenodeHOST:PORT</url>
    </service>
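Because add_cert.sh must be run once per namenode and datanode, a small loop saves repetition on larger clusters. The following is a minimal sketch; the function name `add_certs` and the `DSXHI_UTIL_DIR` variable are hypothetical, and the host names and ports in the example comment are placeholders for your own HTTPS endpoints.

```shell
# Hypothetical helper; not shipped with the service.
DSXHI_UTIL_DIR=${DSXHI_UTIL_DIR:-/opt/ibm/dsxhi/bin/util}

# Run add_cert.sh once per endpoint, stopping at the first failure.
# usage: add_certs <cacert_password> <host:port>...
add_certs() {
  local password="$1" endpoint
  shift
  for endpoint in "$@"; do
    echo "Adding certificate for https://${endpoint}"
    ( cd "$DSXHI_UTIL_DIR" &&
      ./add_cert.sh "https://${endpoint}" "$password" ) || return 1
  done
}

# Example (not run here; hosts and ports are placeholders):
#   add_certs "$CACERT_PASSWORD" nn1.mycompany.com:9871 dn1.mycompany.com:9865
```

The topology files still need to be updated by hand afterward, as described in the second step.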
Adding conda channels for Cloud Pak for Data image push operations and dynamic package installation
When a Cloud Pak for Data administrator pushes a runtime image to Hadoop using the Hadoop push type, package resolution occurs using conda rc files that are installed with the HI rpm. The conda rc files exist at the following locations. These files are also used by the “hi_core_utils.install_packages()” utility method.
/user/<dsxhi-svc-user>/environments/conda/conda_rc_x86_64.yaml
/user/<dsxhi-svc-user>/environments/conda/conda_rc_ppc64le.yaml
A Hadoop admin can edit the conda rc files to add more channels, which is useful when:
- The pushable runtime images from Cloud Pak for Data require additional channels for package resolution. The Hadoop administrator might need to do this if the Hadoop image push operations are failing with package resolution errors.
- One or more packages that a user would like to install with “hi_core_utils.install_packages()” require additional channels. If the Hadoop administrator wants to expose those channels for all users of the system, the administrator can add the necessary channels to the conda rc files indicated above.
When editing the files, the Hadoop administrator should ensure that each file continues to be owned by the DSXHI service user and retains 644 permissions.
Note: In general, adding channels to these conda rc files increases the time it takes to resolve packages and can increase the time it takes for the relevant operations to complete.
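The edit can be done as a download, edit, and upload round trip that restores the ownership and permissions afterward. The following is a minimal sketch under stated assumptions: it treats the conda rc locations above as HDFS paths, the function name `edit_conda_rc` is hypothetical, and it must run as a user who is allowed to change ownership of the file in HDFS.

```shell
# Hypothetical helper; not shipped with the service.
# Assumes the conda rc locations are HDFS paths.

# Download a conda rc file, open it in an editor, upload it again,
# and restore the service user as owner with 644 permissions.
# usage: edit_conda_rc <service_user> <x86_64|ppc64le>
edit_conda_rc() {
  local user="$1" arch="$2" remote tmp status
  remote="/user/${user}/environments/conda/conda_rc_${arch}.yaml"
  tmp=$(mktemp -d) || return 1
  hdfs dfs -get "$remote" "$tmp/conda_rc.yaml" &&
    "${EDITOR:-vi}" "$tmp/conda_rc.yaml" &&
    hdfs dfs -put -f "$tmp/conda_rc.yaml" "$remote" &&
    hdfs dfs -chown "$user" "$remote" &&
    hdfs dfs -chmod 644 "$remote"
  status=$?
  rm -rf "$tmp"
  return "$status"
}

# Example (not run here):
#   edit_conda_rc svc_dsxhi x86_64
```

In a standard conda rc file, additional channels are listed under the `channels:` key. Check the existing entries in the file before adding to them.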
Configuring custom certificates
You can use your existing certificates without modifying the system truststore. The following configuration properties customize how DSXHI handles certificates:
- custom_jks
  DSXHI typically generates a keystore, converts it to a .crt, and adds the .crt to the Java truststore. With this property, you can instead provide a custom keystore that DSXHI uses to generate the required .crt.
- dsxhi_cacert
  DSXHI previously detected the appropriate truststore to use as part of the installation. With the dsxhi_cacert property, you can provide any custom truststore (CACERTS) to which the DSXHI certificates are added.
- add_certs_to_truststore
  This property controls whether you or DSXHI adds the host certificate to the truststore. If you set it to False, you must add the host certificate to the truststore yourself, and DSXHI makes no changes to the truststore. If you set it to True, DSXHI retains its default behavior and adds the host certificate to the Java truststore and on detected datanodes for gateway and web services.
Learn more
See Uninstalling the service on a Hadoop cluster for information on uninstalling the service.