Set up Hortonworks Data Platform (HDP) to work with DSX Local

DSX Local allows users to securely access the data residing on an HDP cluster and to submit jobs that use the compute resources of the HDP cluster. DSX Local interacts with an HDP cluster through four services: WebHDFS, WebHCAT, Livy for Spark, and Livy for Spark2. WebHDFS is used to browse and preview HDFS data. WebHCAT is used to browse and preview Hive tables. Livy for Spark and Livy for Spark2 are used to submit jobs to the Spark or Spark2 engines on the Hadoop cluster.
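
For example, WebHDFS and Livy are plain REST APIs; requests like the following (the hostnames are placeholders, and the ports shown are only common HDP defaults) list an HDFS directory and the active Livy sessions:

    # List an HDFS directory through WebHDFS
    curl -i "http://<namenode-host>:50070/webhdfs/v1/tmp?op=LISTSTATUS"
    # List active Livy sessions
    curl "http://<livy-host>:8998/sessions"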

Setting up an HDP cluster for DSX Local entails installing and configuring the four services. Additionally, for kerberized clusters, the setup entails configuring a gateway with JWT-based authentication to securely authenticate requests from DSX Local users. The following tasks must be performed by a Hadoop admin.

Versions supported

  • HDP Version 2.5.6 and later fixpacks
  • HDP Version 2.6.2 and later fixpacks

Platforms supported

DSXHI is supported on all platforms that the supported HDP versions run on.

Available options for set up

Version 1.2 introduces the DSXHI service, which eases the setup of an HDP cluster for DSX Local. Using DSXHI is the recommended approach, and it adds the ability to schedule jobs as YARN applications. The approach without DSXHI continues to be supported.

Option 1: Set up an HDP cluster with DSXHI

DSXHI diagram

The DSXHI service should be installed on an edge node of the HDP cluster. The gateway component will authenticate all incoming requests and forward them to the Hadoop services. In a kerberized cluster, the keytab of the DSXHI service user and the spnego keytab for the edge node will be used to acquire the ticket to communicate with the Hadoop services. All requests to the Hadoop services will be submitted as the DSXL user.

Edge node hardware requirements
  • 8 GB memory
  • 2 CPU cores
  • 100 GB disk, mounted and available on /var in the local Linux file system
  • 10 Gb network interface card recommended for multi-tenant environments (1 Gb network interface card if WebHDFS will not be heavily utilized)
Create an edge node
The DSXHI service can be installed on a shared edge node if the resources listed above are exclusively available to DSXHI. To create a new edge node, follow the steps in the HDP documentation. When the edge node is successfully created, it should have the following components (a quick verification sketch follows this list):
  • The Hadoop client installed.
  • The Spark client installed if the HDP cluster has a Spark service.
  • The Spark2 client installed if the HDP cluster has a Spark2 service.
  • For a kerberized cluster, the spnego keytab copied to /etc/security/keytabs/spnego.service.keytab.
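A minimal way to spot-check these components on the edge node (the paths assume a standard HDP client layout; adjust them to your installation):

    hadoop version                                              # Hadoop client
    /usr/hdp/current/spark-client/bin/spark-submit --version    # Spark client, if present
    /usr/hdp/current/spark2-client/bin/spark-submit --version   # Spark2 client, if present
    ls /etc/security/keytabs/spnego.service.keytab              # kerberized clusters only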
Additional prerequisites
In addition, the following requirements should be met on the edge node:
  • Have Python 2.7 installed.
  • Have curl 7.19.7-53 or later to allow secure communication between DSXHI and DSX Local.
  • Have a service user that can run the DSXHI service. This user must be a valid Linux user with a home directory created in HDFS.
  • The service user must have the necessary Hadoop proxyuser privileges in the HDFS, WebHCAT, and Livy services to access data and submit asynchronous jobs as DSX Local users (a setup sketch follows this list).
  • For a kerberized cluster: Have the keytab for the service user. This eliminates the need for every DSX Local user to have a valid keytab.
  • Have an available port for the DSXHI service. This port must be exposed for access from the DSX Local clusters that need to connect to the HDP cluster.
  • Have an available port for the DSXHI REST service. This port does not need to be exposed for external access.
  • Depending on the services to be exposed by DSXHI, have an available port for Livy for Spark and for Livy for Spark 2. These ports do not need to be exposed for external access.
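As an illustration of the service-user prerequisites above, the following sketch assumes a hypothetical service user named dsxhi; the exact proxyuser property names, and whether you set them through Ambari, depend on your cluster:

    # Create the HDFS home directory for the DSXHI service user (run as the hdfs superuser)
    sudo -u hdfs hdfs dfs -mkdir -p /user/dsxhi
    sudo -u hdfs hdfs dfs -chown dsxhi:dsxhi /user/dsxhi

    # Proxyuser privileges are typically granted by adding properties such as
    # hadoop.proxyuser.dsxhi.hosts and hadoop.proxyuser.dsxhi.groups to core-site
    # (with the equivalent settings for WebHCAT and Livy) and restarting the affected services.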
Install DSXHI
To install and configure DSXHI service on the edge node, the Hadoop admin must complete the following tasks:
  1. Download the DSXHI RPM file (dsxhi_<platform>.rpm) to the edge node, for example, dsxhi_x86_64.rpm.
  2. Run the RPM installer. The RPM installs to /opt/ibm/dsxhi.
  3. Create a /opt/ibm/dsxhi/conf/dsxhi_install.conf file using /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.HDP as a reference, and edit the values in the conf file. For guidance, see the inline documentation in the dsxhi_install.conf.template.HDP file. When installing on the Power platform, set package_installer_tool=yum and packages=lapack so that the installer installs the packages needed for virtual environments.
  4. In /opt/ibm/dsxhi/bin, run the ./install.py script to install the DSXHI service. The script prompts for input on the following options (alternatively, you can specify the options as flags; a combined example follows this list):
    • Accept the license terms (DSXHI uses the same license as DSX Local). You can also accept the license through the dsxhi_license_acceptance property in dsxhi_install.conf.
    • If the Ambari URL is specified in dsxhi_install.conf, you will be prompted for the password for the cluster administrator. The value can also be passed through the --password flag.
    • The master secret for the gateway service. The value can also be passed through the --dsxhi_gateway_master_password flag.
    • If the default password for Java cacerts truststore has been changed, the password can be passed through the --dsxhi_java_cacerts_password flag.
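Put together, a typical install sequence on the edge node looks like the following sketch (the flags you actually need depend on your dsxhi_install.conf; values in angle brackets are placeholders):

    rpm -ivh dsxhi_x86_64.rpm
    cp /opt/ibm/dsxhi/conf/dsxhi_install.conf.template.HDP /opt/ibm/dsxhi/conf/dsxhi_install.conf
    vi /opt/ibm/dsxhi/conf/dsxhi_install.conf     # edit the values described above
    cd /opt/ibm/dsxhi/bin
    ./install.py --password <ambari_admin_password> \
                 --dsxhi_gateway_master_password <gateway_master_secret>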

The installation will run pre-checks to validate the prerequisites. If the cluster_manager_url is not specified in the dsxhi_install.conf file, then the pre-checks on the proxyuser settings will not be performed.

After a successful installation, the necessary components (DSXHI gateway service and DSXHI rest service) and optional components (Livy for Spark and Livy for Spark 2) will be started. The component logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2. The component PIDs are stored in /var/run/dsxhi, /var/run/livy, /var/run/livy2, and /opt/ibm/dsxhi/gateway/logs/.

To add a cacert to the DSXHI REST service, go to the /opt/ibm/dsxhi/bin/util directory on the edge node and run the add_cert.sh script with the server address, for example, bash add_cert.sh https://master-1.ibm.com:443.

Manage the DSXHI service
Periodically, the Hadoop admin must manage the DSXHI service. These tasks include:
Check status of the DSXHI service
In /opt/ibm/dsxhi/bin, run ./status.py to check the status of the DSXHI gateway, DSXHI REST server, Livy for Spark, and Livy for Spark 2 services.
Start the DSXHI service
In /opt/ibm/dsxhi/bin, run ./start.py to start the DSXHI gateway, DSXHI REST server, Livy for Spark, and Livy for Spark 2 services.
Stop the DSXHI service
In /opt/ibm/dsxhi/bin, run ./stop.py to stop the DSXHI gateway, DSXHI REST server, Livy for Spark, and Livy for Spark 2 services.
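For example, restarting all DSXHI components amounts to stopping and then starting them:

    cd /opt/ibm/dsxhi/bin
    ./status.py               # confirm the current state
    ./stop.py && ./start.py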
Add certificates for SSL enabled services
If the WebHDFS service is SSL enabled, the certificates of the NodeManagers and DataNodes should be added to the DSXHI gateway trust store. In /opt/ibm/dsxhi/bin/util, run ./add_cert.sh https://host:port for each of the NodeManagers and DataNodes to add its certificate to the DSXHI gateway trust store.
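For example, with hypothetical DataNode and NodeManager hostnames and HTTPS ports (replace them with the hosts and ports of your cluster), the certificates can be added in a loop:

    cd /opt/ibm/dsxhi/bin/util
    for node in datanode1.example.com:50475 nodemanager1.example.com:8044; do
      ./add_cert.sh "https://${node}"
    done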
Manage DSX Local for DSXHI
To maintain control over access to a DSXHI service, a Hadoop admin needs to maintain a list of known DSX Local clusters that can access the DSXHI service. A DSX Local cluster is known by its URL, which is passed in when adding to or deleting from the known list. A Hadoop admin can add (or delete) multiple DSXL clusters in one call by passing in a comma-separated list of DSXL cluster URLs. Irrespective of the order in which the add and delete arguments are specified, the deletes are applied first and then the adds.
Add a DSX Local cluster to the known list
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --add "url1,url2...urlN". Once a DSX Local cluster is added to the known list, the necessary authentication will be set up and the DSX admin can be given a URL to securely connect to the DSXHI service.
Delete a DSX Local cluster from the known list
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --delete "url1,url2...urlN".
DSXHI URL for secure access from DSX Local
In /opt/ibm/dsxhi/bin, run ./manage_known_dsx.py --list to list a table of all known DSX Local clusters and the associated URL that can be used to securely connect from a DSX Local cluster to a DSXHI service. The DSX admin can then register the DSXHI cluster.
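For example, with hypothetical DSX Local URLs, the following adds two clusters and removes a retired one in a single invocation (the delete is applied first), and then lists the result:

    cd /opt/ibm/dsxhi/bin
    ./manage_known_dsx.py --add "https://dsxl-a.example.com,https://dsxl-b.example.com" --delete "https://dsxl-old.example.com"
    ./manage_known_dsx.py --list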
Uninstall the DSXHI service
To uninstall the DSXHI service and remove the files from /opt/ibm/dsxhi, a Hadoop admin can run the ./uninstall.py script in /opt/ibm/dsxhi/bin. The uninstallation logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2.

Option 2: Set up an HDP cluster without DSXHI

An HDP cluster can be set up for the DSX Local cluster without using the DSXHI service. If your HDP cluster does not use Kerberos security, ensure that DSX Local can access the host and port of the services; no additional configuration is needed. If the HDP cluster uses Kerberos security, follow the steps outlined below.

HDP requirements (see HDP documentation for guidance):

  • Knox service is installed with SSL enabled.
  • The Livy service for Spark or Livy service for Spark 2 must be set up to be accessible through Knox. You can verify this by checking that the service.xml file exists in either /usr/hdp/current/knox-server/data/services/livy/0.1.0/ or /usr/hdp/current/knox-server/data/services/livy2/0.1.0/. See Adding Livy Server as service to Apache Knox for details (adjust the steps accordingly for Livy service for Spark 2).
  • The Livy service for Spark or Livy service for Spark 2 should have the rewrite rule definition to support impersonation. You can verify this by checking that the rewrite.xml file exists in either /usr/hdp/current/knox-server/data/services/livy/0.1.0/ or /usr/hdp/current/knox-server/data/services/livy2/0.1.0/. See Adding Livy Server as service to Apache Knox for details (adjust the steps accordingly for Livy service for Spark 2).
  • In Spark > Configs in the Ambari web client, you must edit the Livy conf file for Spark to add the property livy.superusers=knox, and then restart the Spark service.
  • In Spark2 > Configs in the Ambari web client, you must edit the Livy conf file for Spark2 to add the property livy.superusers=knox, and then restart the Spark2 service.

To configure the HDP cluster, you must create a new Knox topology named dsx that is based on JWT authentication and has the service entries for Livy for Spark, Livy for Spark 2, WebHDFS, and WebHCAT. Complete the following steps:

  1. Go to https://9.87.654.320/auth/jwtcert (where https://9.87.654.320 represents the DSX Local URL) and save the public SSL certificate jwt.cert. Alternatively, run a curl command to download the SSL certificate from DSX Local:
    curl -k https://9.87.654.320/auth/jwtcert
  2. In the /usr/hdp/current/knox-server/conf/topologies directory of your Knox server, create a new topology for DSX named dsx.xml, and paste the key from the SSL certificate (between BEGIN CERTIFICATE and END CERTIFICATE) into the <value> tag. Also, ensure you have service entries for Livy for Spark, Livy for Spark 2, WebHDFS, and WebHCAT. Example:
    <topology>
      <gateway>
        <provider>
          <role>federation</role>
          <name>JWTProvider</name>
          <enabled>true</enabled>
          <param>
            <name>knox.token.verification.pem</name>
            <value>MIIDb...Zpuw</value>
          </param>
        </provider>
        <provider>
          <role>identity-assertion</role>
          <name>Default</name>
          <enabled>true</enabled>
        </provider>
        <provider>
          <role>authorization</role>
          <name>AclsAuthz</name>
          <enabled>true</enabled>
        </provider>
      </gateway>
      <service>
        <role>LIVYSERVER</role>
        <url>http://9.87.654.323:8998</url>
      </service>
      <service>
        <role>LIVYSERVER2</role>
        <url>http://9.87.654.322:8999</url>
      </service>
      <service>
        <role>WEBHDFS</role>
        <url>http://9.87.654.321:50070/webhdfs</url>
      </service>
      <service>
        <role>WEBHCAT</role>
        <url>http://9.87.543.324:50111/templeton</url>
      </service>
    </topology>
  3. Touch the dsx.xml file to update the timestamp on it.
  4. Restart the Knox server to detect the new topology. The URLs for the four services will be as follows (an optional connectivity check follows these steps):
    https://knoxhost:8443/gateway/dsx/webhdfs/v1
    https://knoxhost:8443/gateway/dsx/templeton/v1
    https://knoxhost:8443/gateway/dsx/livy/v1
    https://knoxhost:8443/gateway/dsx/livy2/v1
  5. Configure DSX Local to work with the HDP cluster. See set up for details.
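As the optional connectivity check mentioned in step 4, a JWT token issued by DSX Local can be presented to the new topology as a bearer token (the token value and knoxhost are placeholders; this sketch only confirms that Knox accepts the token and proxies WebHDFS):

    curl -k -H "Authorization: Bearer <dsx_local_jwt_token>" \
      "https://knoxhost:8443/gateway/dsx/webhdfs/v1/?op=LISTSTATUS"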