Administering Cloud Pak for Data clusters
After the Execution Engine for Apache Hadoop service is installed, one of the administrative tasks that must be done is to register the remote Hadoop clusters. Registering a remote cluster integrates it with Cloud Pak for Data clusters. Data scientists can then access data and submit jobs with high availability.
For the Cloud Pak for Data cluster to communicate with the remote cluster, the OpenShift DNS operator of the Cloud Pak for Data cluster must be able to resolve:
- The host name of the node of the remote cluster on which the Execution Engine for Apache Hadoop service is installed
- The hosts that some services depend on (a quick resolution check is sketched after this list)
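As a quick sanity check, a short script such as the following, run from a pod on the Cloud Pak for Data cluster, can confirm that the relevant host names resolve. This is a minimal sketch; the host names are placeholders for your environment.

```python
# Minimal sketch: verify that the host names the registration depends on resolve.
# The host names below are placeholders; substitute the edge node and any
# dependent service hosts for your environment.
import socket

hosts = [
    "hadoop-edge01.example.com",   # placeholder: edge node running Execution Engine for Apache Hadoop
    "hive-metastore.example.com",  # placeholder: a host that a dependent service might use
]

for host in hosts:
    try:
        print(f"{host} -> {socket.gethostbyname(host)}")
    except socket.gaierror as err:
        print(f"{host} could not be resolved: {err}")
```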
Registering remote clusters
The Execution Engine for Apache Hadoop service integrates Cloud Pak for Data clusters with the remote Hadoop cluster so that users can securely access data and submit jobs on the Hadoop cluster with high availability.
The remote cluster admin must first install the Execution Engine for Apache Hadoop service on the edge nodes of the remote Hadoop cluster. The admin then adds the Cloud Pak for Data cluster to the service that is installed on each of the edge nodes and provides the secure URLs and the service user for the service.
A Cloud Pak for Data administrator can then perform the following tasks:
- Register a remote cluster
- View details about the registered cluster
- Push runtime images to the registered remote cluster
- Handle high availability for the Execution Engine for Apache Hadoop service
Register a remote cluster
Sign in as the administrator and click Administration > Configurations > Systems integration to register your remote clusters. Click New integration, assign the registration a name, and provide the Service URLs and Service User ID that you received from the remote cluster admin.
If the registration fails, verify the following:
- Ensure that the URL that was provided during the registration is correct (a basic connectivity check is sketched after this list). Refer to the Managing access for Watson Studio section in Administering Apache Hadoop clusters.
- Contact the Hadoop admin who installed the service on the Hadoop cluster and confirm that the service user ID that was provided during the registration is correct.
- Ensure that the OpenShift DNS operator is configured to successfully resolve the host name in the URL that was provided during the registration.
- Contact the OpenShift administrator to inspect the logs of the utils-api pod for further diagnostic information.
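Before or after registering, it can also help to confirm that a service URL is reachable over TLS from the Cloud Pak for Data cluster. A minimal sketch, assuming the requests library; the URL and certificate path are placeholders, and the expected response depends on how the service was configured.

```python
# Minimal connectivity check for a service URL received from the remote cluster admin.
# The URL and certificate path are placeholders; adjust them for your environment.
import requests

service_url = "https://hadoop-edge01.example.com:8443"  # placeholder service URL
ca_bundle = "/path/to/hadoop-ca.pem"                    # placeholder CA certificate

try:
    response = requests.get(service_url, verify=ca_bundle, timeout=10)
    print(f"Reached {service_url}: HTTP {response.status_code}")
except requests.exceptions.SSLError as err:
    print(f"TLS verification failed, check the certificates: {err}")
except requests.exceptions.ConnectionError as err:
    print(f"Could not connect, check DNS and firewall rules: {err}")
```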
Deleting a remote cluster registration
If you need to delete a registration, be aware that user assets that depend on this registration will no longer work properly. This includes connectors, environments, jobs, and notebooks that depend on the environments.
If a registration is later created with the same ID, users must still re-create the environment and update all jobs and notebooks to reference the newly created environment for the assets to work properly. If the registration is created with a different ID, users must also update connections to ensure that the referenced URL is correct, in addition to the updates needed for the jobs and notebooks.
If you need to refresh the registration, for example after you reinstalled Execution Engine for Apache Hadoop on the remote cluster, select the registration. Refresh the certificates first, and then wait a few minutes for the dependent pod to be re-created. Then refresh the endpoints to ensure that all configurations are refreshed.
View details about the registered cluster
In the Details page of each registration, you can view the endpoints, the edge nodes in the high availability setup, and runtimes.
- For remote Hadoop clusters, depending on the services that were exposed when the Execution Engine for Apache Hadoop service was installed and configured, the Details page lists the WebHDFS, WebHCat, Livy Spark 2, and JEG endpoints (a WebHDFS example follows).
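As an illustration of one of these endpoints, the WebHDFS REST API can be exercised directly to list an HDFS directory. This is a sketch with placeholder host, port, path, user, and certificate values; secured clusters typically require Kerberos/SPNEGO authentication rather than the simple user.name parameter shown here.

```python
# List a directory through the WebHDFS endpoint of a registered Hadoop cluster.
# Host, port, path, and user are placeholders; secured clusters usually require
# Kerberos/SPNEGO instead of the simple user.name parameter used below.
import requests

webhdfs_url = "https://hadoop-edge01.example.com:50470/webhdfs/v1"  # placeholder endpoint
hdfs_path = "/user/datasci"                                         # placeholder HDFS path

response = requests.get(
    f"{webhdfs_url}{hdfs_path}",
    params={"op": "LISTSTATUS", "user.name": "clouduser"},  # placeholder user
    verify="/path/to/hadoop-ca.pem",                        # placeholder CA certificate
    timeout=30,
)
response.raise_for_status()

# WebHDFS returns a FileStatuses/FileStatus structure for LISTSTATUS.
for status in response.json()["FileStatuses"]["FileStatus"]:
    print(status["type"], status["pathSuffix"])
```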
Push runtime images to the registered remote cluster
Data scientists can leverage the Python packages and custom libraries that are installed in a Jupyter Python environment when they're working with models on a remote cluster. These remote images can be used to run notebooks, notebook jobs, and Python script jobs (for Hadoop clusters only).
To push a runtime image to the remote cluster from its Details page, an admin uses the push operation to initiate the process.
- For remote Hadoop clusters, an image archive is first built on the Cloud Pak for Data cluster, and that archive is then pushed as-is to HDFS on the target Hadoop cluster. This operation is useful when the architecture and GCC library versions on the Hadoop cluster nodes are compatible with those on the Cloud Pak for Data cluster nodes.
After the image is pushed, the subset of libraries that is available in the local Cloud Pak for Data Python environment can be sent to the remote Anaconda instances, where a matching environment is created from the specified Anaconda channels. Users on Cloud Pak for Data can then leverage the new Anaconda environment to run their notebooks, providing an experience similar to running the notebook in a local Cloud Pak for Data environment.
The set of Python libraries is filtered so that internal IBM packages are not exported. A conceptual sketch of this export-and-recreate flow follows.
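The sketch below only illustrates the idea of exporting a local environment specification, filtering it, and re-creating an environment from specified channels; it is not the service's actual implementation. The "ibm-" package prefix, file name, environment name, and channel are assumptions.

```python
# Illustrative sketch of an export-filter-recreate flow; the actual service logic differs.
# Assumes conda is on PATH; the "ibm-" prefix filter and the channel are placeholders.
import subprocess

# Export the local environment specification (package=version=build lines).
spec = subprocess.run(
    ["conda", "list", "--export"], capture_output=True, text=True, check=True
).stdout

# Filter out internal packages before sending the spec to the remote Anaconda instance.
filtered = [line for line in spec.splitlines()
            if line and not line.startswith("#") and not line.startswith("ibm-")]

with open("environment-spec.txt", "w") as f:
    f.write("\n".join(filtered) + "\n")

# On the remote Anaconda instance, an equivalent environment could then be created
# from the specified channels, for example:
#   conda create --name pushed-env --file environment-spec.txt --channel conda-forge
```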
Pushing the image can take a long time. If the node goes down, retry and push the image again.
If you modified any of the runtime images locally, you can update the image on the remote cluster by clicking Replace Image next to it.
Runtimes can have the following statuses:
- Available on Cloud Pak for Data, but not pushed to the registered remote cluster. Users can either push or refresh the environment to the registered remote cluster.
- Pending transfer from Cloud Pak for Data to the registered remote cluster.
- Failed transfer from Cloud Pak for Data to the registered remote cluster.
- Push succeeded on the registered remote cluster.
Handle high availability for the Execution Engine for Apache Hadoop service
Edge node failure
If there's an edge node failure in the remote environment, the following activities occur:
- Data access via WebHDFS: Data browse and preview tools that access WebHDFS are reconnected to the next available edge node.
- Interactive notebooks: Any active Livy sessions that resided on the failed node must be restarted and run again.
- Data Refinery: Any running shaping jobs must be re-submitted. Any new jobs that are started are sent to the active edge node.
- Remote jobs: Any jobs running on this remote environment must be re-submitted. Any new jobs that are started are sent to the active edge node.
Load balancing with multiple Execution Engine for Apache Hadoop edge nodes
- WebHDFS transfers are allocated with round robin, balancing the network traffic between Watson Studio and the Hadoop edge nodes.
- Livy Sessions are allocated with sticky sessions, following an active and passive approach. All Livy sessions are run on the same Execution Engine edge node until a failure is detected, at which point all new sessions are allocated on the next available Execution Engine edge node.
- Similar to Livy, JEG sessions are allocated with sticky sessions and follow an active and passive approach. All JEG sessions are run on the same Execution Engine edge node until a failure is detected, at which point all new sessions are allocated on the next available Execution Engine edge node. A conceptual sketch of both allocation strategies follows this list.
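A minimal sketch of the two allocation strategies, with placeholder edge node names; it illustrates the concept only and is not the service's implementation.

```python
# Conceptual sketch of round-robin versus sticky (active/passive) allocation.
import itertools

edge_nodes = ["edge01.example.com", "edge02.example.com"]  # placeholder edge nodes

# Round robin (WebHDFS transfers): spread requests across all edge nodes.
round_robin = itertools.cycle(edge_nodes)
for _ in range(4):
    print("WebHDFS transfer ->", next(round_robin))

# Sticky sessions (Livy and JEG): keep using one edge node until a failure is
# detected, then send all new sessions to the next available edge node.
class StickyAllocator:
    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.active = 0

    def allocate(self):
        return self.nodes[self.active]

    def mark_failed(self):
        # Fail over: all new sessions go to the next available edge node.
        self.active = (self.active + 1) % len(self.nodes)

sticky = StickyAllocator(edge_nodes)
print("Livy session ->", sticky.allocate())
sticky.mark_failed()  # a failure is detected on the active node
print("Livy session ->", sticky.allocate())
```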
Parent topic: Administering Execution Engine for Apache Hadoop