Establishing connection to Cloud Pak for Data clusters

After the Execution Engine for Apache Hadoop service is installed, one of the administrative tasks that must be done is to register the remote Hadoop clusters. Registering a remote cluster integrates it with the Cloud Pak for Data cluster so that data scientists can access data and submit jobs with high availability.

For the Cloud Pak for Data cluster to communicate with the remote cluster, the OpenShift DNS operator of the Cloud Pak for Data cluster must be able to resolve:

  • The hostname of each remote cluster node on which the Execution Engine for Apache Hadoop service is installed
  • The hosts that dependent services rely on
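
A quick way to confirm that these names resolve from inside the Cloud Pak for Data cluster is to run a short check from a pod on that cluster, so that the OpenShift DNS configuration is the one being exercised. The following Python sketch is illustrative only; the hostnames are placeholders that you would replace with the values for your remote cluster.

```python
# Minimal sketch: verify that the hostnames needed by the integration resolve
# from inside the Cloud Pak for Data cluster. The hostnames are placeholders.
import socket

HOSTS_TO_CHECK = [
    "hadoop-edge01.example.com",   # edge node where the service is installed (placeholder)
    "hadoop-nn01.example.com",     # example dependent host (placeholder)
]

for host in HOSTS_TO_CHECK:
    try:
        address = socket.gethostbyname(host)
        print(f"{host} resolves to {address}")
    except socket.gaierror as err:
        print(f"{host} does NOT resolve: {err}")
```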

Registering remote clusters

The Execution Engine for Apache Hadoop service integrates remote Hadoop clusters with Cloud Pak for Data clusters so that users can securely access data and submit jobs on the Hadoop cluster with high availability.

The remote cluster admin must first install the Execution Engine for Apache Hadoop service on the edge nodes of the remote Hadoop cluster, add the Cloud Pak for Data cluster to the service that is installed on each edge node, and then provide the secure service URLs and the service user ID to the Cloud Pak for Data administrator.
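
Before registration, the Cloud Pak for Data administrator may want to confirm that the secure URLs received from the remote cluster admin answer over HTTPS. The following Python sketch is only an illustration of such a check; the URLs and the CA bundle path are hypothetical placeholders.

```python
# Illustrative sketch only: confirm that the secure service URLs provided by
# the remote cluster admin respond over HTTPS. All values are placeholders.
import ssl
import urllib.error
import urllib.request

SERVICE_URLS = [
    "https://hadoop-edge01.example.com:8443",   # primary edge node (placeholder)
    "https://hadoop-edge02.example.com:8443",   # optional secondary edge node (placeholder)
]
CA_BUNDLE = "/path/to/remote-cluster-ca.pem"    # CA certificate for the remote service (placeholder)

context = ssl.create_default_context(cafile=CA_BUNDLE)
for url in SERVICE_URLS:
    try:
        with urllib.request.urlopen(url, context=context, timeout=10) as response:
            print(f"{url} answered with HTTP {response.status}")
    except (urllib.error.URLError, OSError) as err:
        print(f"{url} is not reachable: {err}")
```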

A Cloud Pak for Data administrator can then complete the following tasks:

  1. Register a remote cluster
  2. Push runtime images to the registered remote cluster

Registering a remote cluster

To register a remote cluster:

  1. Sign in as the administrator.
  2. From the navigation menu (the menu icon), select Administration > Configurations and settings.
  3. Select the Hadoop Execution Engine tile.
  4. Click New integration.
  5. Assign the registration a name and provide the Service URLs and Service User ID that you received from the remote cluster admin.
    1. Optional: To enable high availability mode, add another service URL to register a secondary node that takes over when the active node goes down. For more information, see High availability.
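
To illustrate what the registration details amount to, the following sketch models them as a simple Python structure and shows how a secondary service URL conceptually supports failover. The field names, URLs, and service user ID are assumptions for illustration, not the product's API.

```python
# Illustrative sketch only: a minimal model of the registration details from
# step 5. Field names and values are assumptions, not the product's API.
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class HadoopRegistration:
    name: str                 # display name you assign in step 5
    service_urls: List[str]   # first URL is the active node; extra URLs enable high availability
    service_user_id: str      # service user ID received from the remote cluster admin

    def active_url(self, is_healthy: Callable[[str], bool]) -> Optional[str]:
        """Return the first service URL that passes the supplied health check."""
        for url in self.service_urls:
            if is_healthy(url):
                return url
        return None

# Example usage with a stubbed health check (all values are placeholders):
registration = HadoopRegistration(
    name="prod-hadoop",
    service_urls=[
        "https://hadoop-edge01.example.com:8443",
        "https://hadoop-edge02.example.com:8443",   # secondary node for high availability
    ],
    service_user_id="service-user",
)
print(registration.active_url(lambda url: True))
```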

To view details of the registered remote cluster or delete the cluster, see Managing connection to Cloud Pak for Data clusters.

If registration fails, see troubleshooting steps.

Pushing runtime images to the registered remote cluster

You can use the Python packages and custom libraries that are installed in a Jupyter Python environment when you work with models on a remote cluster. These remote images can be used to run notebooks, notebook jobs, and Python script jobs (for Hadoop clusters only).

Restriction: This feature is not supported for RStudio and Jupyter with GPU images.

Requirements:

  • The Watson Studio cluster and the Hadoop cluster must be on the same platform architecture to work with the runtime images.
  • Only Administrators can push runtime images to the registered remote cluster.
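
As a quick sanity check on the architecture requirement, you can compare the machine architecture reported on the Cloud Pak for Data cluster with the value reported on the Hadoop nodes (for example, by running uname -m there). The following sketch only prints the local values.

```python
# Minimal sketch: print the local machine architecture so that it can be
# compared with the architecture reported on the Hadoop cluster nodes.
import platform

print(f"Machine architecture: {platform.machine()}")   # e.g., x86_64 or ppc64le
print(f"Platform details:     {platform.platform()}")
```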

To push a runtime image to the Hadoop cluster:

  1. Go to the Details page of the Execution Engine for Apache Hadoop service.
  2. As an administrator, use the push operation to build an image archive on the Cloud Pak for Data cluster.
  3. Push the archive as-is to HDFS on the target Hadoop cluster.

While an image is being pushed, the node might go down. If the node goes down, retry the push.

For remote Hadoop clusters, the push operation first builds an image archive on the Cloud Pak for Data cluster and then pushes that archive as-is to HDFS on the target Hadoop cluster. This operation is useful when the architecture and GCC library versions on the Hadoop cluster nodes are compatible with those on the Cloud Pak for Data cluster nodes.
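
The following Python sketch shows the general shape of such a build-and-push step; it is not the service's implementation. It archives a local image directory and copies the archive to HDFS with the standard hdfs dfs -put command, and all paths are placeholders.

```python
# Illustrative sketch only (not the service's implementation): archive a local
# runtime image directory and copy the archive as-is to HDFS on the target
# Hadoop cluster. All paths below are placeholders.
import shutil
import subprocess

LOCAL_IMAGE_DIR = "/tmp/runtime-image"            # placeholder: unpacked image contents
ARCHIVE_BASENAME = "/tmp/runtime-image-archive"   # placeholder: archive path without extension
HDFS_TARGET_DIR = "/user/placeholder/environments"  # placeholder: HDFS destination

# Build a .tar.gz archive of the image directory on the Cloud Pak for Data side.
archive_path = shutil.make_archive(ARCHIVE_BASENAME, "gztar", LOCAL_IMAGE_DIR)

# Push the archive as-is to HDFS on the remote Hadoop cluster.
subprocess.run(
    ["hdfs", "dfs", "-put", "-f", archive_path, HDFS_TARGET_DIR],
    check=True,
)
print(f"Pushed {archive_path} to {HDFS_TARGET_DIR}")
```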

If you modified any of the runtime images locally, you can update them on the remote cluster by clicking Replace Image next to the image.

Note: The Conda version of the image that is pushed must match one of the available Anaconda instances that are defined.

After the image is pushed, the subset of libraries that is available in the Python environments local to Cloud Pak for Data can be sent to the remote Anaconda instances, where the environments are created by using the specified Anaconda channels. Users on Cloud Pak for Data can then use the new Anaconda environment to run their notebooks, which provides an experience similar to running the notebook in a local Cloud Pak for Data environment.

The set of Python libraries is filtered so that internal IBM packages are not exported.
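
As a rough illustration of that filtering and channel-based re-creation, the sketch below drops packages that match an assumed internal prefix from an exported package list and writes a conda environment.yml that names the channels to build from. The prefix, package names, and channels are placeholders, not the product's actual behavior.

```python
# Illustrative sketch only: filter an exported package list so that assumed
# internal packages are excluded, then write a conda environment.yml that a
# remote Anaconda instance could build from the listed channels.
# The prefix, packages, and channels below are placeholders.
INTERNAL_PREFIXES = ("internal-",)   # assumed naming convention, not IBM's actual one

exported_packages = [
    "numpy=1.24.4",
    "pandas=1.5.3",
    "internal-telemetry=2.1.0",      # would be filtered out
]

public_packages = [
    pkg for pkg in exported_packages
    if not pkg.startswith(INTERNAL_PREFIXES)
]

environment_yml = "\n".join(
    ["name: cpd-runtime-env",
     "channels:",
     "  - defaults",                 # placeholder Anaconda channels
     "  - conda-forge",
     "dependencies:"]
    + [f"  - {pkg}" for pkg in public_packages]
)

with open("environment.yml", "w") as f:
    f.write(environment_yml + "\n")

print(environment_yml)
```

A remote Anaconda instance could then build the environment from that specification with a command such as conda env create -f environment.yml.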