Administering Apache Hadoop clusters

You can manage and monitor the tools that are available with the Execution Engine for Apache Hadoop service.

Connecting to Alluxio

To connect to Alluxio, use remote Spark with Livy. In the Ambari web client, go to Ambari > HDFS > Config > Custom Core-site > Add property, and add the fs.alluxio.impl configuration for the remote Spark.
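The Add property dialog takes a key and a value. Following the Alluxio documentation, the fs.alluxio.impl key maps the alluxio:// filesystem scheme to the Alluxio Hadoop client; a sketch of typical entries (these assume the Alluxio client jar is already on the Spark classpath):

    fs.alluxio.impl=alluxio.hadoop.FileSystem
    fs.AbstractFileSystem.alluxio.impl=alluxio.hadoop.AlluxioFileSystem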

See Running Spark on Alluxio for details.

Managing services

Periodically, the Hadoop admin must manage the Execution Engine for Apache Hadoop service.
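Typical tasks include checking the status of the service and stopping or starting it. A minimal sketch, assuming the installation provides helper scripts in its bin directory (the directory and script names below are assumptions; verify them against your installation):

    # <service_install_dir> is a hypothetical placeholder for the actual installation path
    <service_install_dir>/bin/status.sh    # check whether the service is running
    <service_install_dir>/bin/stop.sh      # stop the service
    <service_install_dir>/bin/start.sh     # start the service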

Managing access for Watson Studio

To maintain control over access to an Execution Engine for Apache Hadoop service, the Hadoop admin should maintain a list of known Watson Studio clusters that can access the service. A Watson Studio cluster is identified by its URL, which must be passed in when adding to, refreshing, or deleting from the known list. A comma-separated list of Watson Studio cluster URLs can be passed for each operation. Regardless of the order in which the add and delete arguments are specified, deletes are applied first and then adds.

Working with the Watson Studio clusters

To add Watson Studio clusters to the known list:

To refresh Watson Studio clusters in the known list:

To delete Watson Studio clusters from the known list:

To view the known list of Watson Studio clusters:
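Each operation is typically performed with a management utility on the node where the service is installed. A hedged sketch of the four operations, assuming a manage_known_dsx.sh script in the service's bin directory (the script name and flags are assumptions; verify them against your installation):

    # Add one or more Watson Studio clusters (comma-separated URLs)
    ./manage_known_dsx.sh --add https://ws-cluster-1.example.com,https://ws-cluster-2.example.com

    # Refresh existing entries in the known list
    ./manage_known_dsx.sh --refresh https://ws-cluster-1.example.com

    # Delete entries from the known list
    ./manage_known_dsx.sh --delete https://ws-cluster-2.example.com

    # View the current known list
    ./manage_known_dsx.sh --list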

After the Hadoop administrator adds a Watson Studio cluster to the known list maintained by the Execution Engine for Apache Hadoop service, the Watson Studio admin can register the service on the Watson Studio cluster by using the secure URL and the service user. Learn how to register Hadoop clusters.

Logs

Configuring the Hadoop integration with Cloud Pak for Data

When you’re working with Hadoop clusters, the Hadoop admin might want to implement additional configurations for user applications that request resources from a Hadoop system, such as a YARN queue or Spark application sizes. These configurations give the Hadoop admin more control over jobs that are submitted through Execution Engine for Apache Hadoop.

Once the additional configurations are defined, a list of parameters and keys is available, and data scientists can update the settings from the Hadoop environment.

Important: If a Hadoop cluster is already registered in Watson Studio, the Watson Studio administrator must navigate to the existing Hadoop registration and click Refresh to pick up the changes that the Hadoop administrator made.

Setting up a list of Spark configuration parameters

To implement the additional configurations, you must modify the appropriate yarnJobParams*.json file. The files include:

  yarnJobParamsCDH.json for Cloudera systems
  yarnJobParamsHDP.json for Hortonworks systems

To modify the file:

  1. Determine which file to modify based on your Hadoop cluster.
  2. Back up the file before you modify it.
  3. Modify the file. See the details and examples that follow for more information.
  4. Save the file.
  5. Verify that the content is still valid JSON by using the following command (a combined check for both files is sketched after these steps):

    Cloudera system: Run cat yarnJobParamsCDH.json | python -m json.tool

    Hortonworks system: Run cat yarnJobParamsHDP.json | python -m json.tool

    Confirm that the command returns a JSON-formatted object.

  6. If you have a Cloud Pak for Data cluster that has the Hadoop cluster registered, ask your Cloud Pak for Data admin to click Refresh on the registration details page. Refreshing allows the new configurations to be retrieved.
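The step 5 validation can also be run as a single combined check. A minimal sketch, assuming both files are in the current directory (python -m json.tool exits with a nonzero status on invalid JSON):

    for f in yarnJobParamsCDH.json yarnJobParamsHDP.json; do
        python -m json.tool "$f" > /dev/null && echo "$f: valid JSON" || echo "$f: INVALID JSON"
    done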

Details on the content of the JSON files

The file contains three main sections. For example, the scriptLanguages section lists the script languages that are enabled for the cluster:


        "scriptLanguages": [{

                        "language": "R",

                        "version": "system",

                        "enabled": "true"

                }

        ],
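If the Hadoop system has additional script languages installed, they can be enabled by appending entries of the same shape to the array. A sketch, assuming a system-installed Python should also be exposed (the language and version strings below are assumptions; use the values that match your cluster):

        "scriptLanguages": [{
                "language": "R",
                "version": "system",
                "enabled": "true"
        }, {
                "language": "Python",
                "version": "system",
                "enabled": "true"
        }],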

If a specific Spark option is not listed, it can be added as an entry. Review the Spark configuration options for the specific version of Spark 2.x that is running on the Hadoop system.

Example: Providing a list of available queues: default, long_running, short

                {
                        "description": "The YARN queue against which the job should be submitted",
                        "displayName": "List of available YARN queues",
                        "name": "spark.yarn.queue",
                        "type": "enum",
                        "value": "[\"default\", \"long_running\", \"short\"]",
                        "labels": [ "spark" ]
                }

Example: Increase the upper limit of available driver memory to 8196 MB (allowing the user to request more memory)

                {
                        "description": "Available driver memory",
                        "displayName": "Driver memory (in MB)",
                        "max": "8196",
                        "min": "1024",
                        "name": "spark.driver.memory",
                        "type": "int",
                        "labels": [ "spark" ]
                }

Example: If a user needs to tune spark.memory.fraction, add this entry:

                {
                        "name": "spark.memory.fraction",
                        "type": "float",
                        "min": "0.1",
                        "max": "0.9999",
                        "value": ""
                }

If the admin determines that the default spark.memory.fraction should always be 0.9, set it in the value field:

                {
                        "name": "spark.memory.fraction",
                        "type": "float",
                        "min": "0.1",
                        "max": "0.9999",
                        "value": "0.9"
                }