Administering Apache Hadoop clusters

You can manage and monitor the tools that are available with the Execution Engine for Apache Hadoop service.

Connecting to Alluxio

To connect to Alluxio, use remote Spark with Livy. In the Ambari web client, go to Ambari > HDFS > Config > Custom Core-site > Add property, and add the fs.alluxio.impl configuration for the remote Spark.
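The Add property dialog takes a key and a value. Following the Alluxio documentation, the fs.alluxio.impl key maps the alluxio:// filesystem scheme to the Alluxio Hadoop client; a sketch of typical entries (these assume the Alluxio client jar is already on the Spark classpath):

    fs.alluxio.impl=alluxio.hadoop.FileSystem
    fs.AbstractFileSystem.alluxio.impl=alluxio.hadoop.AlluxioFileSystem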

See Running Spark on Alluxio for details.

Managing services

Periodically, the Hadoop admin must manage the Execution Engine for Apache Hadoop service.
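Typical tasks include checking the status of the service and stopping or starting it. A minimal sketch, assuming the installation provides helper scripts in its bin directory (the directory and script names below are assumptions; verify them against your installation):

    # <service_install_dir> is a hypothetical placeholder for the actual installation path
    <service_install_dir>/bin/status.sh    # check whether the service is running
    <service_install_dir>/bin/stop.sh      # stop the service
    <service_install_dir>/bin/start.sh     # start the service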

Managing access for Watson Studio

To maintain control over access to an Execution Engine for Apache Hadoop service, the Hadoop admin should maintain a list of known Watson Studio clusters that can access the service. A Watson Studio cluster is identified by its URL, which must be passed in when adding to, refreshing, or deleting from the known list. A comma-separated list of Watson Studio cluster URLs can be passed for each operation. Regardless of the order in which the add and delete arguments are specified, deletes are applied first and then adds.

Working with the Watson Studio clusters

To add Watson Studio clusters to the known list:

To refresh Watson Studio clusters in the known list:

To delete Watson Studio clusters from the known list:

To view the known list of Watson Studio clusters:
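Each operation is typically performed with a management utility on the node where the service is installed. A hedged sketch of the four operations, assuming a manage_known_dsx.sh script in the service's bin directory (the script name and flags are assumptions; verify them against your installation):

    # Add one or more Watson Studio clusters (comma-separated URLs)
    ./manage_known_dsx.sh --add https://ws-cluster-1.example.com,https://ws-cluster-2.example.com

    # Refresh existing entries in the known list
    ./manage_known_dsx.sh --refresh https://ws-cluster-1.example.com

    # Delete entries from the known list
    ./manage_known_dsx.sh --delete https://ws-cluster-2.example.com

    # View the current known list
    ./manage_known_dsx.sh --list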

After the Hadoop administrator adds a Watson Studio cluster to the known list maintained by the Execution Engine for Apache Hadoop service, the Watson Studio admin can register the service on the Watson Studio cluster by using the secure URL and the service user. Learn how to register Hadoop clusters.

Logs

Configuring the Hadoop integration with Cloud Pak for Data

When you’re working with Hadoop clusters, the Hadoop admin might want to implement additional configurations for user applications that request resources from a Hadoop system, such as a YARN queue or Spark application sizes. These configurations give the Hadoop admin more control over jobs that are submitted through Execution Engine for Apache Hadoop.

Once the additional configurations are defined, a list of parameters and keys is available, and data scientists can update the settings from the Hadoop environment.

Important: If a Hadoop cluster is already registered in Watson Studio, the Watson Studio administrator must navigate to the existing Hadoop registration and click Refresh to pick up the changes that the Hadoop administrator made.

Setting up a list of Spark configuration parameters

To implement the additional configurations, you must modify the appropriate yarnJobParams*.json file. The files include:

  yarnJobParamsCDH.json for Cloudera systems
  yarnJobParamsHDP.json for Hortonworks systems

To modify the file:

  1. Determine which file to modify based on your Hadoop cluster.
  2. Back up the file before you modify it.
  3. Modify the file. See the details and examples that follow for more information.
  4. Save the file.
  5. Verify that the content is still valid JSON by using the following command (a combined check for both files is sketched after these steps):

    Cloudera system: Run cat yarnJobParamsCDH.json | python -m json.tool

    Hortonworks system: Run cat yarnJobParamsHDP.json | python -m json.tool

    Confirm that the command returns a JSON-formatted object.

  6. If you have a Cloud Pak for Data cluster that has the Hadoop cluster registered, ask your Cloud Pak for Data admin to click Refresh on the registration details page. Refreshing allows the new configurations to be retrieved.
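The step 5 validation can also be run as a single combined check. A minimal sketch, assuming both files are in the current directory (python -m json.tool exits with a nonzero status on invalid JSON):

    for f in yarnJobParamsCDH.json yarnJobParamsHDP.json; do
        python -m json.tool "$f" > /dev/null && echo "$f: valid JSON" || echo "$f: INVALID JSON"
    done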

Details on the content of the JSON files

The file contains three main sections. For example, the scriptLanguages section lists the script languages that are enabled for the cluster:


        "scriptLanguages": [{

                        "language": "R",

                        "version": "system",

                        "enabled": "true"

                }

        ],
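If the Hadoop system has additional script languages installed, they can be enabled by appending entries of the same shape to the array. A sketch, assuming a system-installed Python should also be exposed (the language and version strings below are assumptions; use the values that match your cluster):

        "scriptLanguages": [{
                "language": "R",
                "version": "system",
                "enabled": "true"
        }, {
                "language": "Python",
                "version": "system",
                "enabled": "true"
        }],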

If a specific Spark option is not listed, it can be added as an entry. Review the Spark configuration options for the specific version of Spark 2.x that is running on the Hadoop system.

Example: Providing a list of available queues: default, long_running, short

                {
                        "description": "The YARN queue against which the job should be submitted",
                        "displayName": "List of available YARN queues",
                        "name": "spark.yarn.queue",
                        "type": "enum",
                        "value": "[\"default\", \"long_running\", \"short\"]",
                        "labels": [ "spark" ]
                }

Example: Increase the upper limit of available driver memory to 8196 MB (allowing the user to request more memory)

                {
                        "description": "Available driver memory",
                        "displayName": "Driver memory (in MB)",
                        "max": "8196",
                        "min": "1024",
                        "name": "spark.driver.memory",
                        "type": "int",
                        "labels": [ "spark" ]
                }

Example: If a user needs to tune spark.memory.fraction, add this entry:

                {
                        "name": "spark.memory.fraction",
                        "type": "float",
                        "min": "0.1",
                        "max": "0.9999",
                        "value": ""
                }

If the admin determines that the default spark.memory.fraction should always be 0.9, set it in the value field:

                {
                        "name": "spark.memory.fraction",
                        "type": "float",
                        "min": "0.1",
                        "max": "0.9999",
                        "value": "0.9"
                }