Establishing connection to Apache Hadoop clusters

You can manage and monitor the tools that are available with the Execution Engine for Apache Hadoop service.

Managing services

Periodically, the Hadoop admin must manage the Execution Engine for Apache Hadoop service by performing the following tasks. A sample session follows the list.

  • Check the status of the services

    In /opt/ibm/dsxhi/bin, run ./status.py to check the status of services.

  • Start the services

    In /opt/ibm/dsxhi/bin, run ./start.py to start the services.

  • Stop the services

    In /opt/ibm/dsxhi/bin, run ./stop.py to stop the services.
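
For example, a quick maintenance session on the Hadoop cluster might look like the following; all of the paths and scripts are the ones documented above, and the ordering is only illustrative:

    cd /opt/ibm/dsxhi/bin
    ./status.py      # check whether the services are running
    ./stop.py        # stop the services before maintenance
    ./start.py       # start the services again
    ./status.py      # confirm that the services are back up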

Managing access for Cloud Pak for Data

To maintain control over access to an Execution Engine for Apache Hadoop service, the Hadoop admin should maintain a list of known Cloud Pak for Data clusters that can access the service. A Cloud Pak for Data cluster is known by its URL, which is passed in when you add it to, refresh it in, or delete it from the known list. A comma-separated list of Cloud Pak for Data clusters can be passed for each of these operations (see the example after the following steps).

Adding Cloud Pak for Data clusters

To add Cloud Pak for Data clusters:

  1. You must add a Cloud Pak for Data cluster URL to an Execution Engine for Apache Hadoop installation in order to connect a Hadoop cluster to Cloud Pak for Data. Run the following script on the Hadoop cluster where Execution Engine for Apache Hadoop is installed:

    ./manage_known_dsx.py --add "url"
    
  2. After the Cloud Pak for Data cluster is added to the known list maintained by the Execution Engine for Apache Hadoop service, the Cloud Pak for Data admin can register the service on the Cloud Pak for Data cluster. Learn how to register Cloud Pak for Data clusters.
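
For example, to add two Cloud Pak for Data clusters to the known list in a single call, pass their URLs as a comma-separated list (the URLs below are placeholders; use your own cluster URLs):

    cd /opt/ibm/dsxhi/bin
    ./manage_known_dsx.py --add "https://cpd-cluster-1.example.com,https://cpd-cluster-2.example.com"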

Logs

  • The component logs are stored in /var/log/dsxhi, /var/log/livy, and /var/log/livy2.
  • The component PIDs are stored in /var/log/dsxhi, /var/log/livy, /var/log/livy2, and /opt/ibm/dsxhi/gateway/logs/.
  • The log level for the gateway service can be set by editing /opt/ibm/dsxhi/gateway/conf/gateway-log4j.properties and setting the appropriate level in the log4j.logger.org.apache.knox.gateway property.
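
For example, a minimal sketch of checking and raising the gateway log level, assuming the property is already present in the file (DEBUG is one of the standard log4j levels; choose the level that you need):

    # Show the current gateway log level
    grep '^log4j.logger.org.apache.knox.gateway' /opt/ibm/dsxhi/gateway/conf/gateway-log4j.properties

    # Edit the file and set the level, for example:
    #   log4j.logger.org.apache.knox.gateway=DEBUG
    vi /opt/ibm/dsxhi/gateway/conf/gateway-log4j.properties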

Configuring the Hadoop integration with Cloud Pak for Data

When you're working with Hadoop clusters, the Hadoop admin might want to apply extra configurations to the resources that a user application requests from the Hadoop system, such as the YARN queue or the Spark application size. These configurations give the Hadoop admin more control over jobs that are submitted through Execution Engine for Apache Hadoop.

After the additional configurations are defined, a list of parameters and keys is available, and data scientists can update the settings from the Hadoop environment.

Important:

If a Hadoop cluster is already registered in Watson Studio, the Watson Studio administrator must navigate to that existing Hadoop registration and click Refresh to pick up the configuration changes that the Hadoop admin made.

Setting up a list of Spark configuration parameters

To implement the additional configurations, you must modify the yarnJobParams*.json file. The files include:

  • Cloudera System: /opt/ibm/dsxhi/conf/yarnJobParamsCDH.json
  • Hortonworks System: /opt/ibm/dsxhi/conf/yarnJobParamsHDP.json

To modify the file:

  1. Determine which file to modify based on your Hadoop cluster.

  2. Back up the file before you modify it (a combined sketch of steps 2 through 5 follows these steps).

  3. Modify the file. See these details and examples for more information.

  4. Save the file.

  5. Verify that the content is still valid JSON by using the following command:

    Cloudera System: Run cat yarnJobParamsCDH.json | python -m json.tool

    Hortonworks System: Run cat yarnJobParamsHDP.json | python -m json.tool

    Confirm that the command returns a JSON-formatted object.

  6. If the Hadoop cluster is already registered with a Cloud Pak for Data cluster, ask your Cloud Pak for Data admin to click Refresh on the registration details page so that the new configurations are retrieved.
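
The following sketch pulls steps 2 through 5 together for a Cloudera system; the Hortonworks flow is the same with yarnJobParamsHDP.json:

    cd /opt/ibm/dsxhi/conf
    cp yarnJobParamsCDH.json yarnJobParamsCDH.json.bak    # step 2: back up the file
    vi yarnJobParamsCDH.json                              # steps 3 and 4: modify and save the file
    cat yarnJobParamsCDH.json | python -m json.tool       # step 5: confirm the file is still valid JSON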

Details on the content of the JSON files

The file contains three main sections:

  • scriptLanguages: These are options that you can use to enable R scripts to run on your Hadoop system. R must be enabled on your Hadoop system to use this feature. Confirm that R is installed on all of your Hadoop nodes, and then edit this JSON file as follows:

        "scriptLanguages": [{
                        "language": "R",
                        "version": "system",
                        "enabled": "true"
                }
        ],
  • jobOptions: These options are tied to the UI and should not be removed, but their values and bounds can be modified. The description of each entry explains what it is used for. Some of these options apply only to Watson Studio Local 1.2.3.x and are labeled 1.2.3.x; options labeled yarn are used by Cloud Pak for Data 3.0.1 and can be customized by a Hadoop admin.

  • extraOptions: These are extra Spark options that the Hadoop admin can set, with or without a default value. If value is specified, that value is always used when Cloud Pak for Data issues a call to create a JEG session or a Livy connection through Data Refinery. When these sessions are created, options are translated to --conf option=value for JEG, and spark_config$option <- value for the Livy connection.

If a specific Spark option is not listed, it can be added as an entry. Review the Spark configuration options for the specific version of Spark 2.x that is running on the Hadoop system.

Example: Providing a list of available queues: default, long_running, short

                {
                        "description": "The YARN queue against which the job should be submitted",
                        "displayName": "List of available YARN queues",
                        "name": "spark.yarn.queue",
                        "type": "enum",
                        "value": "[\"default\", \"long_running\", \"short\"]",
                        "labels": [ "spark" ]
                }

Example: Increase the upper bound of available driver memory to 8196 MB (allowing the user to request more memory)

                {
                        "description": "Available driver memory",
                        "displayName": "Driver memory (in MB)",
                        "max": "8196",
                        "min": "1024",
                        "name": "spark.driver.memory",
                        "type": "int",
                        "labels": [ "spark" ]
                }

Example: If a user needs to tune spark.memory.fraction, add this entry:

                {
                        "name": "spark.memory.fraction",
                        "type": "float",
                        "min" : "0.1",
                        "max" : "0.9999",
                        "value": ""
                }

If the admin determines that the default spark.memory.fraction should always be 0.9, set it in the value field:

                {
                        "name": "spark.memory.fraction",
                        "type": "float",
                        "min" : "0.1",
                        "max" : "0.9999",
                        "value": "0.9"
                }
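
With the value field set to 0.9, every JEG session or Livy connection that Cloud Pak for Data creates carries this setting, translated as described in extraOptions (for example, --conf spark.memory.fraction=0.9 for a JEG session).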