Preparing the batch processing environment on the Hadoop Ecosystem

To configure Watson OpenScale to work in batch mode, ensure that you have all of the required services and components. You must enable the Watson OpenScale Spark Manager Application.

Requirements for remote Hadoop implementation

There are specific requirements for running batch processing on a remote Hadoop cluster. In addition to the usual requirements for running Watson OpenScale, batch processing requires the following assets:

  • An Apache Hive database (Apache Hive 2.3.7 is required)
  • Apache Hadoop 2.10.0 with Apache Spark 2.4.7
  • A docker engine to build and package the Spark Manager Application
  • A specialized notebook that you run iteratively in tandem with Watson OpenScale
  • A feedback table, a training data table, and a payload logging table that you create in the Hive database
  • Livy and WebHDFS available on the Hadoop Ecosystem (requires Apache Hadoop 2.10.0)
  • A set of users who can access this application via Basic Authentication
  • A Python archive containing modules required by Watson OpenScale
  • A drifted_transaction table that stores the transactions that are used for post-processing analysis
  • Optionally, a Kerberized Hadoop Ecosystem for running the application

Step 1: Ensure Livy and WebHDFS are available on the Hadoop Ecosystem

Because the Spark Manager Application uses Livy to submit jobs and WebHDFS to access files written by Watson OpenScale, both Livy and WebHDFS are required. If they aren’t installed, you must request them. Your Hadoop cluster administrator can provide the base URLs for Livy, WebHDFS, and the HDFS file base. You must have that information before completing the remaining steps.

Example of WebHDFS URL: http://$hostname:50070

Example of HDFS file base URL: hdfs://$hostname:9000

Example of Livy URL: http://$hostname:8998
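
Before you continue, you can confirm that both services respond. The following commands are only a sketch that assumes the example URLs above; substitute the hostnames and ports that your administrator provides:

    # WebHDFS: list the HDFS root directory through the REST API (hostname is a placeholder).
    curl "http://$hostname:50070/webhdfs/v1/?op=LISTSTATUS"

    # Livy: list the current sessions; an empty result still confirms that Livy is reachable.
    curl "http://$hostname:8998/sessions"

On a Kerberized cluster (see Step 2), these endpoints typically require SPNEGO authentication, for example by adding --negotiate -u : to the curl commands after you obtain a Kerberos ticket.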

Optionally, request the following items from the Hadoop cluster administrator:

  • A base path on the HDFS file system to which Watson OpenScale can specifically write files (a write check for this path follows this list)
  • A path to the yarn.keytab file, which is specified as the conf.spark.yarn.keytab parameter in the job request payload
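
If the administrator assigns a dedicated base path, you can verify that you are able to write to it before you continue. This check is only a sketch; the /sw/openscale path is an assumption that mirrors the BASE_HDFS_LOCATION value used later in this document, so substitute the path that you are given:

    # Create the base path if it does not exist, write a test file, and remove it again.
    hdfs dfs -mkdir -p /sw/openscale
    hdfs dfs -touchz /sw/openscale/wos_write_test
    hdfs dfs -rm /sw/openscale/wos_write_test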

Step 2: Using a Kerberized Hadoop Ecosystem

Complete this step only if you are running your application in a Kerberos-enabled Hadoop Ecosystem.

To communicate with a Kerberized Hadoop Ecosystem, the application must generate a Kerberos ticket before it can submit requests to Livy or WebHDFS. To generate a Kerberos ticket, you need the following files and ID:

A keytab file
To authenticate to the KDC, Kerberos-enabled machines require a keytab file. The keytab file is an encrypted, local, on-disk copy of the host key. Your Hadoop cluster administrator must provide this keytab file. Request the hdfs.keytab file, which is required for subsequent steps.
A krb5.conf file
The krb5.conf file contains Kerberos configuration information, such as the locations of KDCs and admin servers for the Kerberos realms of interest, defaults for the current realm and for Kerberos applications, and mappings of hostnames onto Kerberos realms. Your Hadoop cluster administrator must provide this configuration file. Request the krb5.conf file, which is required for subsequent steps.
A Kerberos user principal
A Kerberos principal represents a unique identity, such as openscale_user1/$hostname@HADOOPCLUSTER.LOCAL, to which Kerberos can assign tickets for access to Kerberos-aware services. Principal names are made up of several components separated by a forward slash ( / ). You can also specify a realm as the last component of the name by using the at symbol ( @ ). Your Hadoop cluster administrator must provide the principal ID. Request the principal ID, which is required for subsequent steps. If you need more than one user to be able to submit jobs, request additional principal IDs.
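
With these three items you can perform a quick sanity check on any machine that has the Kerberos client tools installed. The file paths below are placeholders; the principal mirrors the hdfs principal that is used in the kinit example in Step 5:

    # Point the Kerberos client at the configuration file from your administrator (assumed path).
    export KRB5_CONFIG=/path/to/krb5.conf

    # Obtain a ticket with the keytab and principal, then list the ticket cache to confirm it was granted.
    kinit -kt /path/to/hdfs.keytab hdfs/$hostname@HADOOPCLUSTER.LOCAL
    klist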

Step 3: Create an allowlist of users who can access this application via Basic Authentication

You must create an allowlist of usernames and passwords that can access the Spark Manager Application via basic authentication. Specify the allowlist in an auth.json file; the expected format of this file is shown in Step 6.

Step 4: Create a Python archive of modules that are required by Watson OpenScale

Watson OpenScale jobs that run on the Hadoop Ecosystem require several dependent Python packages. Without these packages, the jobs fail. Complete the following steps to install these dependencies and upload them to a location in HDFS.

  1. Log in to a Linux operating system where Python is installed and check the version of Python by running the following command:

    python --version
    Python 3.7.9
    
  2. Install the python3-devel package, which installs the GCC libraries, by running the following command:

    yum install python3-devel
    
  3. Navigate to the directory where you want to create the Python virtual environment, such as the /opt folder:

    cd /opt

  4. Delete any previously created virtual environment by running the following commands. (In the following example, the wos_env folder contains a Python virtual environment with Watson OpenScale Spark Job dependencies in it.)

    rm -fr wos_env
    rm -fr wos_env.zip
    
  5. Create a virtual environment. In the following example, the name of the environment is wos_env. After you create it, source the virtual environment.

    python -m venv wos_env
    source wos_env/bin/activate
    
  6. Upgrade pip by running the following command:

    pip install --upgrade pip

  7. To install all the dependencies, choose whether to install them individually or all at once from a requirements.txt file. In either case, the packages must be installed in the following order.

    • If you’re using Python 3.7, run the following commands, in the order shown, to install the required packages one at a time:

       python -m pip install numpy==1.20.2
       python -m pip install scipy==1.6.3
       python -m pip install pandas==1.2.4
       python -m pip install scikit-learn==0.24.2
       python -m pip install osqp==0.6.1
       python -m pip install cvxpy==1.0.25
       python -m pip install marshmallow==3.11.1
       python -m pip install requests==2.25.1
       python -m pip install jenkspy==0.2.0
       python -m pip install pyparsing==2.4.7
       python -m pip install tqdm==4.60.0
       python -m pip install more_itertools==8.7.0
       python -m pip install tabulate==0.8.9
       python -m pip install py4j==0.10.9.2
       python -m pip install pyarrow==4.0.0
       python -m pip install "ibm-wos-utils>4.0.0"
      
    • If you’re using Python 3.6, run the following commands, in the order shown, to install the required packages one at a time:

       python -m pip install numpy==1.19.5
       python -m pip install scipy==1.5.4
       python -m pip install pandas==1.1.5
       python -m pip install scikit-learn==0.24.2
       python -m pip install osqp==0.6.1
       python -m pip install cvxpy==1.0.25
       python -m pip install marshmallow==3.11.1
       python -m pip install requests==2.25.1
       python -m pip install jenkspy==0.2.0
       python -m pip install pyparsing==2.4.7
       python -m pip install tqdm==4.60.0
       python -m pip install more_itertools==8.7.0
       python -m pip install tabulate==0.8.9
       python -m pip install py4j==0.10.9.2
       python -m pip install pyarrow==4.0.0
       python -m pip install "ibm-wos-utils>4.0.0"
      

      Rather than run each command one by one, you can put all modules into a requirements.txt file and run the command just once.

    • If you’re using Python 3.7, create the requirements.txt file by adding the following lines to the file. Then, run the python -m pip install -r requirements.txt command:

       numpy==1.20.2
       scipy==1.6.3
       pandas==1.2.4
       scikit-learn==0.24.2
       osqp==0.6.1
       cvxpy==1.0.25
       marshmallow==3.11.1
       requests==2.25.1
       jenkspy==0.2.0
       pyparsing==2.4.7
       tqdm==4.60.0
       more_itertools==8.7.0
       tabulate==0.8.9
       py4j==0.10.9.2
       pyarrow==4.0.0
       ibm-wos-utils>4.0.0
      
    • If you’re using Python 3.6, create the requirements.txt file by adding the following lines to the file. Next, run the python -m pip install numpy==1.19.5 command and then run the python -m pip install -r requirements.txt command:

       scipy==1.5.4
       pandas==1.1.5
       scikit-learn==0.24.2
       osqp==0.6.1
       cvxpy==1.0.25
       marshmallow==3.11.1
       requests==2.25.1
       jenkspy==0.2.0
       pyparsing==2.4.7
       tqdm==4.60.0
       more_itertools==8.7.0
       tabulate==0.8.9
       py4j==0.10.9.2
       pyarrow==4.0.0
       ibm-wos-utils>4.0.0
      
  8. Deactivate the virtual environment by running the deactivate command. Optionally, verify the environment as shown after these steps.
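
Before you compress the environment in Step 5, you can confirm that the key packages resolve inside the virtual environment. The following check is only a sketch and assumes that the environment was created in /opt/wos_env:

    # Run the interpreter that is bundled with the virtual environment; activation is not required.
    /opt/wos_env/bin/python -c "import numpy, scipy, pandas, sklearn, pyarrow; print('imports ok')"

    # Confirm that the Watson OpenScale utilities package is present.
    /opt/wos_env/bin/pip list | grep -i ibm-wos-utils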

Step 5: Upload the archive

You must compress the virtual environment and upload the archive file to the HDFS file system.

  1. Create a zip file of the virtual environment by running the following commands:

    zip -r wos_env.zip wos_env/
    ls -alt --block-size=M wos_env.zip 
    
  2. Authenticate to HDFS, if needed, by running the following command:

    kinit -kt /home/hadoop/keytabs/hdfs.keytab hdfs/$HOST@$REALM
    

    Example: kinit -kt /home/hadoop/keytabs/hdfs.keytab hdfs/lamy1.fyre.server.com@HADOOPCLUSTER.LOCAL

  3. In HDFS, create a folder named py_packages and put the wos_env.zip file into that folder by running the following commands:

    hdfs dfs -mkdir /py_packages
    hdfs dfs -put -f wos_env.zip /py_packages
    
  4. Verify that the file was added to HDFS by running the following command:

    hdfs dfs -ls /py_packages
    

    Expected output:

    Found 1 items
    -rw-r--r--   2 hdfs supergroup  303438930 2020-10-07 16:23 /py_packages/wos_env.zip
    

Step 6: Package the application into a docker image

After all the prerequisites are met, you must package the application into a docker image. You must clone the Spark Manager API for IBM Watson OpenScale GitHub repository, which contains the Spark manager reference application that Watson OpenScale requires to run batch processes.

  1. Clone the Spark Manager API for IBM Watson OpenScale GitHub repository to the wos-spark-manager-api folder.
  2. Navigate to the wos-spark-manager-api folder.
  3. Update the service/security/auth.json file with the allowlist of users that you created. The file should be similar to the following template:

    {
      "allowlisted_users": {
        "openscale_user1": "passw0rd",
        "openscale_user2": "passw0rd",
        "openscale_user3": "passw0rd",
        "openscale": "passw0rd"
      }
    }
    
  4. If you are packaging this application to run on a Kerberized Hadoop cluster, you must perform the following additional steps:

    1. Place the hdfs.keytab and krb5.conf files in the root of the wos-spark-manager-api folder.

    2. Update the service/security/auth.json file with the user_kerberos_mapping mapping. Use the Kerberos principals that you requested from the cluster administrator. The file should be similar to the following template:

      {
        "allowlisted_users": {
          "openscale_user1": "passw0rd",
          "openscale_user2": "passw0rd",
          "openscale_user3": "passw0rd",
          "openscale": "passw0rd"
        },
        "user_kerberos_mapping" : {
          "openscale_user1": "openscale_user1/$hostname@HADOOPCLUSTER.LOCAL",
          "openscale_user2": "openscale_user2/$hostname@HADOOPCLUSTER.LOCAL",
          "openscale_user3": "openscale_user3/$hostname@HADOOPCLUSTER.LOCAL"
        }
      }
      
    3. In the payload/Dockerfile file, uncomment lines 53-55.

  5. Run the following command to build the docker image:

    docker build -t <repository/image_name>:<version> -f payload/Dockerfile .
    

    Example command:

    docker build -t name/wos-spark-manager-api:latest -f payload/Dockerfile . 
    
  6. If you run the preceding steps in a virtual machine (VM) other than the one where you plan to host the Spark Manager Application in a docker container, perform the following steps to move the docker image to the other VM:

    1. Save the docker image as a tarball by running the following command:

      docker save myimage:latest | gzip > myimage_latest.tar.gz
      

      Example command:

      docker save name/wos-spark-manager-api:latest | gzip > wos-spark-manager-api-latest.tar.gz
      
    2. SCP the tarball to the other remote VM.
    3. Load the tarball by running the following command:

      docker load < myimage_latest.tar.gz
      

      Example command:

      docker load < /opt/wos-spark-manager-api-latest.tar.gz
      

Step 7: Run the application

  1. Log in to the virtual machine (VM) where you are going to run the application. Make sure that the VM has docker installed.

    If this VM is different from the one where the docker image was built, you must save and load the docker image onto this VM, as described in the previous step.

  2. Compose the environment file that is used to start the docker container. You must have all the required values to compose the docker.env file. Refer to the preceding prerequisites.

    1. Create a file named docker.env.
    2. Add the following entries to the docker.env file, adjusting the hostnames, paths, and the Python version in the WOS_ENV_SITE_PACKAGES_PATH value to match your environment:

      WEB_HDFS_URL=http://lamy1.fyre.companyserver.com:50070
      HDFS_FILE_BASE_URL=hdfs://lamy1.fyre.companyserver.com:9000
      SPARK_LIVY_URL=http://sheaffer1.fyre.companyserver.com:8998
      BASE_HDFS_LOCATION=sw/openscale
      WOS_ENV_ARCHIVE_LOCATION=hdfs://lamy1.fyre.companyserver.com:9000/py_packages/wos_env.zip#wos_env
      WOS_ENV_SITE_PACKAGES_PATH=./wos_env/wos_env/lib/python3.6/site-packages:
      
    3. If the application runs against a Kerberized Hadoop cluster, add the following additional entries to the docker.env file:

      KERBEROS_ENABLED=true
      HDFS_KEYTAB_FILE_PATH=/opt/ibm/wos/python/keytabs/hdfs.keytab
      SPARK_YARN_KEYTAB_FILE_PATH=/home/hadoop/hadoop/etc/hadoop/yarn.keytab
      
  3. List the docker images and obtain the $IMAGE_ID value by running the following command:

    docker images
    
  4. Start the docker container by running the following command:

    docker run --env-file docker.env -p 5000:9443 --name wos-spark-manager-api $IMAGE_ID
    

The APIs are now accessible at the following URL: http://<VM-HOST-NAME>:5000. The Swagger documentation is available at the following URL: http://<VM-HOST-NAME>:5000/spark_wrapper/api/explorer.
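
As a quick smoke test, you can call the service with one of the allowlisted users. The following command is only a sketch; the hostname, username, and password are placeholders taken from the earlier auth.json example, and whether a particular path requires authentication depends on how you configured the application:

    # Request the Swagger documentation page through the published port, using basic authentication.
    curl -u openscale_user1:passw0rd "http://<VM-HOST-NAME>:5000/spark_wrapper/api/explorer"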

Next steps

You are now ready to configure the batch processor. For more information, see Configuring the batch processor.