Preparing the batch processing environment on the Hadoop Ecosystem

To configure batch processing on the Hadoop Ecosystem, you must prepare the deployment environment.

Prerequisites

You must have the following artifacts to configure batch processing on the Hadoop Ecosystem:

  • An Apache Hive database, version 2.3.7 or later
  • Apache Hadoop 2.10.0 with Apache Spark 2.4.7
  • A docker engine to build and package the Spark Manager Application
  • A specialized notebook that you run for model evaluations
  • A feedback table, training data table, and a payload logging table that you create in the Hive database
  • Livy and WebHDFS, which must be available on the Hadoop Ecosystem (requires Apache Hadoop 2.10.0)
  • A set of users who can access this application through Basic Authentication
  • A Python archive that contains modules that are required for model evaluation
  • A drifted_transaction table that stores the transactions that are used for post-processing analysis
  • Optionally, a Kerberized Hadoop Ecosystem for running the application

Step 1: Ensure Livy and WebHDFS are available on the Hadoop Ecosystem

Because the Spark Manager Application uses Livy to submit jobs and WebHDFS to access files, Livy and WebHDFS must be installed on the Hadoop Ecosystem. If they aren't installed, request them from your Hadoop cluster administrator. The administrator can also provide the base URLs for Livy, WebHDFS, and the HDFS file base. You must have these URLs to prepare the deployment environment. The following examples show base URLs for Livy, WebHDFS, and the HDFS file base:

  • WebHDFS URL: http://$hostname:50070

  • HDFS file base URL: hdfs://$hostname:9000

  • Livy URL: http://$hostname:8998
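
To confirm that Livy and WebHDFS respond at these endpoints, you can issue simple REST calls. The following checks are a sketch that assumes the example URLs in this list; replace $hostname with your cluster host:

  # lists the HDFS root directory through WebHDFS
  curl "http://$hostname:50070/webhdfs/v1/?op=LISTSTATUS"

  # returns the active Livy sessions
  curl "http://$hostname:8998/sessions"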

You can also request the following items from the Hadoop cluster administrator:

  • A base path on the HDFS file system that you can write files to
  • A path to the yarn.keytab file, which is specified as the conf.spark.yarn.keytab parameter in the job request payload

Step 2: Using a Kerberized Hadoop Ecosystem

If you run your application in a Kerberized Hadoop Ecosystem, the application must generate a Kerberos ticket before it can submit requests to Livy or WebHDFS. To generate a Kerberos ticket, you need the following files and ID:

  • A keytab file

  • A krb5.conf file

  • A Kerberos user principal

To authenticate to the Key Distribution Center (KDC), Kerberos-enabled machines require a keytab file. The keytab file is an encrypted, local, on-disk copy of the host key. Your Hadoop cluster administrator must provide this keytab file. You must request the hdfs.keytab file to prepare your deployment environment.

The krb5.conf file contains Kerberos configuration information, such as the locations of KDCs and admin servers for Kerberos realms. The file also contains defaults for the current realm and for Kerberos applications, and mappings of hostnames onto Kerberos realms. Your Hadoop cluster administrator must provide this configuration file. You must request the krb5.conf file to prepare your deployment environment.

A Kerberos principal represents a unique identity, such as openscale_user1/$hostname@HADOOPCLUSTER.LOCAL, to which Kerberos can assign tickets for access to Kerberos-aware services. Principal names contain several components that are separated by a forward slash ( / ). You can also specify a realm as the last component of the name by using the @ symbol. Your Hadoop cluster administrator must provide the principal ID. You must request the principal ID to prepare your deployment environment. If multiple users need to submit jobs, you must request more principal IDs.
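
The following commands are a sketch of how a host with these artifacts obtains a ticket; the file locations and the principal name are examples only and depend on what your administrator provides:

  # point the Kerberos tools at the provided configuration file
  export KRB5_CONFIG=/path/to/krb5.conf

  # obtain a ticket for the provided principal by using the keytab file
  kinit -kt /path/to/hdfs.keytab openscale_user1/$hostname@HADOOPCLUSTER.LOCAL

  # confirm that the ticket was granted
  klist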

Step 3: Create an allowlist of users who can access this application through Basic Authentication

You must create an allowlist of usernames and passwords that can access the Spark Manager Application through basic authentication. This allowlist of usernames and passwords must be specified in an auth.json file.
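
The auth.json file uses the same JSON structure as the template that is shown in Step 6. A minimal sketch with placeholder credentials follows:

  {
    "allowlisted_users": {
      "openscale_user1": "passw0rd"
    }
  }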

Step 4: Create a Python archive of modules that are required for model evaluation

Model evaluations that run on the Hadoop Ecosystem require dependent Python packages. Without these packages, the evaluations fail. You can use the following steps to install these dependencies and upload them to a location in the Hadoop Distributed File System (HDFS).

  1. Log in to a Linux operating system where Python is installed and check the Python version by running the following command:

    python --version
    Python 3.7.9
    
  2. Install the python3-devel package, which installs the GNU Compiler Collection (GCC) libraries, by running the following command:

    yum install python3-devel
    
  3. Navigate to the directory where you want to create the Python virtual environment, such as the /opt folder:

    cd /opt

  4. Delete any previously created virtual environment by running the following commands.

    rm -fr wos_env
    rm -fr wos_env.zip
    

    The wos_env folder contains a Python virtual environment with Spark Job dependencies in it.

  5. Create and activate the virtual environment by running the following commands.

    python -m venv wos_env
    source wos_env/bin/activate
    

    The wos_env virtual environment is created and the source command activates the virtual environment.

  6. Upgrade pip by running the following command:

    pip install --upgrade pip
    
  7. To install all the dependencies, choose whether to install them individually or all at once from a requirements.txt file. They must be installed in the following order.

    • If you're using Python 3.7, run the following commands to install the required packages individually:

      python -m pip install numpy==1.20.2
      python -m pip install scipy==1.6.3
      python -m pip install pandas==1.2.4
      python -m pip install scikit-learn==0.24.2
      python -m pip install osqp==0.6.1
      python -m pip install cvxpy==1.0.25
      python -m pip install marshmallow==3.11.1
      python -m pip install requests==2.25.1
      python -m pip install jenkspy==0.2.0
      python -m pip install pyparsing==2.4.7
      python -m pip install tqdm==4.60.0
      python -m pip install more_itertools==8.7.0
      python -m pip install tabulate==0.8.9
      python -m pip install py4j==0.10.9.2
      python -m pip install pyarrow==4.0.0
      python -m pip install "ibm-wos-utils==4.6.*"
      

    You can put all of the modules into a requirements.txt file and install them all with a single command.

    • If you're using Python 3.7, create the requirements.txt file by adding the following lines to the file. Then, run the python -m pip install -r requirements.txt command:

      numpy==1.20.2
      scipy==1.6.3
      pandas==1.2.4
      scikit-learn==0.24.2
      osqp==0.6.1
      cvxpy==1.0.25
      marshmallow==3.11.1
      requests==2.25.1
      jenkspy==0.2.0
      pyparsing==2.4.7
      tqdm==4.60.0
      more_itertools==8.7.0
      tabulate==0.8.9
      py4j==0.10.9.2
      pyarrow==4.0.0
      ibm-wos-utils==4.6.*
      
  8. Deactivate the virtual environment by running the deactivate command.
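
Optionally, before you move on to Step 5, you can reactivate the environment and import a few of the installed packages to catch a missing dependency early. This check is a sketch and is not part of the required procedure:

  source wos_env/bin/activate
  # quick import test of some of the installed packages
  python -c "import numpy, scipy, pandas, sklearn; print('dependencies look OK')"
  deactivate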

Step 5: Upload the archive

You must compress the virtual environment and upload the archive file to HDFS.

  1. Create a compressed file consisting of the virtual environment by running the following commands:

    zip -r wos_env.zip wos_env/
    ls -alt --block-size=M wos_env.zip
    
  2. Authenticate to HDFS by running the following command:

    kinit -kt /home/hadoop/keytabs/hdfs.keytab hdfs/$HOST@$REALM
    
  3. In HDFS, create a py_packages folder and add the wos_env.zip file to that folder by running the following commands:

    hdfs dfs -mkdir /py_packages
    hdfs dfs -put -f wos_env.zip /py_packages
    
  4. Verify that you successfully added the file to HDFS by running the following command:

    hdfs dfs -ls /py_packages
    

    The following output displays when you successfully add the file:

    Found 1 items
    -rw-r--r--   2 hdfs supergroup  303438930 2020-10-07 16:23 /py_packages/wos_env.zip
    
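If you prefer to verify over REST instead of the hdfs command, WebHDFS can list the same folder. This call is a sketch that assumes the WebHDFS base URL from Step 1:

  # list /py_packages through WebHDFS; on a Kerberized cluster add: --negotiate -u :
  curl "http://$hostname:50070/webhdfs/v1/py_packages?op=LISTSTATUS"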

Step 6: Package the application into a docker image

To package the application into a docker image, clone the GitHub repository that contains the Spark manager reference application for batch processing.

  1. Clone the GitHub repository to the wos-spark-manager-api folder.

  2. Navigate to the wos-spark-manager-api folder.

  3. Update the service/security/auth.json file with the allowlist of users that you created. The updated file looks similar to the following template:

    {
      "allowlisted_users": {
        "openscale_user1": "passw0rd",
        "openscale_user2": "passw0rd",
        "openscale_user3": "passw0rd",
        "openscale": "passw0rd"
      }
    }
    
  4. If you are packaging this application to run on a Kerberized Hadoop cluster, you must also complete the following steps:

    1. Place the hdfs.keytab and krb5.conf files in the root of the wos-spark-manager-api folder.

    2. Update the service/security/auth.json file with a user_kerberos_mapping section that contains the Kerberos principals that the administrator provided to you. The updated file looks similar to the following template:

      {
        "allowlisted_users": {
          "openscale_user1": "passw0rd",
          "openscale_user2": "passw0rd",
          "openscale_user3": "passw0rd",
          "openscale": "passw0rd"
        },
        "user_kerberos_mapping" : {
          "openscale_user1": "openscale_user1/$hostname@HADOOPCLUSTER.LOCAL",
          "openscale_user2": "openscale_user2/$hostname@HADOOPCLUSTER.LOCAL",
          "openscale_user3": "openscale_user3/$hostname@HADOOPCLUSTER.LOCAL"
        }
      }
      
    3. In the payload/Dockerfile file, uncomment lines 53-55.

  5. Run the following command to build the docker image:

    docker build -t <repository/image_name>:<version> -f payload/Dockerfile .
    
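    For example, with a hypothetical repository name and version, the resolved command might look like this:

    docker build -t myrepo/wos-spark-manager-api:1.0.0 -f payload/Dockerfile .
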
  6. If you run your application on a different virtual machine (VM) from the one where you built the docker image, complete the following steps to move the docker image to that VM:

    1. Save the docker image as a tarball by running the following command:

      docker save myimage:latest | gzip > myimage_latest.tar.gz
      
    2. Copy the tarball to the remote VM by using scp.

    3. Load the tarball by running the following command:

      docker load < myimage_latest.tar.gz
      

Step 7: Run the application

  1. Log in to the virtual machine (VM) where you plan to run the application. Verify that docker is installed on the VM.

    If this VM is different from the one where the docker image was built, you must save the docker image and load it onto this VM.

  2. Compose the environment file that is used to start the docker container. You must have all the required values to compose the docker.env file.

    1. Create a docker.env file.

    2. Add the following entries in the docker.env file:

      WEB_HDFS_URL=http://lamy1.fyre.companyserver.com:50070
      HDFS_FILE_BASE_URL=hdfs://lamy1.fyre.companyserver.com:9000
      SPARK_LIVY_URL=http://sheaffer1.fyre.companyserver.com:8998
      BASE_HDFS_LOCATION=sw/openscale
      WOS_ENV_ARCHIVE_LOCATION=hdfs://lamy1.fyre.companyserver.com:9000/py_packages/wos_env.zip#wos_env
      WOS_ENV_SITE_PACKAGES_PATH=./wos_env/wos_env/lib/python3.7/site-packages:
      
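      The #wos_env suffix in WOS_ENV_ARCHIVE_LOCATION is the alias name under which Spark on YARN typically extracts the archive on the cluster, which is why WOS_ENV_SITE_PACKAGES_PATH begins with ./wos_env/.
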
    3. If the application runs against a Kerberized Hadoop cluster, add the following additional entries to the docker.env file:

      KERBEROS_ENABLED=true
      HDFS_KEYTAB_FILE_PATH=/opt/ibm/wos/python/keytabs/hdfs.keytab
      SPARK_YARN_KEYTAB_FILE_PATH=/home/hadoop/hadoop/etc/hadoop/yarn.keytab
      
  3. List the docker images and obtain the $IMAGE_ID value by running the following command:

    docker images
    
  4. Start the docker container by running the following command:

    docker run --env-file docker.env -p 5000:9443 --name wos-spark-manager-api $IMAGE_ID
    

You can now access the APIs at the following URL: http://<VM-HOST-NAME>:5000. You can also view the Swagger documentation at the following URL: http://<VM-HOST-NAME>:5000/spark_wrapper/api/explorer.
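
As a quick smoke test, you can request the explorer page with one of the allowlisted users. The credentials in this sketch are the placeholder values from the auth.json template:

  curl -u openscale_user1:passw0rd http://<VM-HOST-NAME>:5000/spark_wrapper/api/explorer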

Next steps

Configure batch processing in Watson OpenScale

Parent topic: Batch processing in Watson OpenScale