Preparing the batch processing environment on the Hadoop Ecosystem
To configure batch processing on the Hadoop Ecosystem, you must prepare the deployment environment.
Prerequisites
You must have the following artifacts to configure batch processing on the Hadoop Ecosystem:
- An Apache Hive database, version 2.3.7 or later
- Apache Hadoop 2.10.0 with Apache Spark 2.4.7
- A docker engine to build and package the Spark Manager Application
- A specialized notebook that you run together with model evaluations
- A feedback table, a training data table, and a payload logging table that you create in the Hive database
- Livy and WebHDFS, which must be available on the Hadoop Ecosystem (requires Apache Hadoop 2.10.0)
- A set of users who can access this application through Basic Authentication
- A Python archive that contains modules that are required for model evaluation
- A drifted_transaction table that stores the transactions that are used for post-processing analysis
- Optionally, a Kerberized Hadoop Ecosystem for running the application
Step 1: Ensure Livy and WebHDFS are available on the Hadoop Ecosystem
Because the Spark Manager Application uses Livy to submit jobs and WebHDFS to access files, Livy and WebHDFS must be available on the Hadoop Ecosystem. If they aren't installed, you must request that they be installed. Your Hadoop cluster administrator can provide the base URLs for Livy, WebHDFS, and the HDFS file base. You must have these URLs to prepare the deployment environment. The following examples show base URLs for WebHDFS, the HDFS file base, and Livy:
- WebHDFS URL: http://$hostname:50070
- HDFS file base URL: hdfs://$hostname:9000
- Livy URL: http://$hostname:8998
You can also request the following items from the Hadoop cluster administrator:
- A base path on the HDFS file system that you can write files to
- A path to the yarn.keytab file, which is specified as the conf.spark.yarn.keytab parameter in the job request payload
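To confirm that these endpoints are reachable before you continue, you can query them directly. The following commands are a minimal sketch that uses the example base URLs from this step; on a Kerberized cluster, the requests also need a Kerberos ticket and the curl --negotiate -u : options:
  # List the root of the HDFS file system through WebHDFS
  curl "http://$hostname:50070/webhdfs/v1/?op=LISTSTATUS"
  # List the active Livy sessions
  curl "http://$hostname:8998/sessions"
Both requests return a JSON response when the services are available.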
Step 2: Using a Kerberized Hadoop Ecosystem
If you run your application in a Kerberized Hadoop Ecosystem, the application must generate a Kerberos ticket before it can submit requests to Livy or WebHDFS. To generate a Kerberos ticket, you need the following files and ID:
- A keytab file
- A krb5.conf file
- A Kerberos user principal
To authenticate to the Key Distribution Center (KDC), Kerberos-enabled machines require a keytab file. The keytab file is an encrypted, local, on-disk copy of the host key. Your Hadoop cluster administrator must provide this keytab file. You
must request the hdfs.keytab file to prepare your deployment environment.
The krb5.conf file contains Kerberos configuration information, such as the locations of KDCs and admin servers for Kerberos realms. The file also contains defaults for the current realm and for Kerberos applications and mappings
of hostnames onto Kerberos realms. Your Hadoop cluster administrator must provide this configuration file. You must request the krb5.conf file to prepare your deployment environment.
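For reference, a minimal krb5.conf file looks similar to the following sketch. The realm matches the example principal that is used in this topic, and the KDC host name is a placeholder that your Hadoop cluster administrator replaces with the real value:
  [libdefaults]
      default_realm = HADOOPCLUSTER.LOCAL

  [realms]
      HADOOPCLUSTER.LOCAL = {
          kdc = kdc.hadoopcluster.local
          admin_server = kdc.hadoopcluster.local
      }

  [domain_realm]
      .hadoopcluster.local = HADOOPCLUSTER.LOCAL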
A Kerberos principal represents a unique identity in a Kerberos system, such as openscale_user1/$hostname@HADOOPCLUSTER.LOCAL, to which Kerberos can assign tickets for access to Kerberos-aware services. Principal names contain several components that are separated by a forward slash ( / ). You can also specify a realm as the last component of the name by using the @ symbol. Your Hadoop cluster administrator must provide the principal ID. You must request the principal ID to prepare your deployment environment. If you need multiple users to submit jobs, you must request more principal IDs.
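For example, with these three artifacts in place, you can generate and verify a Kerberos ticket from the command line. The keytab path in this sketch is a placeholder; use the keytab file and principal that your administrator provides:
  # Generate a ticket by using the keytab file and the principal from your administrator
  kinit -kt /path/to/your.keytab openscale_user1/$hostname@HADOOPCLUSTER.LOCAL
  # Verify that the ticket was granted
  klist
The Spark Manager Application performs the equivalent step internally before it submits requests to Livy or WebHDFS.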
Step 3: Create an allowlist of users who can access this application through Basic Authentication
You must create an allowlist of the usernames and passwords that can be used to access the Spark Manager Application through basic authentication. Specify this allowlist in an auth.json file.
Step 4: Create a Python archive of modules that are required for model evaluation
Model evaluations that run on the Hadoop Ecosystem require dependent Python packages. Without these packages, the evaluations fail. Use the following steps to install these dependencies and upload them to a location in the Hadoop Distributed File System (HDFS).
- Log in to a Linux operating system where Python is installed and check the Python version by running the following command:
  python --version
  The output shows the installed version, for example: Python 3.7.9
- Install the python3-devel package, which installs the GNU Compiler Collection (GCC) libraries, by running the following command:
  yum install python3-devel
- Navigate to the directory where the Python virtual environment is created, such as the /opt folder:
  cd /opt
- Delete any previously created virtual environment by running the following commands:
  rm -fr wos_env
  rm -fr wos_env.zip
  The wos_env folder contains a Python virtual environment with Spark job dependencies in it.
- Create and activate the virtual environment by running the following commands:
  python -m venv wos_env
  source wos_env/bin/activate
  The wos_env virtual environment is created and the source command activates it.
- Upgrade the pip environment by running the following command:
  pip install --upgrade pip
- To install all the dependencies, choose whether to install them individually or in a batch from a file. They must be installed in the following order.
  - If you're using Python 3.7, run the following commands to install the required packages individually:
    python -m pip install numpy==1.20.2
    python -m pip install scipy==1.6.3
    python -m pip install pandas==1.2.4
    python -m pip install scikit-learn==0.24.2
    python -m pip install osqp==0.6.1
    python -m pip install cvxpy==1.0.25
    python -m pip install marshmallow==3.11.1
    python -m pip install requests==2.25.1
    python -m pip install jenkspy==0.2.0
    python -m pip install pyparsing==2.4.7
    python -m pip install tqdm==4.60.0
    python -m pip install more_itertools==8.7.0
    python -m pip install tabulate==0.8.9
    python -m pip install py4j==0.10.9.2
    python -m pip install pyarrow==4.0.0
    python -m pip install "ibm-wos-utils==4.6.*"
  - Alternatively, you can put all of the modules into a requirements.txt file and run the commands simultaneously. If you're using Python 3.7, create the requirements.txt file by adding the following lines to the file. Then, run the python -m pip install -r requirements.txt command:
    numpy==1.20.2
    scipy==1.6.3
    pandas==1.2.4
    osqp==0.6.1
    cvxpy==1.0.25
    marshmallow==3.11.1
    requests==2.25.1
    jenkspy==0.2.0
    pyparsing==2.4.7
    tqdm==4.60.0
    more_itertools==8.7.0
    tabulate==0.8.9
    py4j==0.10.9.2
    pyarrow==4.0.0
    ibm-wos-utils==4.6.*
- Deactivate the virtual environment by running the deactivate command.
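Optionally, you can confirm that the packages resolved cleanly before you compress the environment. The following check is a minimal sketch that reactivates the wos_env environment:
  source wos_env/bin/activate
  # Report any broken or conflicting package dependencies
  pip check
  # Confirm that the ibm-wos-utils package is installed at the expected version
  pip list | grep ibm-wos-utils
  deactivate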
Step 5: Upload the archive
You must compress the virtual environment and upload the archive file to HDFS.
- Create a compressed file of the virtual environment by running the following commands:
  zip -r wos_env.zip wos_env/
  ls -alt --block-size=M wos_env.zip
- Authenticate to HDFS by running the following command:
  kinit -kt /home/hadoop/keytabs/hdfs.keytab hdfs/$HOST@$REALM
- In HDFS, create a py_packages folder and add the wos_env.zip file to that folder by running the following commands:
  hdfs dfs -mkdir /py_packages
  hdfs dfs -put -f wos_env.zip /py_packages
- Verify that you successfully added the file to HDFS by running the following command:
  hdfs dfs -ls /py_packages
  The following output displays when you successfully add the file:
  Found 1 items
  -rw-r--r--   2 hdfs supergroup  303438930 2020-10-07 16:23 /py_packages/wos_env.zip
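Because WebHDFS is available, you can also confirm the upload through the WebHDFS REST API. This optional check is a sketch that uses the WebHDFS base URL from Step 1; on a Kerberized cluster, add the --negotiate -u : options:
  curl "http://$hostname:50070/webhdfs/v1/py_packages/wos_env.zip?op=GETFILESTATUS"
The response contains a FileStatus object that includes the length of the wos_env.zip file.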
Step 6: Package the application into a docker image
To package the application into a docker image, clone the GitHub repository that contains the Spark manager reference application for batch processing, and then complete the following steps.
- Clone the GitHub repository to the wos-spark-manager-api folder.
- Navigate to the wos-spark-manager-api folder.
- Update the service/security/auth.json file with the allowlist users that you created. The updated file looks similar to the following template:
  {
    "allowlisted_users": {
      "openscale_user1": "passw0rd",
      "openscale_user2": "passw0rd",
      "openscale_user3": "passw0rd",
      "openscale": "passw0rd"
    }
  }
- If you are packaging this application to run on a Kerberized Hadoop Cluster, you must also complete the following steps:
  - Place the hdfs.keytab and the krb5.conf files in the root of the wos-spark-manager-api folder.
  - Update the service/security/auth.json file with the user_kerberos_mapping mapping that contains the Kerberos principals that are provided to you by the administrator. The file is similar to the following template:
    {
      "allowlisted_users": {
        "openscale_user1": "passw0rd",
        "openscale_user2": "passw0rd",
        "openscale_user3": "passw0rd",
        "openscale": "passw0rd"
      },
      "user_kerberos_mapping" : {
        "openscale_user1": "openscale_user1/$hostname@HADOOPCLUSTER.LOCAL",
        "openscale_user2": "openscale_user2/$hostname@HADOOPCLUSTER.LOCAL",
        "openscale_user3": "openscale_user3/$hostname@HADOOPCLUSTER.LOCAL"
      }
    }
  - In the payload/Dockerfile file, uncomment lines 53-55.
- Run the following command to build the docker image:
  docker build -t <repository/image_name>:<version> -f payload/Dockerfile .
- If you use a new virtual machine (VM) to run your application, complete the following steps to move the docker image to the new VM:
  - Save the docker image as a tarball by running the following command:
    docker save myimage:latest | gzip > myimage_latest.tar.gz
  - SCP the tarball to the other remote VM.
  - Load the tarball by running the following command:
    docker load < myimage_latest.tar.gz
Step 7: Run the application
- Log in to the virtual machine (VM) where you are going to run the application. Verify that docker is installed on the VM. If this VM is different from the VM where the docker image was built, you must save and load the docker image to this VM.
- Compose the environment file that is used to start the docker container. You must have all the required values to compose the docker.env file.
  - Create a docker.env file.
  - Add the following entries to the docker.env file:
    WEB_HDFS_URL=http://lamy1.fyre.companyserver.com:50070
    HDFS_FILE_BASE_URL=hdfs://lamy1.fyre.companyserver.com:9000
    SPARK_LIVY_URL=http://sheaffer1.fyre.companyserver.com:8998
    BASE_HDFS_LOCATION=sw/openscale
    WOS_ENV_ARCHIVE_LOCATION=hdfs://lamy1.fyre.companyserver.com:9000/py_packages/wos_env.zip#wos_env
    WOS_ENV_SITE_PACKAGES_PATH=./wos_env/wos_env/lib/python3.6/site-packages:
  - If the application runs against a Kerberized Hadoop cluster, add the following additional entries to the docker.env file:
    KERBEROS_ENABLED=true
    HDFS_KEYTAB_FILE_PATH=/opt/ibm/wos/python/keytabs/hdfs.keytab
    SPARK_YARN_KEYTAB_FILE_PATH=/home/hadoop/hadoop/etc/hadoop/yarn.keytab
- List the docker images and obtain the $IMAGE_ID value by running the following command:
  docker images
- Start the docker container by running the following command:
  docker run --env-file docker.env -p 5000:9443 --name wos-spark-manager-api $IMAGE_ID
You can now access the APIs with the following URL: http://<VM-HOST-NAME>:5000. You can also now view the swagger documentation with the following URL: http://<VM-HOST-NAME>:5000/spark_wrapper/api/explorer.
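For example, to verify that the container responds and that your allowlist credentials work, you can send a request with Basic Authentication. The user name, password, and host name in this sketch are placeholders from the auth.json template and the URLs above:
  curl -u openscale_user1:passw0rd "http://<VM-HOST-NAME>:5000/spark_wrapper/api/explorer"
If the container is running, the request returns the swagger documentation page.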
Next steps
Configure batch processing in Watson OpenScale
Parent topic: Batch processing in Watson OpenScale