Preparing the batch processing environment on the Hadoop Ecosystem
To configure batch processing on the Hadoop Ecosystem, you must prepare the deployment environment.
Prerequisites
You must have the following artifacts to configure batch processing on the Hadoop Ecosystem:
- An Apache Hive version 2.3.7 or later database
- Apache Hadoop 2.10.0 with Apache Spark 2.4.7
- A docker engine to build and package the Spark Manager Application
- A specialized notebook that you run for model evaluations
- A feedback table, training data table, and a payload logging table that you create in the Hive database
- Livy and WebHDFS available on the Hadoop Ecosystem (requires Apache Hadoop 2.10.0)
- A set of users who can access this application through Basic Authentication
- A Python archive that contains modules that are required for model evaluation
- A drifted_transaction table that stores the transactions that are used for post-processing analysis
- Optionally, a Kerberized Hadoop Ecosystem for running the application
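If you want to confirm that your cluster meets these version prerequisites, the standard CLI version commands are one way to check, assuming the Hive, Hadoop, and Spark clients are available on your path (an optional check, not part of the documented procedure):
hive --version            # expect 2.3.7 or later
hadoop version            # expect 2.10.0
spark-submit --version    # expect 2.4.7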
Step 1: Ensure Livy and WebHDFS are available on the Hadoop Ecosystem
Because the Spark Manager Application uses Livy to submit jobs and WebHDFS to access files, Livy and WebHDFS must be installed on the Hadoop Ecosystem. If Livy and WebHDFS aren't installed, you must request them. Your Hadoop cluster administrator can provide the base URLs for Livy, WebHDFS, and the HDFS file base. You must have these URLs to prepare the deployment environment. The following examples show base URLs for Livy, WebHDFS, and the HDFS file base:
- WebHDFS URL: http://$hostname:50070
- HDFS file base URL: hdfs://$hostname:9000
- Livy URL: http://$hostname:8998
You can also request the following items from the Hadoop cluster administrator:
- A base path on the HDFS file system that you can write files to
- A path to the yarn.keytab file, which is specified as the conf.spark.yarn.keytab parameter in the job request payload
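If you want to confirm that the base URLs you received are reachable before you continue, you can send simple HTTP requests to the WebHDFS and Livy REST endpoints. This is an optional check that assumes the example $hostname placeholders above; substitute your own host names and ports.
curl "http://$hostname:50070/webhdfs/v1/?op=LISTSTATUS"   # WebHDFS: lists the HDFS root directory
curl "http://$hostname:8998/sessions"                     # Livy: lists the active sessions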
Step 2: Using a Kerberized Hadoop Ecosystem
If you run your application in a Kerberized Hadoop Ecosystem, the application must generate a Kerberos ticket before it can submit requests to Livy or WebHDFS. To generate a Kerberos ticket, you need the following files and ID:
- A keytab file
- A krb5.conf file
- A Kerberos user principal
To authenticate to the Key Distribution Center (KDC), Kerberos-enabled machines require a keytab file. The keytab file is an encrypted, local, on-disk copy of the host key. Your Hadoop cluster administrator must provide this keytab file. You must request the hdfs.keytab file to prepare your deployment environment.
The krb5.conf file contains Kerberos configuration information, such as the locations of KDCs and admin servers for Kerberos realms. The file also contains defaults for the current realm and for Kerberos applications, and mappings of hostnames onto Kerberos realms. Your Hadoop cluster administrator must provide this configuration file. You must request the krb5.conf file to prepare your deployment environment.
A Kerberos principal represents a unique identity, such as openscale_user1/$hostname@HADOOPCLUSTER.LOCAL, to which Kerberos can assign tickets for access to Kerberos-aware services. Principal names contain several components that are separated by a forward slash (/). You can also specify a realm as the last component of the name by using the @ symbol. Your Hadoop cluster administrator must provide the principal ID. You must request the principal ID to prepare your deployment environment. If you need multiple users to submit jobs, you must request more principal IDs.
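For example, after you receive the keytab file, the krb5.conf file, and the principal ID, you can generate and verify a Kerberos ticket with the standard Kerberos client tools. The following sketch uses illustrative paths and the example principal from above; substitute the values that your administrator provides.
export KRB5_CONFIG=/path/to/krb5.conf                                           # use the provided Kerberos configuration
kinit -kt /path/to/hdfs.keytab openscale_user1/$hostname@HADOOPCLUSTER.LOCAL    # obtain a ticket with the keytab and principal
klist                                                                           # confirm that the ticket was granted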
Step 3: Create an allowlist of users who can access this application through Basic Authentication
You must create an allowlist of the usernames and passwords that are used to access the Spark Manager Application through basic authentication. Specify this allowlist in an auth.json file.
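For reference, a minimal auth.json with a single allowlisted user might look like the following sketch; the full template, including the optional Kerberos mapping, is shown in Step 6, and the username and password here are placeholders.
{
  "allowlisted_users": {
    "openscale_user1": "passw0rd"
  }
}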
Step 4: Create a Python archive of modules that are required for model evaluation
Model evaluations that run on the Hadoop Ecosystem require dependent Python packages. Without these packages, the evaluations fail. You can use the following steps to install these dependencies and upload them to a location in the Hadoop Distributed File System (HDFS).
- Log in to a Linux operating system where Python is installed and check the Python version by running the following command:
python --version
The output shows the installed version, for example Python 3.7.9.
- Install the python3-devel package, which installs the GNU Compiler Collection (GCC) libraries, by running the following command:
yum install python3-devel
- Navigate to the directory where the Python virtual environment is created, such as the /opt folder:
cd /opt
- Delete any previously created virtual environment by running the following commands:
rm -fr wos_env
rm -fr wos_env.zip
The wos_env folder contains a Python virtual environment with the Spark job dependencies in it.
- Create and activate a virtual environment by running the following commands:
python -m venv wos_env
source wos_env/bin/activate
The wos_env virtual environment is created and the source command activates the virtual environment.
- Upgrade pip by running the following command:
pip install --upgrade pip
- To install all the dependencies, choose whether to install them individually or in a batch from a requirements.txt file. They must be installed in the following order.
- If you're using Python 3.7, run the following commands to install the required packages individually:
python -m pip install numpy==1.20.2
python -m pip install scipy==1.6.3
python -m pip install pandas==1.2.4
python -m pip install scikit-learn==0.24.2
python -m pip install osqp==0.6.1
python -m pip install cvxpy==1.0.25
python -m pip install marshmallow==3.11.1
python -m pip install requests==2.25.1
python -m pip install jenkspy==0.2.0
python -m pip install pyparsing==2.4.7
python -m pip install tqdm==4.60.0
python -m pip install more_itertools==8.7.0
python -m pip install tabulate==0.8.9
python -m pip install py4j==0.10.9.2
python -m pip install pyarrow==4.0.0
python -m pip install "ibm-wos-utils==4.6.*"
Alternatively, you can put all of the modules into a requirements.txt file and install them in one step.
- If you're using Python 3.7, create the requirements.txt file by adding the following lines to the file. Then, run the python -m pip install -r requirements.txt command:
numpy==1.20.2
scipy==1.6.3
pandas==1.2.4
osqp==0.6.1
cvxpy==1.0.25
marshmallow==3.11.1
requests==2.25.1
jenkspy==0.2.0
pyparsing==2.4.7
tqdm==4.60.0
more_itertools==8.7.0
tabulate==0.8.9
py4j==0.10.9.2
pyarrow==4.0.0
ibm-wos-utils==4.6.*
- Deactivate the virtual environment by running the deactivate command.
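Optionally, before you deactivate the environment, you can verify that the installed packages resolved without conflicts by running the standard pip checks (an optional step that is not part of the documented procedure):
python -m pip check    # reports missing or conflicting dependencies, if any
python -m pip list     # lists the installed packages and their versions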
Step 5: Upload the archive
You must compress the virtual environment and upload the archive file to HDFS.
- Create a compressed file of the virtual environment by running the following commands:
zip -r wos_env.zip wos_env/
ls -alt --block-size=M wos_env.zip
- Authenticate to HDFS by running the following command:
kinit -kt /home/hadoop/keytabs/hdfs.keytab hdfs/$HOST@$REALM
- In HDFS, create a py_packages folder and add the wos_env.zip file to that folder by running the following commands:
hdfs dfs -mkdir /py_packages
hdfs dfs -put -f wos_env.zip /py_packages
- Verify that you successfully added the file to HDFS by running the following command:
hdfs dfs -ls /py_packages
The following output displays when you successfully add the file:
Found 1 items
-rw-r--r--   2 hdfs supergroup  303438930 2020-10-07 16:23 /py_packages/wos_env.zip
Step 6: Package the application into a docker image
You must package the application into a docker image. To do so, clone the GitHub repository that contains the Spark manager reference application for batch processing and complete the following steps.
- Clone the GitHub repository to the wos-spark-manager-api folder.
- Navigate to the wos-spark-manager-api folder.
- Update the service/security/auth.json file with the allowlisted users that you created. The updated file looks similar to the following template:
{
  "allowlisted_users": {
    "openscale_user1": "passw0rd",
    "openscale_user2": "passw0rd",
    "openscale_user3": "passw0rd",
    "openscale": "passw0rd"
  }
}
- If you are packaging this application to run on a Kerberized Hadoop Cluster, you must also complete the following steps:
- Place the hdfs.keytab and the krb5.conf files in the root of the wos-spark-manager-api folder.
- Update the service/security/auth.json file with the user_kerberos_mapping mapping that contains the Kerberos principals that are provided to you by the administrator. The file is similar to the following template:
{
  "allowlisted_users": {
    "openscale_user1": "passw0rd",
    "openscale_user2": "passw0rd",
    "openscale_user3": "passw0rd",
    "openscale": "passw0rd"
  },
  "user_kerberos_mapping" : {
    "openscale_user1": "openscale_user1/$hostname@HADOOPCLUSTER.LOCAL",
    "openscale_user2": "openscale_user2/$hostname@HADOOPCLUSTER.LOCAL",
    "openscale_user3": "openscale_user3/$hostname@HADOOPCLUSTER.LOCAL"
  }
}
- In the payload/Dockerfile file, uncomment lines 53-55.
- Run the following command to build the docker image:
docker build -t <repository/image_name>:<version> -f payload/Dockerfile .
- If you run the application on a different virtual machine (VM) from the one where you built the docker image, complete the following steps to move the image to that VM:
- Save the docker image as a tarball by running the following command:
docker save myimage:latest | gzip > myimage_latest.tar.gz
- SCP the tarball to the other remote VM.
- Load the tarball by running the following command:
docker load < myimage_latest.tar.gz
Step 7: Run the application
- Log in to the virtual machine (VM) where you are going to run the application. Verify that docker is installed on the VM.
If this VM is different from the one where the docker image was built, you must save and load the docker image to this VM.
- Compose the environment file that is used to start the docker container. You must have all the required values to compose the docker.env file.
- Create a docker.env file.
- Add the following entries in the docker.env file:
WEB_HDFS_URL=http://lamy1.fyre.companyserver.com:50070
HDFS_FILE_BASE_URL=hdfs://lamy1.fyre.companyserver.com:9000
SPARK_LIVY_URL=http://sheaffer1.fyre.companyserver.com:8998
BASE_HDFS_LOCATION=sw/openscale
WOS_ENV_ARCHIVE_LOCATION=hdfs://lamy1.fyre.companyserver.com:9000/py_packages/wos_env.zip#wos_env
WOS_ENV_SITE_PACKAGES_PATH=./wos_env/wos_env/lib/python3.6/site-packages:
- If the application runs against a Kerberized Hadoop cluster, add the following additional entries to the docker.env file:
KERBEROS_ENABLED=true
HDFS_KEYTAB_FILE_PATH=/opt/ibm/wos/python/keytabs/hdfs.keytab
SPARK_YARN_KEYTAB_FILE_PATH=/home/hadoop/hadoop/etc/hadoop/yarn.keytab
- List the docker images and obtain the $IMAGE_ID value by running the following command:
docker images
- Start the docker container by running the following command:
docker run --env-file docker.env -p 5000:9443 --name wos-spark-manager-api $IMAGE_ID
You can now access the APIs with the following URL: http://<VM-HOST-NAME>:5000. You can also view the swagger documentation with the following URL: http://<VM-HOST-NAME>:5000/spark_wrapper/api/explorer.
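As a quick smoke test, you can call the running container with one of the allowlisted username and password pairs from your auth.json file. The credentials in the following sketch are placeholders; replace <VM-HOST-NAME> with your VM's host name.
# Basic-authentication smoke test against the swagger documentation endpoint
curl -u openscale_user1:passw0rd "http://<VM-HOST-NAME>:5000/spark_wrapper/api/explorer"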
Next steps
Configure batch processing in Watson OpenScale
Parent topic: Batch processing in Watson OpenScale