Preparing the batch processing environment in IBM Analytics Engine Powered by Apache Spark
To configure batch processing in the Watson OpenScale service with Analytics Engine powered by Apache Spark, you must prepare the deployment environment.
Requirements for Analytics Engine powered by Apache Spark
You must have the following artifacts to configure batch processing with Analytics Engine powered by Apache Spark:

- An instance of Analytics Engine powered by Apache Spark. When you install Analytics Engine powered by Apache Spark, update your custom resource (CR) definition to include the following lines:

  ```yaml
  serviceConfig:
    sparkAdvEnabled: "true"
  ```

- An additional volume that is different from the default volume
- A Cloud Pak for Data API key that provides permissions to write to the volume. This API key is the platform API key that enables services to submit jobs and write files to the volume.
- An Apache Hive version 2.3.7 or later database
- Specialized notebooks that configure batch processing for Hive or JDBC and that you run together with Watson OpenScale
- A feedback table, a training data table, and a payload logging table that you create in the Hive database
- A `drifted_transaction` table that stores the transactions that are used for post-processing analysis
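The tables are created with ordinary HiveQL DDL. The sketch below is purely illustrative: the JDBC URL and every column name are hypothetical placeholders, not a schema that Watson OpenScale prescribes; the real columns must match your model's payload and the configuration notebooks.

```shell
# Hypothetical sketch only: the JDBC URL and column names are placeholders,
# not a prescribed schema; match the columns to your model's payload.
beeline -u 'jdbc:hive2://<hive_host>:10000/<database>' -e "
  CREATE TABLE IF NOT EXISTS drifted_transaction (
    scoring_id     STRING,
    run_id         STRING,
    is_model_drift BOOLEAN,
    is_data_drift  BOOLEAN
  ) STORED AS PARQUET;"
```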
Step 1: Create a Python archive of modules that are required for model evaluation
Model evaluations that run on the Hadoop ecosystem require dependent Python packages. Without these packages, the evaluations fail. Use the following steps to install the dependencies and upload them to a location in the Hadoop Distributed File System (HDFS):

1. Log in to a Linux operating system where Python is installed and check the version of Python by running the following command:

   ```
   python --version
   Python 3.11.9
   ```

2. Install the `python3-devel` package, which installs the GNU Compiler Collection (GCC) libraries, by running the following command:

   ```
   yum install python3.11-devel
   ```

3. Navigate to the directory where the Python virtual environment is created, such as the `/opt` folder:

   ```
   cd /opt
   ```

4. Delete any previously created virtual environment by running the following commands:

   ```
   rm -fr wos_env
   rm -fr wos_env.zip
   ```

   The `wos_env` folder contains a Python virtual environment with Spark job dependencies in it.

5. Create and activate a virtual environment by running the following commands:

   ```
   python -m venv wos_env
   source wos_env/bin/activate
   ```

   The `wos_env` virtual environment is created and the `source` command activates it.

6. Upgrade the pip environment by running the following command:

   ```
   pip install --upgrade pip
   ```

7. Install the dependencies for your version of Watson OpenScale by running the following commands:

   1. Install the `postgresql-devel` package:

      ```
      yum install postgresql-devel
      ```

   2. Install the dependencies:

      ```
      python -m pip install "ibm-metrics-plugin~=5.1.0"
      ```

8. Deactivate the virtual environment by running the `deactivate` command.
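The virtual-environment steps above can be sketched as one script. This is a sketch under stated assumptions: it uses a writable scratch directory instead of `/opt`, assumes `python3` with the `venv` module is on the PATH, and omits the `yum` and `ibm-metrics-plugin` installs, which need root access and the package index.

```shell
#!/bin/sh
# Sketch of the virtual-environment steps, under these assumptions:
# a scratch directory instead of /opt, python3 with the venv module
# available; the yum and ibm-metrics-plugin installs are omitted.
set -e
WORKDIR="$(mktemp -d)"
cd "$WORKDIR"
rm -rf wos_env wos_env.zip          # delete any previous environment
python3 -m venv wos_env             # create the virtual environment
. wos_env/bin/activate              # activate it
pip install --upgrade pip >/dev/null || echo "pip upgrade skipped (offline?)"
deactivate                          # leave the environment ready to populate
echo "venv ready: $WORKDIR/wos_env"
```

After the script finishes, install the dependencies into `wos_env` as in step 7 before zipping it.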
Step 2: Upload the archive

1. Create a compressed file that contains the virtual environment by running the following commands:

   ```
   zip -q -r wos_env.zip wos_env/
   ls -alt --block-size=M wos_env.zip
   ```

2. Generate an API authorization token and upload the archive to a volume by using the `PUT /volumes` API command:

   ```
   curl -k -i -X PUT 'https://<cluster_url>/zen-volumes/<volume_name>/v1/volumes/files/py_packages%2Fwos_env?extract=true' \
     -H "Authorization: ZenApiKey ${TOKEN}" \
     -H 'cache-control: no-cache' \
     -H 'content-type: multipart/form-data' \
     -F 'upFile=@/<path_to_parent_dir>/wos_env.zip'
   ```
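The upload command reads `${TOKEN}` from the environment. With the `ZenApiKey` scheme shown in the `Authorization` header, the token is conventionally the base64 encoding of `<username>:<api_key>`; the following sketch assumes that convention, and the placeholder values must be replaced with real credentials.

```shell
# Assumption: ZenApiKey authorization takes base64("<username>:<api_key>").
# Replace the placeholder string with your real username and platform API key.
TOKEN="$(printf '%s' '<username>:<api_key>' | base64)"
export TOKEN
echo "TOKEN set (${#TOKEN} characters)"
```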
Configuring Analytics Engine powered by Apache Spark
To prepare your deployment environment to process large volumes of data, you can increase the resources that the analytics engine uses to process your data for quality evaluations. You can use the following settings to increase resources in Analytics Engine powered by Apache Spark for your Watson OpenScale batch subscription:
- If your deployment contains at least one million records, use the following settings to increase resources:

  ```json
  {
    "driver_cores": 1,
    "driver_memory": 2,
    "executor_cores": 4,
    "executor_memory": 6,
    "max_num_executors": 4,
    "min_num_executors": 2
  }
  ```

- If your deployment contains at least five million records, use the following settings to increase resources:

  ```json
  {
    "driver_cores": 3,
    "driver_memory": 6,
    "executor_cores": 4,
    "executor_memory": 6,
    "max_num_executors": 4,
    "min_num_executors": 2
  }
  ```
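As a sanity check for capacity planning, the maximum resources a job can claim under these settings are the driver plus `max_num_executors` executors (assuming, as is usual for these configurations, that the memory values are gigabytes):

```shell
# Footprint of the one-million-record settings (memory values assumed to be GB).
driver_cores=1; driver_memory=2
executor_cores=4; executor_memory=6
max_num_executors=4
echo "max cores:     $(( driver_cores + max_num_executors * executor_cores ))"
echo "max memory GB: $(( driver_memory + max_num_executors * executor_memory ))"
```

The same arithmetic gives 19 cores and 30 GB for the five-million-record settings.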
Next steps
Configure batch processing in Watson OpenScale
Parent topic: Batch processing in Watson OpenScale