Preparing the batch processing environment in IBM Analytics Engine Powered by Apache Spark

To configure batch processing in the Watson OpenScale service with Analytics Engine powered by Apache Spark, you must prepare the deployment environment.

Requirements for Analytics Engine powered by Apache Spark

You must have the following artifacts to configure batch processing with Analytics Engine powered by Apache Spark:

  • An instance of Analytics Engine powered by Apache Spark. When you install Analytics Engine powered by Apache Spark, update your custom resource (CR) definition to include the following lines (a patch sketch follows this list):

    serviceConfig:
      sparkAdvEnabled: "true"
    
  • An additional volume that is different from the default volume

  • A Cloud Pak for Data API key that provides permissions to write to the volume. This API key is the platform API key that enables services to submit jobs and write files to the volume.

  • An Apache Hive database, version 2.3.7 or later (a version check follows this list)

  • Specialized notebooks that you run with Watson OpenScale to configure batch processing for Hive or JDBC

  • A feedback table, a training data table, and a payload logging table that you create in the Hive database

  • A drifted_transaction table that stores the transactions that are used for post-processing analysis
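
If Analytics Engine powered by Apache Spark is already installed, you can apply the serviceConfig setting from the first requirement without editing the CR by hand. The following patch sketch assumes the CR kind AnalyticsEngine, a CR named analyticsengine-sample, and a project named cpd-instance; verify these names on your cluster before you run it:

    # Hypothetical CR name and project: confirm them first
    oc get AnalyticsEngine --all-namespaces
    # Merge-patch the CR so that spec.serviceConfig.sparkAdvEnabled is "true"
    oc patch AnalyticsEngine analyticsengine-sample -n cpd-instance --type merge \
      -p '{"spec":{"serviceConfig":{"sparkAdvEnabled":"true"}}}'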

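To confirm that your Apache Hive installation meets the minimum version requirement, you can run the following check from a shell on the Hive host, assuming that the hive CLI is on the PATH:

    hive --version
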
Step 1: Create a Python archive of modules that are required for model evaluation

Model evaluations that run on the Hadoop ecosystem require dependent Python packages. Without these packages, the evaluations fail. Use the following steps to install the dependencies and upload them to a location in the Hadoop Distributed File System (HDFS):

  1. Log in to a Linux operating system where Python is installed and check the version of Python by running the following command:

    python --version
    Python 3.11.9
    
  2. Install the python3.11-devel package, which installs the GNU Compiler Collection (GCC) libraries, by running the following command:

    yum install python3.11-devel
    
  3. Navigate to the directory where the Python virtual environment is created, such as the /opt folder:

    cd /opt
    
  4. Delete any previously created virtual environment by running the following commands:

    rm -fr wos_env
    rm -fr wos_env.zip
    

    The wos_env folder contains a Python virtual environment with the Spark job dependencies.

  5. Create and activate a virtual environment by running the following commands:

    python -m venv wos_env
    source wos_env/bin/activate
    

    The python command creates the wos_env virtual environment, and the source command activates it.

  6. Upgrade pip by running the following command:

    pip install --upgrade pip
    
  7. Install the dependencies. The package version that you specify depends on the version of Watson OpenScale that you use (a verification check follows this list):

    • Install the postgresql-devel package, which provides the PostgreSQL development headers and libraries:

        yum install postgresql-devel

    • Install the ibm-metrics-plugin package and its dependencies:

        python -m pip install "ibm-metrics-plugin~=5.1.0"
        
  8. Deactivate the virtual environment by running the deactivate command.
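
As noted in step 7, you can confirm that the packages landed in the virtual environment before you deactivate it. The pip show command prints the package metadata when the installation succeeded and prints nothing when the package is missing:

    pip show ibm-metrics-plugin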

Step 2: Upload the archive

Create a compressed file that contains the virtual environment, and check its size, by running the following commands:

zip -q -r wos_env.zip wos_env/
ls -alt --block-size=M wos_env.zip

Generate an API authorization token, and then upload the compressed file to the storage volume by using the PUT /volumes API command.
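
The upload command expects a ${TOKEN} variable that holds a ZenApiKey credential. The following sketch shows one way to set it, assuming that <username> is your Cloud Pak for Data user name and <api_key> is the platform API key from the requirements list; Cloud Pak for Data accepts the base64-encoded username:api_key pair as a ZenApiKey:

# Build a ZenApiKey token from the assumed <username> and <api_key> placeholders
TOKEN=$(printf '%s:%s' '<username>' '<api_key>' | base64 -w0)

With ${TOKEN} set, upload the archive: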

curl -k -i -X PUT 'https://<cluster_url>/zen-volumes/<volume_name>/v1/volumes/files/py_packages%2Fwos_env?extract=true' -H "Authorization: ZenApiKey ${TOKEN}"  -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F  'upFile=@/<path_to_parent_dir>/wos_env.zip'
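
The extract=true query parameter instructs the volumes API to decompress the archive after the upload completes, so the contents of wos_env become available under the py_packages path on the volume.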

Configuring Analytics Engine powered by Apache Spark

To prepare your deployment environment to process large volumes of data, you can increase the resources that the analytics engine uses to process your data for quality evaluations. You can use the following settings to increase resources in Analytics Engine powered by Apache Spark for your Watson OpenScale batch subscription:

  • If your deployment contains at least 1 million records, use the following settings to increase resources:

    {
      "driver_cores": 1,
      "driver_memory": 2,
      "executor_cores": 4,
      "executor_memory": 6,
      "max_num_executors": 4,
      "min_num_executors": 2
    }
    
  • If your deployment contains at least 5 million records, use the following settings to increase resources:

    {
      "driver_cores": 3,
      "driver_memory": 6,
      "executor_cores": 4,
      "executor_memory": 6,
      "max_num_executors": 4,
      "min_num_executors": 2
    }
    
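For example, the second set of values lets an evaluation job scale between 2 and 4 executors, for a peak of 4 × 4 = 16 executor cores and 4 × 6 = 24 units of executor memory, in addition to the 3 cores and 6 units of memory that the driver reserves.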

Next steps

Configure batch processing in Watson OpenScale

Parent topic: Batch processing in Watson OpenScale