Preparing the batch processing environment in IBM Analytics Engine Powered by Apache Spark on zLinux

To configure batch processing in the Watson OpenScale service on zLinux, you must prepare the deployment environment.

Requirements for Analytics Engine powered by Apache Spark on zLinux

You must install Conda version 4.12.0 with Python version 3.10. You can install Conda with Miniconda.

Limitations

Support for the zLinux platform has the following limitations:

  • Only the scikit-learn and XGBoost frameworks, and Python functions, are supported for IBM Watson Machine Learning.
  • Datamart databases must be created with a larger page size than the default value to work well with wide data sets. Use a page size of 8192 (8 K) or larger, as shown in the following example:
   CREATE DB {db name} PAGESIZE {page size}
  • Hive tables that are created with the ORC format are not supported while monitoring batch subscriptions with IBM Analytics Engine and Hive.

Step 1: Create a Conda archive of modules that are required for model evaluation

Model evaluations that run on the Hadoop ecosystem require several dependent Python packages. Without these packages, the evaluations fail. Use the following steps to install the dependencies and upload them to a location in the Hadoop Distributed File System (HDFS):

  1. Install Conda on the zLinux operating system. Red Hat Enterprise Linux 8.x is recommended for building conda packages with open-ce.

  2. Create a conda environment in which to build llvmlite as a conda package for the s390x zLinux server instance.

    cd /opt
    conda create -y -n llvmlite-env python=3.10
    conda activate llvmlite-env
    conda install -y conda-build
    conda install -y -c open-ce open-ce-builder
    

    a. Run the following commands to enable the llvmlite conda package build with an open-ce environment YAML file:

    git clone -b open-ce-r1.9 https://github.com/open-ce/open-ce.git
    cd open-ce
    yum install patch
    echo "packages:" > envs/llvmlite-env.yaml
    grep -A1 llvmlite envs/opence-env.yaml >> envs/llvmlite-env.yaml
    

    b. Build the llvmlite conda package:

    open-ce build env --python_versions 3.10 --build_types cpu envs/llvmlite-env.yaml
    

    c. Check the condabuild folder to find the built conda package:

    # ls -1 condabuild/
    channeldata.json
    index.html
    linux-s390x
    noarch
    opence-conda-env-py3.10-cpu-openmpi.yaml
    
    # ls -1 condabuild/linux-s390x/
    current_repodata.json
    current_repodata.json.bz2
    index.html
    llvmlite-0.40.1-py310h011d4d7_0.conda
    repodata_from_packages.json
    repodata_from_packages.json.bz2
    repodata.json
    repodata.json.bz2
    

    d. Run the conda index command on the condabuild directory:

    conda index condabuild
    

    e. Add the following parameters in the ~/.condarc file to use the condabuild folder as a local conda channel:

    channels:
    - /opt/open-ce/condabuild
    - defaults
    

    f. Check the Conda channel to find the llvmlite conda package:

    conda search llvmlite
    

    g. Deactivate the llvmlite-env conda virtual environment:

    conda deactivate
    
  3. Create and activate a Conda virtual environment.

    conda create -n wos_env python=3.10.13
    conda activate wos_env

  4. To install all of the dependencies that are available on the conda channels, create the conda_requirements.txt file by adding the following lines to the file:

    boto3==1.24.28
    botocore==1.27.59
    cython==0.29.36
    h5py==3.7.0
    hdf5==1.10.6
    jmespath==0.10.0
    joblib==1.1.1
    ld_impl_linux-s390x==2.38
    libgcc-ng==11.2.0
    libgomp==11.2.0
    libopenblas==0.3.21
    libstdcxx-ng==11.2.0
    llvmlite==0.40.1
    marshmallow==3.10.0
    matplotlib==3.7.1
    more-itertools==8.12.0
    numpy==1.23.5
    openssl==1.1.1w
    pandas==1.4.4
    pip==23.0.1
    psycopg2==2.9.3
    pyjwt==2.4.0
    pyparsing==3.0.9
    python==3.10.13
    requests==2.31.0
    s3transfer==0.6.0
    scikit-image==0.19.3
    scikit-learn==1.1.1
    scipy==1.10.1
    setuptools==65.6.3
    statsmodels==0.13.2
    tabulate==0.9.0
    typing-extensions==4.4.0
    

    Then, run the conda install -y --file conda_requirements.txt command.

  5. Install some of the dependencies of the ibm-metrics-plugin library.

    git clone https://github.com/tommyod/KDEpy
    cd KDEpy
    git checkout ce23348
    pip install --no-build-isolation ./
    cd ..
    rm -rf KDEpy
    
    pip install --no-build-isolation cvxpy==1.3.2 osqp==0.6.2.post0
    
  6. To install the remaining dependencies with pip, create the pip_requirements.txt file by adding the following lines to the file:

    jenkspy==0.2.0
    marshmallow==3.10.0
    numba==0.57.1
    qdldl==0.1.7.post0
    shap==0.41.0
    tqdm==4.58.0
    ibm-db==3.1.4
    ibm-db-sa==0.3.9
    ibm-wos-utils==4.8.*
    ibm-metrics-plugin==4.8.*
    

    Then, run the pip install -r pip_requirements.txt command. If any package fails to install, run the pip install command for that package again to resolve the issue.

  7. Verify the Conda virtual environment by running the from ibm_wos_utils.drift.drift_trainer import DriftTrainer command in a Python interpreter. The following error might appear:

    ValueError: numpy.ndarray size changed, may indicate binary incompatibility.
    Expected 96 from C header, got 88 from PyObject
    

    If the error appears, reinstall the numpy library with the following commands:

    pip uninstall numpy
    pip install numpy==1.23.5
    
  8. Run the conda deactivate command to deactivate the virtual environment.

Step 2: Upload the archive

Generate an API authorization token, and then upload the archive to a storage volume by using the PUT /volumes API command:

curl -k -i -X PUT 'https://<cluster_url>/zen-volumes/<volume_name>/v1/volumes/files/py_packages%2Fwos_env?extract=true' -H "Authorization: ZenApiKey ${TOKEN}"  -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F  'upFile=@/<path_to_parent_dir>/wos_env.zip'
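
The ZenApiKey value in the Authorization header is typically the Base64 encoding of <username>:<api_key> for the platform. The following sketch builds the TOKEN variable used in the command above; CPD_USERNAME and CPD_APIKEY are placeholder values, so substitute your own credentials:

```shell
# Sketch: build a ZenApiKey authorization token from platform credentials.
# CPD_USERNAME and CPD_APIKEY are placeholders; substitute real values.
CPD_USERNAME="admin"
CPD_APIKEY="my-api-key"
# Base64-encode "username:api_key" without a trailing newline
TOKEN=$(printf '%s:%s' "${CPD_USERNAME}" "${CPD_APIKEY}" | base64 | tr -d '\n')
echo "${TOKEN}"    # prints YWRtaW46bXktYXBpLWtleQ==
```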

Next steps

Configure batch processing in Watson OpenScale

Parent topic: Batch processing overview