Preparing the batch processing environment in IBM Analytics Engine Powered by Apache Spark on zLinux
To configure batch processing in the Watson OpenScale service on zLinux, you must prepare the deployment environment.
Requirements for Analytics Engine powered by Apache Spark on zLinux
You must install Conda version 4.12.0 with Python version 3.10. The tool can be installed with Miniconda.
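For example, you might install Miniconda with its Linux on s390x installer. The following commands are a sketch: the installer file name and the /opt/miniconda3 install prefix are assumptions, so pick the Miniconda release that provides Conda 4.12.0 with Python 3.10.
# Download a Miniconda installer for Linux on s390x (file name is an example;
# choose the release that matches the required Conda and Python levels)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-s390x.sh
bash Miniconda3-latest-Linux-s390x.sh -b -p /opt/miniconda3
# Activate the base environment and confirm the Conda version
source /opt/miniconda3/bin/activate
conda --version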
Limitations
Support for the zLinux platform has the following limitations:
- Only the scikit-learn and XGBoost frameworks and Python functions are supported for IBM Watson Machine Learning.
- Datamart databases must be created with a page size that is larger than the default value to work well with wide datasets, as shown in the following example. Specify a page size of 8192 (8 K) or more. A concrete sketch follows this list.
CREATE DB {db name} PAGESIZE {PAGESIZE integer}
- Hive tables that are created with the ORC format are not supported when you monitor batch subscriptions with IBM Analytics Engine and Hive.
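For the page-size limitation, the following Db2 command line sketch creates a datamart database with a 32 KB page size. The database name WOSDB is a placeholder; adjust the page size to your dataset width.
# Create the datamart database with a 32 KB page size (WOSDB is a placeholder name)
db2 "CREATE DB WOSDB PAGESIZE 32768"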
Step 1: Create a Conda archive of modules that are required for model evaluation
Model evaluations that run on the Hadoop ecosystem require several dependent Python packages. Without these packages, the evaluations fail. Use the following steps to install these dependencies and upload them to a location in the Hadoop Distributed File System (HDFS):
- Install Conda on the zLinux operating system. Red Hat Enterprise Linux versions 8.x are recommended for building conda packages with open-ce.
- Create a conda environment to build the llvmlite virtual environment as a conda package for the s390x zLinux server instance:
cd /opt
conda create -y -n llvmlite-env python=3.10
conda activate llvmlite-env
conda install -y conda-build
conda install -y -c open-ce open-ce-builder
a. Run the following commands to enable the llvmlite conda package build with an open-ce environment YAML file:
git clone -b open-ce-r1.9 https://github.com/open-ce/open-ce.git
cd open-ce
yum install patch
echo "packages:" > envs/llvmlite-env.yaml
grep -A1 llvmlite envs/opence-env.yaml >> envs/llvmlite-env.yaml
b. Build the llvmlite conda package:
open-ce build env --python_versions 3.10 --build_types cpu envs/llvmlite-env.yaml
c. Check the condabuild folder to find the built conda package:
# ls -1 condabuild/
channeldata.json
index.html
linux-s390x
noarch
opence-conda-env-py3.10-cpu-openmpi.yaml
# ls -1 condabuild/linux-s390x/
current_repodata.json
current_repodata.json.bz2
index.html
llvmlite-0.40.1-py310h011d4d7_0.conda
repodata_from_packages.json
repodata_from_packages.json.bz2
repodata.json
repodata.json.bz2
d. Run the conda index command on the condabuild directory:
conda index condabuild
e. Add the following parameters in the ~/.condarc file to use the condabuild folder as a local conda channel:
channels:
  - /opt/open-ce/condabuild
  - defaults
f. Check the Conda channel to find the llvmlite conda package:
conda search llvmlite
g. Deactivate the llvmlite-env conda virtual environment:
conda deactivate
- Create a Conda virtual environment, and then activate it:
conda create -n wos_env python=3.10.13
conda activate wos_env
- To install all of the dependencies that are available on the conda channels, create the conda_requirements.txt file by adding the following lines to the file:
boto3==1.24.28
botocore==1.27.59
cython==0.29.36
h5py==3.7.0
hdf5==1.10.6
jmespath==0.10.0
joblib==1.1.1
ld_impl_linux-s390x==2.38
libgcc-ng==11.2.0
libgomp==11.2.0
libopenblas==0.3.21
libstdcxx-ng==11.2.0
llvmlite==0.40.1
marshmallow==3.10.0
matplotlib==3.7.1
more-itertools==8.12.0
numpy==1.23.5
openssl==1.1.1w
pandas==1.4.4
pip==23.0.1
psycopg2==2.9.3
pyjwt==2.4.0
pyparsing==3.0.9
python==3.10.13
requests==2.31.0
s3transfer==0.6.0
scikit-image==0.19.3
scikit-learn==1.1.1
scipy==1.10.1
setuptools==65.6.3
statsmodels==0.13.2
tabulate==0.9.0
typing-extensions==4.4.0
Then, run the conda install -y --file conda_requirements.txt command.
- Install some of the dependencies of the ibm-metrics-plugin library:
git clone https://github.com/tommyod/KDEpy
cd KDEpy
git checkout ce23348
pip install --no-build-isolation ./
cd ..
rm -rf KDEpy
pip install --no-build-isolation cvxpy==1.3.2 osqp==0.6.2.post0
- To install the remaining dependencies with pip, create the pip_requirements.txt file by adding the following lines to the file:
jenkspy==0.2.0
marshmallow==3.10.0
numba==0.57.1
qdldl==0.1.7.post0
shap==0.41.0
tqdm==4.58.0
ibm-db==3.1.4
ibm-db-sa==0.3.9
ibm-wos-utils==4.8.*
ibm-metrics-plugin==4.8.*
Then, run the pip install -r pip_requirements.txt command. If any dependencies fail to install, reinstall them to fix any issues.
- Verify the Conda virtual environment by running the following import in a Python session:
from ibm_wos_utils.drift.drift_trainer import DriftTrainer
The following error might appear:
ValueError: numpy.ndarray size changed, can indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
If the error appears, reinstall the numpy library with the following commands:
pip uninstall numpy
pip install numpy==1.23.5
- Run the conda deactivate command to deactivate the virtual environment. Then, package the wos_env environment as an archive for upload, as shown in the sketch after these steps.
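The upload in Step 2 expects the environment to be packaged as wos_env.zip. The following commands are a sketch that assumes the environment was created under /opt/miniconda3/envs; adjust the paths to your Conda installation and verify the expected archive layout for your deployment.
# Package the wos_env environment as wos_env.zip (paths are examples)
cd /opt/miniconda3/envs/wos_env
zip -r /<path_to_parent_dir>/wos_env.zip .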
Step 2: Upload the archive
Generate an API authorization token, and then upload the archive to a volume by using the PUT /volumes API command:
curl -k -i -X PUT 'https://<cluster_url>/zen-volumes/<volume_name>/v1/volumes/files/py_packages%2Fwos_env?extract=true' -H "Authorization: ZenApiKey ${TOKEN}" -H 'cache-control: no-cache' -H 'content-type: multipart/form-data' -F 'upFile=@/<path_to_parent_dir>/wos_env.zip'
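The curl command assumes that the ${TOKEN} variable holds a valid authorization token. One way to produce a ZenApiKey token is to base64-encode your Cloud Pak for Data user name and API key, as in the following sketch; the <username> and <api_key> values are placeholders, and you should verify the token format for your platform version.
# Base64-encode <username>:<api_key> for the ZenApiKey authorization header
TOKEN=$(printf "%s:%s" "<username>" "<api_key>" | base64 -w0)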
Next steps
Configure batch processing in Watson OpenScale
Parent topic: Batch processing overview