Configuring batch processing
After you prepare your deployment environment, you can configure batch processing in the Watson OpenScale service. You must configure a custom machine learning engine that connects to an Apache Spark analytics engine and to either an Apache Hive database or a Db2 database through JDBC.
Prerequisites
Ensure that the steps are completed for either Preparing the batch processing environment in IBM Analytics Engine powered by Apache Spark or Preparing the batch processing environment on the Hadoop Ecosystem. For the Hadoop Ecosystem implementation, ensure that the Watson OpenScale Spark Manager Application is running.
Step 1: Run a notebook to generate required artifacts for monitoring
You must run a notebook that generates a configuration package that you can use to configure evaluations in Watson OpenScale. You can run an advanced notebook that generates a configuration package with detailed input for Hive or JDBC, or a standard notebook that generates a configuration package with less manual input for Hive or JDBC.
The configuration package contains the following artifacts:
- Common configuration JSON file
- Fairness statistics JSON file
- Drift archive
- Explain archive
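The notebook creates the configuration package for you. As a rough illustration of what the package contains, the following is a minimal sketch of assembling the archive manually from the generated artifacts; the name of the common configuration JSON file is an assumption, and only the artifacts that your notebook actually produced need to be included.

```python
# Sketch: reassemble configuration_archive.tar.gz from the generated artifacts.
# The notebook normally creates this archive for you; common_configuration.json
# is an assumed name used only for illustration.
import tarfile
from pathlib import Path

artifacts = [
    "common_configuration.json",   # common configuration JSON file (name assumed)
    "fairness_statistics.json",    # fairness statistics JSON file
    "drift_archive.tar.gz",        # drift archive
    "explainability.tar.gz",       # explain archive
]

with tarfile.open("configuration_archive.tar.gz", "w:gz") as archive:
    for name in artifacts:
        if Path(name).exists():              # include only the artifacts you generated
            archive.add(name, arcname=name)  # keep every file at the top level of the archive
```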
Step 2: Create the machine learning provider for a custom environment
- Click the configure icon to go to the system setup and create a new dedicated batch provider.
- Because batch processing supports only a custom environment, you must create a new machine learning provider and select the Custom Environment type. Provide credentials and, for the environment type, select Production.
Step 3: Use the cluster details to create and save the database connection and Spark engine
Because Watson OpenScale batch requires both a database and a Spark engine, you must configure connections to these resources. After you create a new custom machine learning provider, you must set up batch support.
- From the Batch support tab, click Add batch support.
- Enter your Hive or JDBC connection information.
Watson OpenScale supports only JDBC connections to Db2 databases. If you are unsure of your connection details, the sketch after this list shows one way to verify them.
- After you configure the connection, you must create a connection to a Spark engine. This step requires a Spark endpoint and connection credentials.
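The following is a minimal sketch of verifying Db2 JDBC connection details with a short Spark job before you save them in Watson OpenScale. The host, port, database, schema, table, and credentials are placeholders, and the Db2 JDBC driver must be available to your Spark environment.

```python
# Sketch: verify a Db2 JDBC connection with Spark before saving it in Watson OpenScale.
# Host, port, database, schema, table, and credentials are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify-db2-connection").getOrCreate()

jdbc_url = "jdbc:db2://db2-host.example.com:50000/BLUDB"  # placeholder endpoint

df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "com.ibm.db2.jcc.DB2Driver")  # requires the Db2 JDBC driver on the classpath
    .option("dbtable", "MYSCHEMA.PAYLOAD_TABLE")    # placeholder schema and table
    .option("user", "db2-user")
    .option("password", "db2-password")
    .load()
)
df.printSchema()  # confirms that the connection and table are reachable
```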
Step 4: Create the batch subscription
- Return to the Watson OpenScale Insights Dashboard to add a new deployment. Click Add to dashboard.
- Select the machine learning provider that you created.
- Select the Self-managed deployment type.
- Enter a deployment name and click Configure.
- If you want to use a synchronous endpoint to score your deployment, specify a model endpoint.
The endpoint enables perturbation-based fairness evaluations, indirect bias evaluations, and perturbation-based explainability methods that are also available for online model deployments. A scoring request sketch follows this list.
- Click Configure monitors.
- Click the Edit icon on the Model input pane.
You must use the Numeric/categorical data type as the default option.
- Select an algorithm type and click Save and continue.
- Click the Edit icon on the Configuration package tile to upload the configuration_archive.tar.gz file that you generated from your notebook.
If the configuration package contains evaluation artifacts, you don't need to manually upload the artifacts to enable the evaluations. For example, if your configuration package contains the drift archive, the drift archive is automatically uploaded to configure the drift evaluation.
- Click the Edit icon on the Analytics engine pane and select a Spark engine.
Specify your settings and click Save and continue.
- In the Payload data section, select Create new table, Use existing table, or Do not use.
If you select Do not use, click Next. You can't configure fairness, drift, or explainability evaluations if you don't use a payload table.
If you want to use a payload table, specify the payload data details and click Next.
- In the Feedback data section, select Create new table, Use existing table, or Do not use.
If you select Do not use, click Next. You can't configure quality evaluations if you don't use a feedback table.
If you want to use a feedback table, specify the feedback data details and click Next.
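If you specify a model endpoint, Watson OpenScale can score your deployment synchronously. The following is a minimal sketch of such a scoring request, assuming the endpoint accepts the fields-and-values JSON request format that Watson OpenScale uses with custom machine learning engines; the URL, token, and feature names are hypothetical.

```python
# Sketch: send a scoring request to a synchronous model endpoint.
# The URL, token, and field names are hypothetical; adapt them to your deployment.
import requests

scoring_url = "https://custom-ml-host.example.com/v1/deployments/credit-risk/online"  # placeholder
payload = {
    "fields": ["age", "income", "loan_amount"],   # example feature columns
    "values": [[35, 52000.0, 12000.0]],           # one record per inner list
}
headers = {
    "Authorization": "Bearer <access-token>",     # use the credentials of your custom provider
    "Content-Type": "application/json",
}

response = requests.post(scoring_url, json=payload, headers=headers, timeout=60)
response.raise_for_status()
print(response.json())  # expected to return fields and values, including prediction and probability
```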
Step 5: Enable the fairness evaluation
- In the Evaluations section, click Fairness.
- If you did not upload a configuration package that contained the fairness_statistics.json file, run the common configuration notebook to generate the file and upload it.
The file is uploaded automatically if you provide a configuration package that contains the fairness_statistics.json file when you specify your model details.
- Specify the Favorable outcomes.
- Optional: Click the Edit icon on the Sample size pane.
If you specify a model endpoint when you create your batch subscription, you can select Evaluate using balanced set to specify a sample size percentage.
You can also specify a minimum sample size.
- Specify the Features to evaluate.
Step 6: Enable the quality evaluation
- In the Evaluations section, click Quality.
- Click the Edit icon on the Quality thresholds pane and specify thresholds for your quality metrics.
- Optional: Click the Edit icon on the Sample size pane and specify the Minimum sample size.
If you don't specify a sample size, all of the model records are evaluated.
Step 7: Enable the drift evaluation
- In the Evaluations section, click Drift.
- In the Drift model section, click the Edit icon.
- The Training option defaults to Train in a data science notebook. Click Next.
- If you did not upload a configuration package that contained the drift archive, run the common configuration notebook to generate the drift_archive.tar.gz file and upload it.
The file is uploaded automatically if you provide a configuration package that contains the drift archive when you specify your model details.
- In the Drifted data section, select Create new table or Use existing table.
Specify the Data warehouse connection information and click Next.
- Enter the Drift threshold and click Next.
- Enter the Sample size and click Save.
- Click Go to model summary.
- From the Actions menu, click Evaluate now.
- From the Evaluate now panel, click Evaluate now.
Step 8: Enable explainability
- In the Explainability section, click General settings.
- In the Explanation data section, click the Edit icon.
- In the Explanation data section, select the Data warehouse connection name and specify details for the explanation results and explanation queue table.
- If you did not upload a configuration package that contained the explain archive, run the common configuration notebook to generate the explainability.tar.gz file and upload it.
The file is uploaded automatically if you provide a configuration package that contains the explain archive when you specify your model details.
- Click the Edit icon on the Explanation method tile and specify the explanation methods that you want to use.
- If you enable SHAP global explanations or choose SHAP as the local explanation method, configure your SHAP settings on the SHAP tab.
Step 9: Run a notebook that provides drift analysis
To view drift analysis when you enable batch processing, you must process transactions by using a custom notebook that analyzes, through Hive or JDBC, the payload transactions that cause drift. You can download the notebook and the code snippet that you need to populate the notebook from the Drift monitor visualization window.
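As a rough illustration of what such a notebook does, the following sketch reads a payload table from Hive with Spark and checks the prediction distribution over a scoring window. The database, table, and column names are placeholders; use the downloaded notebook and its generated code snippet for the supported analysis.

```python
# Sketch: read payload transactions from a Hive table for drift analysis.
# Database, table, and column names are placeholders; the downloaded notebook
# contains the supported analysis logic.
from pyspark.sql import SparkSession, functions as F

spark = (
    SparkSession.builder.appName("drift-payload-analysis")
    .enableHiveSupport()          # required to read Hive tables
    .getOrCreate()
)

payload = spark.table("mydatabase.payload_table")   # placeholder Hive database and table

# Restrict to the evaluation window that you want to analyze (column name assumed).
window = payload.filter(
    (F.col("scoring_timestamp") >= "2024-01-01") & (F.col("scoring_timestamp") < "2024-02-01")
)
window.groupBy("prediction").count().show()          # simple distribution check on the scored output
```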
Limitations
When you enable batch processing, Watson OpenScale has the following limitations:
- If you create a self-managed deployment to configure batch processing on an s390x zLinux server instance, you must use 2 or more as the default Spark parameters for min/max executors and driver and executor cores/memory.
- Only support for structured data
- Only support for Production environments
- Combinations of environments supported
- Remote Spark Cluster + Non-kerberized Hive
- Remote Spark Cluster + Kerberized Hive
- Remote Spark Cluster + Db2
- IAE + Non-kerberized Hive
- IAE + Db2
- IAE + Kerberized Hive
- During an evaluation request, you might see an error on the Model Summary window that displays, “Evaluation for Quality/Drift monitor didn’t finish within 900 seconds.” Although you see the error, the monitor evaluation runs to completion. If you encounter this error, navigate back to the Insights dashboard, check whether a quality or drift score is visible in the deployment tile, and then return to the Model Summary window.
- You must create a new volume and not use the default volume when you use Analytics Engine powered by Apache Spark to prepare your deployment environment
- You must install dependencies by using Python 3.9.x or higher and upload them to the mount path when you use Analytics Engine powered by Apache Spark to prepare your deployment environment
- In your Hive table, if there is a column, whether feature or not, that is named rawPrediction, configuring and evaluating the drift monitor fails.
- If a column named probability is in your Hive table, and it is not configured with the modeling-role probability, configuring and evaluating the drift monitor fails.
- PySpark ML, the framework that is used to build the drift detection model, does not support Boolean fields when the drift model is trained. The training table must have any boolean columns represented as string.
- When configuring batch subscriptions, if the partition column name is changed for an existing table, Watson OpenScale doesn't validate the column if the name is not specified in the table. You must verify that the partition column name that you specify is also in the table. If the partition column isn't in the table, your monitor evaluations might fail or run incorrectly.
- If the drift detection model was generated by running the configuration notebook against a Hadoop cluster (Cluster A) that is different from the Hadoop cluster (Cluster B) that is used for monitoring, evaluating the drift monitor fails. To correct this problem, you must perform the following steps (a scripted sketch of these steps follows this list):
  - Download the drift archive by using the notebook.
  - Extract the contents of the drift archive to a folder.
  - In a text editor, open the ddm_properties.json file.
  - Look for the drift_model_path property. This property has the path where the drift model is stored in HDFS in this cluster.
  - Download the folder in the drift_model_path to your local workstation.
  - Copy this folder to an HDFS location /new/path in your production cluster.
  - Update the drift_model_path property in the ddm_properties.json file. The new property looks like the following sample: hdfs://production_cluster_host:port/new/path
  - Compress the contents of the drift archive folder as a tar.gz file. Do not compress the folder itself, only the contents. All the files must be present at the top level and not inside a folder in the archive.
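The following is a minimal sketch of scripting those steps, assuming the drift archive contents were extracted to a local folder named drift_archive_contents; the HDFS path is a placeholder and must match the location where you copied the drift model in your production cluster.

```python
# Sketch: update drift_model_path in ddm_properties.json and rebuild the drift archive.
# The extracted folder name and the new HDFS path are placeholders.
import json
import tarfile
from pathlib import Path

extracted = Path("drift_archive_contents")           # folder with the extracted archive contents
properties_file = extracted / "ddm_properties.json"

properties = json.loads(properties_file.read_text())
properties["drift_model_path"] = "hdfs://production_cluster_host:port/new/path"  # placeholder
properties_file.write_text(json.dumps(properties, indent=2))

# Compress only the contents, so every file sits at the top level of the archive.
with tarfile.open("drift_archive.tar.gz", "w:gz") as archive:
    for item in extracted.iterdir():
        archive.add(item, arcname=item.name)
```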
When you want to specify a model endpoint to configure a Watson Machine Learning batch deployment, Watson OpenScale has the following limitations:
- You must create a remote Watson Machine Learning provider.
- You can't use the Watson Machine Learning deployment space that contains your online deployment to add a batch deployment.
Parent topic: Batch processing in Watson OpenScale