
Batch processing

Watson OpenScale can process millions of transactions without bringing the data into the Watson OpenScale data mart. To enable this feature, you must configure Watson OpenScale to work in batch mode by connecting a custom Watson OpenScale machine learning engine to an Apache Hive database and an Apache Spark analytics engine. Unlike online scoring, where scoring happens in real time and the payload data can be logged in the Watson OpenScale data mart, batch processing is done asynchronously: the batch processor reads the scored data and derives various model metrics from it.
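
As a rough illustration of what the batch processor does conceptually, the following PySpark sketch reads scored transactions from a Hive table and derives a simple quality metric. The table and column names (batch_db.scored_loans, prediction, label) are assumptions for this example, not names that Watson OpenScale prescribes.

```python
# Minimal sketch of the batch-processing idea: read scored records from
# Hive and derive a model metric. Table and column names are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("openscale-batch-sketch")
         .enableHiveSupport()      # resolve tables through the Hive metastore
         .getOrCreate())

scored = spark.table("batch_db.scored_loans")   # hypothetical scored table

# One simple quality metric: accuracy over the whole scored batch
accuracy = (scored
            .withColumn("hit", (F.col("prediction") == F.col("label")).cast("int"))
            .agg(F.avg("hit").alias("accuracy"))
            .first()["accuracy"])
print(f"Batch accuracy: {accuracy:.4f}")
```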

Connecting your Spark instance

Because of the size of the datasets to be processed, Watson OpenScale supports monitoring batch models by using Spark jobs. You can choose a Spark engine that is part of the Cloud Pak for Data environment or an external Hadoop ecosystem.
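
If the remote Spark engine is reachable through Apache Livy, which is a common arrangement in a Hadoop ecosystem and an assumption in this sketch, a monitoring job can be submitted as a Livy batch. The endpoint, script path, and arguments below are placeholders.

```python
# Hypothetical sketch: submit a PySpark monitoring job to a remote Spark
# cluster through Apache Livy's REST API. The Livy endpoint and HDFS paths
# are placeholders, not values that Watson OpenScale provides.
import requests

livy_url = "http://livy-host:8998/batches"          # placeholder endpoint
job = {
    "file": "hdfs:///jobs/drift_evaluation.py",     # hypothetical job script
    "args": ["--table", "batch_db.scored_loans"],
    "conf": {"spark.submit.deployMode": "cluster"},
}
resp = requests.post(livy_url, json=job)
resp.raise_for_status()
print("Livy batch id:", resp.json()["id"])          # track the job by its id
```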

Scenario

Watson OpenScale supports both batch and online modes. If you use predictive models principally in batch mode on a periodic basis, the following cases are supported:

  • The model scores transactions through an online endpoint that receives data periodically.
  • The model code is embedded in a data pipeline and scores transactions from data that is made available to it periodically (see the sketch after this list).
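
For the second case, the following sketch shows what an embedded scoring step might look like: a saved Spark ML pipeline scores a periodic batch of records and appends the results to a Hive table that the batch processor can read later. The model path, table names, and column names are assumptions for illustration.

```python
# Hypothetical pipeline step: score a periodic batch with a saved Spark ML
# model and log the output to a Hive table for later batch monitoring.
# Paths, table names, and column names are illustrative assumptions.
from pyspark.ml import PipelineModel
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

model = PipelineModel.load("hdfs:///models/credit_risk")   # saved pipeline
batch = spark.table("batch_db.incoming_applications")      # periodic input

scored = model.transform(batch)
(scored.select("application_id", "probability", "prediction")
       .write.mode("append")
       .saveAsTable("batch_db.scored_applications"))       # payload table
```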

The batch processor might process millions of rows of data and typically uses database tables as inputs and outputs. In either case, data is available to the model periodically as a set of records. Because of the volume of data, Watson OpenScale supports a seven-day monitoring schedule for batch models, unlike online monitoring, which runs every hour for the fairness and quality monitors and every three hours for the drift monitor. Metrics are produced only after transactions are scored and logged in the payload database.

You can subscribe to the model in a batch environment even when no endpoint is available. You can also use models that are embedded in application or data pipeline code, because these models score data in batch mode.

Steps

  1. Ensure that you have applied any software patches for your Watson OpenScale service instance. The cpd-aiopenscale-3.5.0-patch-1 patch adds the ability to do batch processing. For details, see Available patches for Watson OpenScale for IBM Cloud Pak for Data.
  2. Configure the Apache Spark engine. Choose whether to set up an IBM Analytics Engine or a remote Spark engine.

  3. In Watson OpenScale, configure the batch processor.

Known limitations

The batch processing solution for Watson OpenScale has the following limitations:

General limitations

  • Support for only the quality, drift, and explainability monitors
  • Support for structured data only
  • Support for the Production (Custom Machine Learning Provider) environment only
  • Supported environment combinations:
    • Remote Spark cluster + non-Kerberized Hive
    • Remote Spark cluster + Kerberized Hive
    • IBM Analytics Engine (IAE) + non-Kerberized Hive
  • During an evaluation request, you might see an error on the Model Summary window that says, “Evaluation for Quality/Drift monitor didn’t finish within 900 seconds.” Although you see the error, the actual monitor evaluation runs to completion. If you encounter this error, navigate back to the Insights dashboard, check whether a quality or drift score is visible in the deployment tile, and then return to the Model Summary window.

Limitations for IBM Analytics Engine

  • You must create a new volume and not use the default volume. For more information, see Accessing data from storage.
  • You must install the dependencies by using Python 3.7.x and upload them to the mount path.

Drift

  • If your Hive table contains a column that is named rawPrediction, whether it is a feature or not, configuring and evaluating the drift monitor fails (see the preparation sketch at the end of this section).
  • If your Hive table contains a column that is named probability and it is not configured with the modeling-role probability, configuring and evaluating the drift monitor fails.
  • The drifted transactions analysis notebook can be run only from a local JupyterLab installation; it does not have a hard dependency on Watson Studio.
  • PySpark ML, the framework that is used to build the drift detection model, does not support Boolean fields when the drift model is trained. Any Boolean columns in the training table must be represented as strings, as shown in the preparation sketch at the end of this section.
  • If the drift detection model was generated by running the configuration notebook against a Hadoop Cluster (Cluster A) that is different from the Hadoop cluster (Cluster B) that is used for monitoring, evaluating the drift monitor fails. To correct this problem, you must perform the following steps:

    1. Download the drift archive by using the notebook.
    2. Extract the contents of the drift archive to a folder.
    3. In a text editor, open the ddm_properties.json file.
    4. Look for the drift_model_path property. This property contains the HDFS path where the drift model is stored on this cluster.
    5. Download the folder at the drift_model_path to your local workstation.
    6. Copy this folder to an HDFS location, such as /new/path, in your production cluster.
    7. Update the drift_model_path property in the ddm_properties.json file. The new value should look like the following sample: hdfs://production_cluster_host:port/new/path
    8. Compress the contents of the drift archive folder as a tar.gz file. Do not compress the folder itself, only its contents: all files must be at the top level of the archive, not inside a folder (as in the repacking sketch that follows).
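
A minimal way to rebuild the archive for step 8, using Python's standard tarfile module; the folder and archive names are placeholders:

```python
# Repack the drift archive so that the files sit at the top level of the
# tar.gz, not inside a wrapping folder. Names are placeholders.
import os
import tarfile

src_folder = "drift_archive"   # folder with the extracted, edited contents
with tarfile.open("drift_archive.tar.gz", "w:gz") as tar:
    for name in os.listdir(src_folder):
        # arcname=name keeps each entry at the archive root
        tar.add(os.path.join(src_folder, name), arcname=name)
```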
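
For the column-related limitations earlier in this section, the following PySpark sketch prepares a training table before the drift monitor is configured: it renames a clashing rawPrediction column and casts Boolean columns to string. The table names and the rename target are assumptions.

```python
# Hypothetical preparation of a Hive training table for the drift monitor:
# rename a clashing rawPrediction column and represent Boolean columns as
# strings. Table and column names are examples, not required values.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("batch_db.training_data")

if "rawPrediction" in df.columns:
    df = df.withColumnRenamed("rawPrediction", "raw_prediction_feature")

# Cast every Boolean column to string so that drift model training succeeds
for field in df.schema.fields:
    if isinstance(field.dataType, BooleanType):
        df = df.withColumn(field.name, F.col(field.name).cast("string"))

df.write.mode("overwrite").saveAsTable("batch_db.training_data_prepared")
```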

Next steps