Batch processing
Watson OpenScale can process millions of transactions without bringing the data into the Watson OpenScale data mart. To enable this feature, you must configure Watson OpenScale to work in batch mode by connecting a custom Watson OpenScale machine learning engine to an Apache Hive database and an Apache Spark analytics engine. Unlike online scoring, where scoring happens in real time and the payload data can be logged into the Watson OpenScale data mart, batch processing is done asynchronously. The batch processor reads the scored data and derives various model metrics.
Connecting your Spark instance
Because of the size of datasets to be processed, Watson OpenScale supports monitoring batch models by using Spark jobs. You can choose a Spark engine that is part of the Cloud Pak for Data environment or an external Hadoop Ecosystem.
- For Cloud Pak for Data, Watson OpenScale supports Analytics Engine Powered by Apache Spark.
  - For more information, see Preparing the batch processing environment in IBM Analytics Engine Powered by Apache Spark.
  - To use the Watson OpenScale Python SDK to set up your batch processor, see the sample notebook IBM Watson OpenScale and Batch Processing: Apache Spark on Cloud Pak for Data with IBM Analytics Engine.
- For an external Spark engine, you must meet certain requirements.
  - When you choose a custom Spark engine that is external to Cloud Pak for Data, setting up the Spark Manager Application is one of the prerequisites. This application must comply with a specification that is defined by Watson OpenScale and must expose the native APIs to Watson OpenScale so that the batch processor can interact with the external Hadoop Ecosystem.
  - For more information, see Preparing the batch processing environment on the Hadoop Ecosystem.
  - To use the Watson OpenScale Python SDK to set up your batch processor, see the sample notebook IBM Watson OpenScale and Batch Processing: Remote Spark. A minimal SDK sketch follows this list.
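The sample notebooks are the authoritative reference for SDK setup. As a rough, minimal sketch, assuming the ibm-watson-openscale Python SDK v2 on Cloud Pak for Data, registering a Spark engine as an integrated system looks something like the following. All URLs, credentials, and connection fields are placeholders, and the exact parameters for a Spark integrated system vary by release.

```python
# A minimal sketch, assuming the ibm-watson-openscale Python SDK v2.
# All URLs, credentials, and connection fields are placeholders; check
# the sample notebooks for the exact payloads that your release expects.
from ibm_cloud_sdk_core.authenticators import CloudPakForDataAuthenticator
from ibm_watson_openscale import APIClient

authenticator = CloudPakForDataAuthenticator(
    url="https://<cpd-host>",  # placeholder Cloud Pak for Data URL
    username="<username>",
    password="<password>",
    disable_ssl_verification=True,
)
wos_client = APIClient(
    service_url="https://<cpd-host>",  # placeholder
    authenticator=authenticator,
)

# Register the Spark engine as an integrated system so that the batch
# processor can submit jobs to it. The payloads are illustrative only.
spark_engine = wos_client.integrated_systems.add(
    name="Remote Spark engine",
    description="Spark Manager Application endpoint",
    type="spark",
    credentials={"username": "<spark-user>", "password": "<spark-password>"},
    connection={"endpoint": "https://<spark-manager-host>/openscale"},
).result
print(spark_engine)
```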
Scenario
Watson OpenScale supports both batch and online modes. If you use predictive models principally in batch mode on a periodic basis, the following cases are supported:
- The model scores transactions through an online endpoint that receives data periodically.
- Model code is embedded in a data pipeline and scores transactions from data that becomes available to it periodically.
The batch processor might process millions of rows of data and typically uses database tables as inputs and outputs. In either case, data is available to the model periodically as a set of records. Because of the volume of data, Watson OpenScale supports a seven-day monitoring schedule, unlike online monitoring, which runs every hour for the fairness and quality monitors and every three hours for the drift monitor. Metrics are produced only after transactions are scored and logged into the payload database.
You can subscribe to the model as a batch environment even when no endpoint is available. You can also use models that are embedded in an application or data pipeline because these models score data in batch mode.
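As an illustration of this pattern, the following sketch uses PySpark to read a batch of scored transactions from a Hive table and compute a simple aggregate. The table and column names are hypothetical; the actual monitor metrics are computed by the Spark jobs that Watson OpenScale submits.

```python
# A minimal sketch of the batch pattern: read scored records from a Hive
# table and derive a summary. Table and column names are hypothetical;
# Watson OpenScale's own Spark jobs compute the actual monitor metrics.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("scored-batch-example")
    .enableHiveSupport()  # requires a Spark installation configured for Hive
    .getOrCreate()
)

# Scored output lands in a Hive table periodically, for example once a day.
scored = spark.table("payload_db.scored_transactions")  # hypothetical table

# Example aggregate over the latest batch of scored records.
scored.groupBy("prediction").agg(F.count("*").alias("record_count")).show()
```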
Steps
- Configure the Apache Spark engine. Choose whether to set up an IBM Analytics Engine Powered by Apache Spark or a remote Spark engine:
  - To set up an IBM Analytics Engine Powered by Apache Spark, see Preparing the batch processing environment in IBM Analytics Engine Powered by Apache Spark.
  - To set up a remote Spark engine, see Preparing the batch processing environment on the Hadoop Ecosystem.
- In Watson OpenScale, configure the batch processor.
Known limitations
The batch processing solution for Watson OpenScale has the following limitations:
General limitations
- Support for structured data only
- Support for the Production (Custom Machine Learning Provider) environment only
- Supported environment combinations:
  - Remote Spark Cluster + non-Kerberized Hive
  - Remote Spark Cluster + Kerberized Hive
  - Remote Spark Cluster + Db2
  - IBM Analytics Engine (IAE) + non-Kerberized Hive
  - IBM Analytics Engine (IAE) + Db2
- During an evaluation request, you might see an error on the Model Summary window that states, “Evaluation for Quality/Drift monitor didn’t finish within 900 seconds.” Although the error is displayed, the actual monitor evaluation runs to completion. If you encounter this error, navigate back to the Insights dashboard, check whether a quality or drift score is visible in the deployment tile, and then return to the Model Summary window.
Limitations for Analytics Engine Powered by Apache Spark
- You must create a new volume and not use the default volume. For more information, see Accessing data from storage.
- You must install the dependencies by using Python 3.7.x and upload them to the mount path.
Drift
- If your Hive table contains a column that is named `rawPrediction`, whether it is a feature or not, configuring and evaluating the drift monitor fails.
- If your Hive table contains a column that is named `probability` and that column is not configured with the modeling role `probability`, configuring and evaluating the drift monitor fails.
- The drifted transactions analysis notebook can be run only from a local JupyterLab; it does not have a hard dependency on Watson Studio.
- PySpark ML, the framework that is used to build the drift detection model, does not support Boolean fields when the drift model is trained. Any `boolean` columns in the training table must be represented as `string` (see the preparation sketch after this list).
- If the drift detection model was generated by running the configuration notebook against a Hadoop cluster (Cluster A) that is different from the Hadoop cluster (Cluster B) that is used for monitoring, evaluating the drift monitor fails. To correct this problem, perform the following steps (a repacking sketch follows this list):
  - Download the drift archive by using the notebook.
  - Extract the contents of the drift archive to a folder.
  - In a text editor, open the `ddm_properties.json` file.
  - Look for the `drift_model_path` property. It contains the path where the drift model is stored in HDFS on that cluster.
  - Download the folder in the `drift_model_path` to your local workstation.
  - Copy this folder to an HDFS location `/new/path` in your production cluster.
  - Update the `drift_model_path` property in the `ddm_properties.json` file. The new property should look like the following sample: `hdfs://production_cluster_host:port/new/path`
  - Compress the contents of the drift archive folder as a tar.gz file. Do not compress the folder itself, only the contents. All the files must be present at the top level of the archive, not inside a folder.
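Because the `rawPrediction` and `probability` names clash with columns that Spark ML pipelines produce, and because Boolean fields are not supported, you might need to prepare the training table before you configure the drift monitor. The following is a minimal sketch of that preparation; the table and column names are hypothetical.

```python
# A minimal preparation sketch for the drift limitations above.
# Table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
from pyspark.sql.types import BooleanType

spark = SparkSession.builder.enableHiveSupport().getOrCreate()
df = spark.table("training_db.loan_training")  # hypothetical Hive table

# Rename a clashing column so that it does not collide with the
# rawPrediction output column of Spark ML classifiers.
if "rawPrediction" in df.columns:
    df = df.withColumnRenamed("rawPrediction", "raw_prediction_feature")

# Represent Boolean columns as strings before the drift model is trained.
for field in df.schema.fields:
    if isinstance(field.dataType, BooleanType):
        df = df.withColumn(field.name, col(field.name).cast("string"))

# Write the prepared table back for the drift configuration notebook to use.
df.write.mode("overwrite").saveAsTable("training_db.loan_training_prepared")
```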
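If you prefer to script the archive fix, the following sketch updates `drift_model_path` in `ddm_properties.json` and repacks the extracted contents so that every file sits at the top level of the archive. The folder name is a placeholder, and the HDFS path is the placeholder from the sample above.

```python
# A sketch for repairing the drift archive after the model is moved.
# The folder name and HDFS path are placeholders for your environment.
import json
import os
import tarfile

archive_dir = "drift_archive_contents"  # folder with the extracted archive

# Point drift_model_path at the model's new HDFS location.
properties_file = os.path.join(archive_dir, "ddm_properties.json")
with open(properties_file) as f:
    properties = json.load(f)
properties["drift_model_path"] = "hdfs://production_cluster_host:port/new/path"
with open(properties_file, "w") as f:
    json.dump(properties, f)

# Compress the contents, not the folder: all files must be at the top
# level of the archive, not inside a subfolder.
with tarfile.open("drift_archive.tar.gz", "w:gz") as tar:
    for name in os.listdir(archive_dir):
        tar.add(os.path.join(archive_dir, name), arcname=name)
```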
Next steps
- Prepare the batch processing environment in IBM Analytics Engine Powered by Apache Spark.
- Prepare the batch processing environment on the Hadoop Ecosystem.
- Configure the batch processor in Watson OpenScale.
Parent topic: Configure Watson OpenScale