Batch processing

Watson OpenScale can process millions of transactions without bringing the data into the Watson OpenScale data mart. To enable this feature, configure Watson OpenScale to work in batch mode by connecting a custom Watson OpenScale machine learning engine to an Apache Hive database and an Apache Spark analytics engine. Unlike online scoring, where scoring is done in real time and the payload data can be logged into the Watson OpenScale data mart, batch processing is done asynchronously. The batch processor reads the scored data and derives various model metrics.
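The idea of deriving metrics from already-scored records, in the place where the data lives, can be illustrated with a minimal sketch. This is not the Watson OpenScale API or its Spark job: the standard-library sqlite3 module stands in for the Hive database, and the table name scored_payload, its columns, and the quality_metrics function are all hypothetical names chosen for illustration.

```python
import sqlite3

def quality_metrics(conn, table="scored_payload"):
    """Derive a record count and accuracy from a table of already-scored rows.

    The aggregation runs inside the database, so individual rows are never
    copied out; only the two aggregate values are returned.
    """
    # Table and column names are hypothetical stand-ins for a scored-payload table.
    total, correct = conn.execute(
        f"SELECT COUNT(*), "
        f"SUM(CASE WHEN prediction = label THEN 1 ELSE 0 END) FROM {table}"
    ).fetchone()
    correct = correct or 0  # SUM returns NULL on an empty table
    accuracy = correct / total if total else None
    return {"records": total, "accuracy": accuracy}

# In-memory database standing in for the external store (assumption: not Hive).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scored_payload (id INTEGER, prediction TEXT, label TEXT)")
conn.executemany(
    "INSERT INTO scored_payload VALUES (?, ?, ?)",
    [
        (1, "Risk", "Risk"),
        (2, "No Risk", "No Risk"),
        (3, "Risk", "No Risk"),
        (4, "No Risk", "No Risk"),
    ],
)

metrics = quality_metrics(conn)
print(metrics)
```

In this sketch, scoring has already happened before the metric is computed, which mirrors the asynchronous pattern described above: the model writes its predictions to a table, and a separate job reads that table later.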

Connecting your Spark instance

Because of the size of the data sets to be processed, Watson OpenScale monitors batch models by using Spark jobs. You can choose a Spark engine that is part of the Cloud Pak for Data environment or one that runs in an external Hadoop ecosystem.

Scenario

Watson OpenScale supports both batch and online modes. If you use predictive models principally in batch mode on a periodic basis, the following cases are supported:

The batch processor might process millions of lines of data and typically uses database tables as inputs and outputs. Data is available to the model (irrespective of the preceding cases) periodically as a set of records. Because of the volume of data, Watson OpenScale supports a seven-day monitoring schedule for batch processing. By contrast, online monitoring runs every hour for the fairness and quality monitors and every three hours for the drift monitor. Metrics are produced only after transactions are scored and logged into the payload database.
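To make the difference between the schedules concrete, the following short sketch counts how many evaluation runs each schedule produces in one week. The intervals are the ones stated above; the dictionary keys and variable names are illustrative only and do not correspond to any Watson OpenScale setting.

```python
from datetime import timedelta

WEEK = timedelta(days=7)

# Evaluation intervals as described in the text; the labels are illustrative.
INTERVALS = {
    "online fairness": timedelta(hours=1),
    "online quality": timedelta(hours=1),
    "online drift": timedelta(hours=3),
    "batch": timedelta(days=7),
}

# Floor division of two timedeltas yields a whole number of runs per week.
runs_per_week = {name: WEEK // interval for name, interval in INTERVALS.items()}

for name, runs in runs_per_week.items():
    print(f"{name}: {runs} run(s) per week")
```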

You can subscribe to the model as a batch environment, even when no endpoint is available. You can also use models that are embedded in an application or in data pipeline code, because these models score data in batch mode.

Steps

  1. Configure the Apache Spark engine. Choose whether to set up an IBM Analytics Engine Powered by Apache Spark or a remote Spark engine:

  2. In Watson OpenScale, configure the batch processor.

Known limitations

The batch processing solution for Watson OpenScale has the following limitations:

General limitations

Limitations for Analytics Engine Powered by Apache Spark

Drift

Next steps
