Drift in data
In addition to checking for drift in model accuracy, the drift monitor can detect drift in data. Data drift occurs when the data that the model receives deviates from what is standard, normal, or expected. Watson OpenScale detects data drift so that you can make changes to the model.
Understanding drift detection
Drift is the degradation of predictive performance over time because of hidden context. As your data changes over time, the ability of your model to make accurate predictions may deteriorate. Watson OpenScale both detects and highlights drift so that you can take corrective action.
How it works
Watson OpenScale analyzes all transactions to find the ones that contribute to drift. It then groups the records based on the similarity of data inconsistency patterns that were significant in contributing to drift.
Data drift constraint specification
The constraints schema describes the statistics of training data as a set of single-column and two-column data boundaries. These statistics identify input data outliers to a machine learning model at runtime. Single-column constraints deal with each column individually, while two-column constraints assume that a relationship might exist between any two columns in the training data.
Constraints JSON Object
The constraints schema itself is specified as a JSON object with two array fields that describe all the columns and the constraints in the training data. The JSON object takes the following format:
{
  "columns": [],
  "constraints": []
}
Column statistics
Each element in the columns array describes the standard statistical properties of a column.
The data type of the column is indicated by the dtype variable, which takes one of the following values:
categorical
numeric_discrete
numeric_continuous
A numeric column is described with its standard numerical bounds, such as minimum, maximum, mean, standard deviation, and its first, second, and third quartile values.
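As a rough illustration of the numeric statistics listed above, the following sketch summarizes one column with the Python standard library. The key names (name, min, max, mean, std, percentiles) and the helper describe_numeric_column are assumptions made for this example; the actual field names in the constraints schema are not given here.

```python
import statistics


def describe_numeric_column(name, values, dtype="numeric_continuous"):
    """Summarize a numeric column with the bounds named in the text.

    The key names used below are illustrative assumptions, not the
    documented Watson OpenScale schema.
    """
    # Quartiles: first, second (median), and third.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "name": name,
        "dtype": dtype,
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "std": statistics.pstdev(values),  # population standard deviation
        "percentiles": [q1, q2, q3],
    }


ages = [20, 30, 40, 50, 60]
age_stats = describe_numeric_column("age", ages)
```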
Common attributes of all constraints:
- The name field identifies the concrete type of a constraint. The value of the name field can be one of the following items:
  - categorical_distribution_constraint
  - numeric_range_constraint
  - numeric_distribution_constraint
  - catnum_range_constraint
  - catnum_distribution_constraint
  - catcat_distribution_constraint
- The id field is an internal field that identifies each constraint uniquely. Its value is a UUID.
- The kind field identifies whether the constraint is a single-column or two-column constraint. Allowed values are single_column and two_column.
- The columns field is an array of column names. If the constraint deals with a single column, the array contains a single element whose value is the name of the column. If the constraint deals with two columns, the array contains the names of the two columns.
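The common attributes can be pictured with a small sketch that assembles one constraint entry and checks it against the rules described above. The helper make_constraint is hypothetical and written only for this example; it is not part of any Watson OpenScale SDK, and a real constraint carries additional type-specific attributes.

```python
import json
import uuid

ALLOWED_NAMES = {
    "categorical_distribution_constraint",
    "numeric_range_constraint",
    "numeric_distribution_constraint",
    "catnum_range_constraint",
    "catnum_distribution_constraint",
    "catcat_distribution_constraint",
}
ALLOWED_KINDS = {"single_column", "two_column"}


def make_constraint(name, kind, columns):
    """Build the common part of a constraint (hypothetical helper)."""
    if name not in ALLOWED_NAMES:
        raise ValueError(f"unknown constraint name: {name}")
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown constraint kind: {kind}")
    expected = 1 if kind == "single_column" else 2
    if len(columns) != expected:
        raise ValueError(f"{kind} expects {expected} column name(s)")
    return {
        "name": name,
        "id": str(uuid.uuid4()),  # unique identifier for this constraint
        "kind": kind,
        "columns": list(columns),
    }


constraint = make_constraint("numeric_range_constraint", "single_column", ["age"])
schema = {"columns": [], "constraints": [constraint]}
print(json.dumps(schema, indent=2))
```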
Single-feature constraints
Single-feature constraints, also known as single-column constraints, are described in the following list, together with the cases in which a given constraint is not generated:
- The categorical_distribution_constraint has a frequency_distribution attribute that holds the frequency counts of each categorical value of the specified categorical column. If a numeric column can be fitted to a distribution and is not discrete or sparse, Watson OpenScale generates both the range and the distribution constraints.
- The numeric_range_constraint has a ranges attribute that holds the high-density regions of the numeric column. Numeric ranges that rarely occur in the training data are not included. If a numeric column is sparse or discrete, Watson OpenScale generates the frequency distribution constraint, but not the regular distribution and range constraints.
- The numeric_distribution_constraint has a distribution attribute that indicates whether the specified numeric column follows a uniform, beta, exponential, or normal distribution. The allowed values of the name of the distribution are beta, uniform, expon, and norm. The distribution.parameters attribute has all the parameters of the corresponding distribution. Refer to the documentation of scipy.stats.beta, scipy.stats.uniform, scipy.stats.expon, and scipy.stats.norm for the details of each parameter. If the data in a numeric column cannot be fitted to a distribution, as determined by the computed p-value, Watson OpenScale computes only the range constraint.
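To make the single-column checks more concrete, here is a minimal sketch of how a frequency distribution and a set of numeric ranges could be used to flag outliers. It assumes that ranges is a list of objects with min and max keys and that an unseen categorical value counts as a violation; neither detail is confirmed by the description above, and the function names are invented for the example. Distribution fitting is left out.

```python
from collections import Counter


def frequency_distribution(values):
    """Count how often each categorical value occurs in a training column."""
    return dict(Counter(values))


def violates_categorical(value, freq):
    """Assumed rule: a value never seen in training is an outlier."""
    return value not in freq


def violates_numeric_range(value, ranges):
    """True when the value falls outside every high-density range."""
    return not any(r["min"] <= value <= r["max"] for r in ranges)


freq = frequency_distribution(["red", "blue", "red", "green"])
# Assumed layout of the ranges attribute: one min/max pair per dense region.
ranges = [{"min": 18, "max": 35}, {"min": 50, "max": 65}]

print(violates_categorical("purple", freq))  # value absent from training data
print(violates_numeric_range(42, ranges))    # value between the two regions
```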
Double-feature constraints
Double-feature constraints, also known as two-column constraints, are described in the following list:
- The catnum_range_constraint has the source categorical column specified as source_column and the target numeric column specified as target_column. The ranges attribute contains the range of numeric values that can occur for a given value of the categorical column. Categories that occur very rarely are dropped. The numeric range includes only the minimum and maximum values and the number of rows in the training data with the corresponding categorical value.
- The catnum_distribution_constraint has the source categorical column specified as source_column and the target numeric column specified as target_column. The distribution attribute contains the distribution of numeric values for a given value of the categorical column. Refer to the numeric_distribution_constraint description above for more details on the distribution parameters.
- The catcat_distribution_constraint has the source categorical column specified as source_column and the target categorical column specified as target_column. The rare_combinations attribute contains all pairs of source and target column values that rarely occur together in the training data.
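The two-column statistics can be sketched in a few lines of standard-library Python. The thresholds min_count and max_count, the output layout, and the function names are assumptions chosen for illustration; the text above does not say how "rarely" is defined. Note that this sketch lists only pairs that were observed at least once.

```python
from collections import Counter, defaultdict


def catnum_ranges(rows, source_column, target_column, min_count=2):
    """Per-category min, max, and row count; rare categories are dropped."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[source_column]].append(row[target_column])
    return {
        category: {"min": min(vals), "max": max(vals), "count": len(vals)}
        for category, vals in grouped.items()
        if len(vals) >= min_count  # assumed rarity threshold
    }


def rare_combinations(rows, source_column, target_column, max_count=1):
    """Pairs of categorical values seen together at most max_count times."""
    pairs = Counter((row[source_column], row[target_column]) for row in rows)
    return sorted(pair for pair, n in pairs.items() if n <= max_count)


rows = [
    {"job": "nurse", "shift": "night", "salary": 52000},
    {"job": "nurse", "shift": "night", "salary": 58000},
    {"job": "nurse", "shift": "day", "salary": 55000},
    {"job": "pilot", "shift": "day", "salary": 120000},
]
salary_ranges = catnum_ranges(rows, "job", "salary")
rare_pairs = rare_combinations(rows, "job", "shift")
```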
Working with large datasets
For data drift to be calculated successfully, very large datasets that consist of more than one thousand columns (1,012) must be broken up. Split the dataset into multiple datasets, each with a subset of the columns, and then generate constraints for each one.
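Splitting the columns can be as simple as the following sketch. The function split_columns and the limit of 1000 columns per subset are illustrative choices only; pick a limit that fits the threshold stated above.

```python
def split_columns(column_names, max_columns):
    """Split a list of column names into consecutive subsets."""
    if max_columns < 1:
        raise ValueError("max_columns must be at least 1")
    return [
        column_names[start:start + max_columns]
        for start in range(0, len(column_names), max_columns)
    ]


columns = [f"feature_{i}" for i in range(2500)]
subsets = split_columns(columns, max_columns=1000)  # illustrative limit
print([len(subset) for subset in subsets])
```

Each subset then identifies the columns of one smaller dataset for which constraints are generated.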
For datasets with a large number of columns that use one-hot encoding, it is suggested that you write a wrapper on top of the model and provide Watson OpenScale with a REST API for the scoring endpoint. In this way, Watson OpenScale can accept data that is not one-hot encoded, both at training time and when payload data is added.
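The wrapper idea can be sketched as follows, under the assumption that the model expects one indicator field per category value. The names one_hot, make_scoring_wrapper, and fake_model are invented for this example, and exposing the wrapper as a REST endpoint is not shown.

```python
def one_hot(record, categories):
    """Expand each categorical field into 0/1 indicator fields."""
    encoded = {}
    for field, value in record.items():
        if field in categories:
            for category in categories[field]:
                encoded[f"{field}_{category}"] = 1 if value == category else 0
        else:
            encoded[field] = value
    return encoded


def make_scoring_wrapper(model_predict, categories):
    """Return a scoring function that accepts raw (not encoded) records."""
    def score(records):
        return [model_predict(one_hot(record, categories)) for record in records]
    return score


def fake_model(row):
    """Stand-in for a real model that expects one-hot encoded input."""
    return "high" if row["color_red"] == 1 and row["age"] > 25 else "low"


categories = {"color": ["red", "blue"]}
score = make_scoring_wrapper(fake_model, categories)
predictions = score([{"color": "red", "age": 30}, {"color": "blue", "age": 30}])
```

With this arrangement, the monitoring side sees a single color column instead of one column per category value.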
Do the math
Watson OpenScale analyzes each transaction for data inconsistency by comparing the transaction content with the training data patterns. If a transaction violates one or more of the training data patterns, the transaction is marked as drifted. Watson OpenScale then estimates the magnitude of data inconsistency as the ratio of drifted transactions to the total number of transactions analyzed. Next, Watson OpenScale analyzes all the drifted transactions and groups those that violate similar training data patterns into clusters. In each cluster, Watson OpenScale also estimates the important features that played a major role in the data inconsistency and classifies their feature impact as large, some, or small.
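The calculation can be approximated with the sketch below. It assumes that each transaction has already been reduced to the set of constraints it violates, and it groups transactions only when they violate exactly the same set, which is a simplification of the similarity-based clustering described above. The function name and the constraint labels are hypothetical.

```python
from collections import defaultdict


def drift_summary(violations_per_transaction):
    """Return the drift magnitude and clusters of drifted transactions.

    violations_per_transaction: one set of violated constraint labels per
    transaction; an empty set means that the transaction did not drift.
    """
    total = len(violations_per_transaction)
    clusters = defaultdict(list)
    for index, violated in enumerate(violations_per_transaction):
        if violated:
            # Simplification: group by the exact set of violated patterns.
            clusters[frozenset(violated)].append(index)
    drifted = sum(len(members) for members in clusters.values())
    magnitude = drifted / total if total else 0.0
    return magnitude, dict(clusters)


transactions = [
    set(),
    {"age_range"},
    {"age_range"},
    {"age_range", "job_salary_range"},
    set(),
]
magnitude, clusters = drift_summary(transactions)
print(magnitude)  # 3 drifted transactions out of 5
```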
Next steps
- For information on how to set up drift detection, see Configuring the drift detection monitor.
- To mitigate drift after Watson OpenScale detects it, you must build a new version of the model that fixes the problem. A good place to start is with the data points that are highlighted as reasons for the drift. After you manually label the drifted transactions, introduce the new data to the predictive model and use it to retrain the model.