Data Quality for AI
Review, Remediate, Refactor, Replenish your data for faster, more productive model development
"My view is that if 80 percent of our work is data preparation, then ensuring data quality is the important work of a machine learning team."
Andrew Ng, Professor of AI at Stanford University and founder of DeepLearning.AI, March 2021. https://www.deeplearning.ai/the-batch/issue-84/
Data Quality for AI from IBM Research

This Data Quality for AI (or DQAI, for short) framework of services provides the tools model developers and data scientists need to implement a formalized and systematic program of data preparation, the preliminary and most time-consuming step of the model development lifecycle. The framework is appropriate for data being readied for supervised classification or regression tasks. It includes the necessary software to:

— implement quality checks,
— execute remediation,
— generate audit reports,
— automate all the above.

While pipelining of tasks is essential for scalability and repeatability, the included capabilities can also be used for custom data exploration and human-guided improvement of models. Although the services can be productive at any stage of the model development lifecycle, the offering is designed to be especially valuable early on, in the data preparation stage.

In addition to all that can be accomplished on original data sources, there are methods that, starting from an input dataset, can help synthesize new data, either to supplement or to replace the original, by learning constraints in the original data or having them specified by a developer. This can be helpful when regulatory or contractual issues prohibit direct usage of data in a modeling effort, when it is desirable to explore datasets with different constraints, or when more data is needed for training.

This offering is appropriate for use on both tabular and time series data, with support for additional modalities under development.

Capabilities
Data Validation

Compute quality scores and insights into those scores, pinpointing the specific regions of data responsible for lowering a score and recommending how those regions can be improved.
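As a minimal sketch of what a validator might look like, the hypothetical `completeness_check` below (not the actual DQAI API) scores a table by the fraction of fully populated rows and flags the rows that lower the score:

```python
# Hypothetical sketch of a completeness check: the score is the fraction of
# fully populated rows, and rows containing missing values are flagged so a
# remediation step can target them.
def completeness_check(rows):
    flagged = [i for i, row in enumerate(rows) if any(v is None for v in row)]
    score = 1 - len(flagged) / len(rows)
    return {"score": score, "flagged_rows": flagged}

data = [
    [1.0, "a"],
    [2.0, None],   # incomplete row
    [3.0, "c"],
    [None, "d"],   # incomplete row
]
result = completeness_check(data)
# result["score"] == 0.5 and result["flagged_rows"] == [1, 3]
```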


Data Remediation

Execute the recommendations provided by the quality analysis methods. The toolkit supports a variety of data types, including tabular and time series data.
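A remediator acts on what a validator flagged. The sketch below (an illustrative stand-in, not the DQAI implementation) remediates incomplete rows by mean imputation over the observed values in each column:

```python
def impute_missing(rows):
    # Remediate incomplete rows by replacing each None with the mean of the
    # observed values in that column.
    cols = list(zip(*rows))
    means = []
    for col in cols:
        observed = [v for v in col if v is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]

data = [[1.0, 10.0], [None, 20.0], [3.0, None]]
fixed = impute_missing(data)
# fixed == [[1.0, 10.0], [2.0, 20.0], [3.0, 15.0]]
```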


Data Constraints

The system can learn characteristics of the data (e.g., bounds, gaps), or the user can specify them directly.
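The simplest learnable constraint is a per-column range. This hypothetical sketch learns min/max bounds from data and then checks new rows against them; a user could just as easily supply the bounds directly:

```python
def learn_bounds(rows):
    # Learn simple per-column constraints (min/max bounds) from the data.
    cols = list(zip(*rows))
    return [(min(c), max(c)) for c in cols]

def violates(row, bounds):
    # A row violates the constraints if any value falls outside its
    # column's learned (lo, hi) range.
    return any(not (lo <= v <= hi) for v, (lo, hi) in zip(row, bounds))

bounds = learn_bounds([[1, 100], [5, 250], [3, 180]])
# bounds == [(1, 5), (100, 250)]
# violates([7, 120], bounds) is True: 7 is outside the learned range 1..5
# violates([3, 150], bounds) is False
```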


Data Synthesis

Generate a new dataset that preserves the characteristics and distributions of the original.
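To illustrate constraint-respecting synthesis in the simplest possible form, the sketch below samples uniformly within per-column bounds. A real synthesizer would also match the original distributions; this hypothetical helper only honors the range constraints:

```python
import random

def synthesize(bounds, n, seed=0):
    # Generate n synthetic rows whose values fall within per-column
    # (lo, hi) bounds. Uniform sampling is the simplest stand-in for a
    # distribution-matching synthesizer.
    rng = random.Random(seed)
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]

bounds = [(0.0, 1.0), (10.0, 20.0)]
rows = synthesize(bounds, n=5)
# every generated value falls within its column's bounds
```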


Pipelining

Combine validators and remediators together with constraints to address a use case or application workflow; the pipeline outputs an overall data quality score.
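A pipeline in this spirit can be sketched as a set of named validators whose scores are combined into one overall figure. The names and the mean-of-scores aggregation here are illustrative assumptions, not the DQAI scoring rule:

```python
def run_pipeline(rows, checks):
    # Run each validator and combine their per-check scores into one
    # overall data quality score (here, simply the mean).
    results = {name: check(rows) for name, check in checks.items()}
    overall = sum(results.values()) / len(results)
    return overall, results

def completeness(rows):
    # Fraction of rows with no missing values.
    return sum(all(v is not None for v in r) for r in rows) / len(rows)

def no_outliers(rows):
    # Toy outlier check: fraction of values below a fixed threshold.
    vals = [v for r in rows for v in r if v is not None]
    return sum(v < 100 for v in vals) / len(vals)

overall, per_check = run_pipeline(
    [[1, 2], [None, 3], [4, 500]],
    {"completeness": completeness, "no_outliers": no_outliers},
)
# per_check["completeness"] == 2/3, per_check["no_outliers"] == 0.8
```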


Reporting

Automated documentation of changes, recording the data transformations applied and the resulting deltas in quality metrics.
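One record in such an audit report would pair a transformation with the quality-metric change it produced. The record shape below is a hypothetical sketch of that idea:

```python
def audit_entry(step_name, score_before, score_after):
    # One audit-report record: the transformation applied and the
    # change (delta) in the quality metric it produced.
    return {
        "transformation": step_name,
        "score_before": score_before,
        "score_after": score_after,
        "delta": round(score_after - score_before, 6),
    }

entry = audit_entry("impute_missing", 0.50, 0.85)
# entry["delta"] == 0.35
```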

What benefits can I realize in my modeling operations?
Comprehensive, compatible tooling

Data Quality for AI serves as a single, compatible source for many publicly available algorithms as well as novel methods developed exclusively by IBM Research.

Savings in Time and Cost

Reduce the time to value for a modeling effort by reducing the number of attempted experiments and regressions realized in downstream tasks.

Formalized and simplified operations

Lower the barrier to adoption of AI across the enterprise by providing tooling to formalize and simplify the process of data preparation.

Team standardization and coordination

Cross-cutting improvements in operational efficiency and productivity for the following defined roles: AI Steward, Data Scientist, Subject Matter Expert, AI Risk Officer, Business User.

A sample of the included utilities
Data Validation

— Label Purity Check
— Data Homogeneity Check
— Class Parity Check
— Completeness Check
— Outlier Detection Check
— Feature Correlation Check
— Data Bias Check
— Feature Redundancy Check
— and many more

Data Remediation

— Purity Remediation
— Inhomogeneity Remediation
— Class Disparity Remediation
— Incomplete Remediation
— Outlier Removal
— Feature Correlation Removal
— Data Bias Removal
— Feature Redundancy Removal
— and many more