On the importance of data quality for machine learning

Data Quality for AI from IBM Research

This Data Quality for AI (DQAI) framework of services gives model developers and data scientists the tools to implement a formalized, systematic program of data preparation, the preliminary and most time-consuming step of the model development lifecycle. The framework is appropriate for data being readied for supervised classification or regression tasks. It includes the necessary software to:

— implement quality checks,
— execute remediation,
— generate audit reports,
— automate all the above.

While pipelining of tasks is essential for scalability and repeatability, the included capabilities can also be used for custom data exploration and human-guided improvement of models. The included services can be productive at any stage of the model development lifecycle, but the offering is designed to be especially valuable early on, in the data preparation stage.

In addition to all that can be accomplished on original data sources, there are methods that, starting from an input dataset, can help synthesize new data -- either for supplementation or for replacement -- by learning constraints in the original data or having them specified by a developer. This can be helpful when regulatory or contractual issues prohibit direct usage of data in a modeling effort, when it is desirable to explore datasets with different constraints, or when more data is needed for training.

This offering is appropriate for both tabular and time series data, and support for new modalities is being developed.


What utilities are included?

A sample of the included utilities

Many of the quality checks have corresponding remediation procedures.

Data Validation

— Label Purity Check

— Data Homogeneity Check

— Class Parity Check

— Completeness Check

— Outlier Detection Check

— Feature Correlation Check

— Data Bias Check

— Feature Redundancy Check

— and many more
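
To make two of the checks above more concrete, here is a minimal sketch of what a completeness check and a class parity check might compute. This is not the DQAI API: the function names, the scoring formulas, and the sample rows are all assumptions made for illustration, using only the Python standard library.

```python
from collections import Counter


def completeness(rows, columns):
    """Return the fraction of non-missing values per column.

    Hypothetical helper, not part of DQAI. None is treated as missing.
    """
    if not rows:
        raise ValueError("completeness needs at least one row")
    total = len(rows)
    return {
        col: sum(1 for row in rows if row.get(col) is not None) / total
        for col in columns
    }


def class_parity(rows, label):
    """Return smallest class count divided by largest class count.

    Hypothetical helper, not part of DQAI. 1.0 means perfectly balanced
    classes; values near 0 mean one class dominates.
    """
    counts = Counter(row[label] for row in rows)
    return min(counts.values()) / max(counts.values())


# Illustrative toy data (made up for this sketch).
rows = [
    {"age": 34, "income": 52000, "label": "yes"},
    {"age": None, "income": 48000, "label": "no"},
    {"age": 29, "income": None, "label": "no"},
    {"age": 41, "income": 61000, "label": "no"},
]

completeness_scores = completeness(rows, ["age", "income", "label"])
parity_score = class_parity(rows, "label")
```

Scores like these could feed an audit report or gate a pipeline step; where to set the pass/fail threshold is a project-specific choice.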

Data Remediation

— Purity Remediation

— Inhomogeneity Remediation

— Class Disparity Remediation

— Incomplete Remediation

— Outlier Removal

— Feature Correlation Removal

— Data Bias Removal

— Feature Redundancy Removal

— and many more
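
As a rough illustration of how a check can pair with a remediation, the sketch below fills missing values with the column median and then drops outliers using the common 1.5 x IQR rule. These are generic textbook techniques chosen for the example, not a description of how DQAI remediates data; the function names and sample values are assumptions.

```python
from statistics import median, quantiles


def impute_median(values):
    """Replace None entries with the median of the observed values.

    Hypothetical helper, not part of DQAI.
    """
    present = [v for v in values if v is not None]
    fill = median(present)
    return [fill if v is None else v for v in values]


def remove_outliers_iqr(values, k=1.5):
    """Keep only values within k * IQR of the first and third quartiles.

    Hypothetical helper, not part of DQAI.
    """
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Illustrative toy column (made up for this sketch).
incomes = [48000, 52000, None, 61000, 50000, 950000]

filled = impute_median(incomes)        # the None becomes the median
cleaned = remove_outliers_iqr(filled)  # the extreme value is dropped
```

Imputing before outlier removal is one possible ordering; with very small samples a single extreme value can inflate the quartiles, so the threshold `k` may need tuning.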

After trying the free trial API version, contact IBM Research Business Development at resai@us.ibm.com about a license to the complete library.