Managing data quality

Measure, monitor, and maintain the quality of your data to ensure the data meets your expectations and standards for specific use cases.

Good-quality data is typically defined as data that is fit for use, free of defects, and meets expectations and requirements. Data quality is measured against the default quality dimensions Accuracy, Completeness, Consistency, Timeliness, Uniqueness, and Validity, and against any custom quality dimensions.
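
As an illustration, the following sketch scores a single column against three of the default dimensions, using simple and common metric definitions. It assumes pandas is available; the data and the exact formulas are examples, not the product's internal computation.

```python
# Minimal sketch: scoring one column against three default dimensions.
import pandas as pd

df = pd.DataFrame({"email": ["a@example.com", None, "not-an-email", "b@example.com"]})

# Completeness: fraction of values that are not null.
completeness = df["email"].notna().mean()

# Validity: fraction of non-null values that match an expected pattern.
validity = df["email"].dropna().str.fullmatch(r"[^@\s]+@[^@\s]+\.[^@\s]+").mean()

# Uniqueness: fraction of non-null values that are distinct.
uniqueness = df["email"].nunique() / df["email"].count()

print(f"completeness={completeness:.2f}, validity={validity:.2f}, uniqueness={uniqueness:.2f}")
```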

Data quality analysis provides answers to these questions:

  • How good is the overall quality of a data asset?
  • Which of your data assets has the best quality?
  • How did the quality of a data asset change over time?
  • Does the data asset meet my quality expectations?

Requirements and restrictions

The following requirements and restrictions apply to data quality management.

Service: The IBM Knowledge Catalog, IBM Knowledge Catalog Standard, and IBM Knowledge Catalog Premium services are not available by default. An administrator must install one of these services. To determine whether a service is installed, open the Services catalog. If the service is installed and ready to use, its tile in the catalog shows Ready to use.

The data quality feature is not available by default. To create and run data quality rules, the feature must be installed with IBM Knowledge Catalog or IBM Knowledge Catalog Premium. To see whether the feature is enabled, check whether the asset types Data quality definition and Data quality rule are available when you add a new asset to a project.

Service: The DataStage Enterprise service is automatically installed when the data quality feature is enabled in IBM Knowledge Catalog or IBM Knowledge Catalog Premium. If you did not purchase a DataStage license, use of DataStage Enterprise is limited to creating, managing, and running data quality rules. For examples of accepted use, see Enabling optional features after installation or upgrade for IBM Knowledge Catalog.

Data quality tools

You work with these tools:

  • Data quality definitions
  • Data quality rules, including SQL-based rules
  • Data quality SLA rules

Data formats

The following data formats are supported:

  • Tables from relational and nonrelational data sources
  • Delta Lake and Iceberg tables from certain file-storage connectors
  • Tabular formats: Avro, CSV, Parquet, and ORC. For data assets that are uploaded from the local file system, only CSV is supported.

For information about supported connectors, see Supported data sources for curation and data quality.
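
If you want to verify locally that a file parses as one of the supported tabular formats before you add it as a data asset, a quick check like the following can help. This is only a sketch: it assumes pandas (with pyarrow for Parquet) is installed, and the file names are placeholders.

```python
# Quick local sanity check that files parse as supported tabular formats.
import pandas as pd

df_csv = pd.read_csv("customers.csv")            # CSV: the only format for local uploads
df_parquet = pd.read_parquet("orders.parquet")   # Parquet, via a file-storage connector

print(df_csv.shape, df_parquet.shape)
```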

Data size

Data quality management tasks can be performed on data of any size.

Required permissions

Your roles determine which data quality management tasks you can perform:

  • To view data quality definitions and rules, you must have at least the Viewer role in the project.
  • To create, edit, or delete data quality definitions and rules, you must have the Manage data quality assets user permission and the Admin or the Editor role in the project.
  • To run data quality rules, you must have the Admin or the Editor role in the project. To run rules from the Assets page, from within a data quality rule asset, or through the API (see the sketch after this list), you also need the Measure data quality user permission. This permission is not required to run a data quality rule job from the Jobs page or from the associated DataStage flow.
  • To view the data that caused data quality issues (the output table) from the rule's run history or from the Data quality page, you must have the Drill down to issue details user permission. However, the data asset that is created in the project for the output table is accessible to anyone who can access the connection. To limit access to this data asset, set up the connection to the data source where the output table is stored with personal credentials.
  • To create, edit, or delete data quality SLA rules, you must have these user permissions:
    • Access governance artifacts
    • Manage data quality SLA rules
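
As an example of a rule run through the API, the following sketch posts a run request with a bearer token. The host, path, and parameters are hypothetical placeholders, not a documented endpoint; check the API reference for your release for the actual route and authentication.

```python
# Hypothetical sketch of starting a rule run over REST; the endpoint
# path and parameters are placeholders, not a documented API.
import requests

HOST = "https://cpd.example.com"   # your platform URL (placeholder)
TOKEN = "..."                      # bearer token from your platform login
PROJECT_ID = "..."                 # project that contains the rule
RULE_ID = "..."                    # data quality rule asset ID

resp = requests.post(
    f"{HOST}/data_quality/rules/{RULE_ID}/execute",  # hypothetical path
    headers={"Authorization": f"Bearer {TOKEN}"},
    params={"project_id": PROJECT_ID},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())
```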

Workspaces

You can perform data quality management tasks in projects. Read-only data quality information is available in catalogs.

Data quality analysis and monitoring

Use data quality analysis and monitoring to evaluate data against specific criteria. Apply the same evaluation criteria repeatedly over time to track important changes in the quality of the data that is being validated.
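
For example, tracking a rule's pass rate across repeated runs makes quality changes visible. The following sketch uses made-up history values and an assumed acceptance threshold of 95%.

```python
# Minimal sketch: spotting quality changes across repeated rule runs.
history = [
    ("2024-01-01", 0.98),
    ("2024-02-01", 0.97),
    ("2024-03-01", 0.89),  # a drop worth investigating
]

THRESHOLD = 0.95  # assumed acceptance threshold
for run_date, pass_rate in history:
    status = "OK" if pass_rate >= THRESHOLD else "DEGRADED"
    print(f"{run_date}: pass rate {pass_rate:.0%} -> {status}")
```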

After a data quality check is designed, you have these options:

  • Create a data quality definition that defines the logic of the data check irrespective of the data source. The definition contains logical variables or references that you link or bind to actual data (for example, a data source, a table and column, or joined tables) when you create an executable data quality rule (see the conceptual sketch after this list).

    After you create a data quality rule with the required bindings based on a selected data quality definition, that rule can be run. The rule produces relevant statistics and can generate an output table, depending on the rule configuration.

  • Create an SQL-based data quality rule.
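
The following conceptual sketch shows how a definition's logical variables relate to a rule's bindings. The classes and the rule-logic string are illustrative, not the product's API or rule syntax.

```python
# Conceptual sketch: a definition holds reusable logic with logical
# variables; a rule binds those variables to actual data.
from dataclasses import dataclass

@dataclass
class DataQualityDefinition:
    name: str
    logic: str   # references logical variables, not concrete columns

@dataclass
class DataQualityRule:
    definition: DataQualityDefinition
    bindings: dict  # logical variable -> actual table column

definition = DataQualityDefinition(
    name="value_is_present",
    logic="col exists and col is not null",  # illustrative syntax
)

# Binding 'col' to a concrete column makes the definition executable.
rule = DataQualityRule(
    definition=definition,
    bindings={"col": "SALESDB.CUSTOMERS.EMAIL"},  # placeholder column
)
print(rule)
```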

The functionality of a data quality rule can range from a simple single-column test to evaluating multiple columns within and across data sources.
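
As an illustration of a cross-table check, the following sketch runs an SQL referential-integrity test with SQLite. The tables and the query are hypothetical; a real SQL-based rule is authored in the product, and this only shows the shape of such a check.

```python
# Illustrative cross-table check: orders whose customer_id has no
# matching row in the customers table fail the rule.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE customers (customer_id INTEGER);
    INSERT INTO orders VALUES (1, 10), (2, 99);
    INSERT INTO customers VALUES (10);
""")

failing_rows = conn.execute("""
    SELECT o.order_id, o.customer_id
    FROM orders o
    LEFT JOIN customers c ON o.customer_id = c.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()

print(failing_rows)  # [(2, 99)]
```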

Assessing data quality

To determine whether your data is of good quality, check to what extent the data meets your expectations, and identify anomalies in the data. Evaluating your data for quality also helps you understand the structure and content of your data.

Monitoring data quality

To ensure that important data meets your organization's quality expectations, implement data quality SLA rules that monitor your data for compliance with the standards and can provide for remediation of detected data quality issues.
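
Conceptually, an SLA check compares a measured quality score with an agreed threshold and triggers follow-up on a breach. The sketch below is illustrative; the function, the threshold, and the notification step are assumptions, not the product's behavior.

```python
# Conceptual sketch of an SLA-style compliance check.
def check_sla(asset_name: str, quality_score: float, sla_threshold: float = 0.95) -> bool:
    compliant = quality_score >= sla_threshold
    if not compliant:
        # Hypothetical remediation hook, e.g., notify the data owner.
        print(f"SLA breach on {asset_name}: {quality_score:.0%} < {sla_threshold:.0%}")
    return compliant

check_sla("SALES.CUSTOMERS", quality_score=0.91)
```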

Learn more

Parent topic: Preparing data