Available types of data quality checks

To identify quality problems with your data, you can automatically identify and run applicable data quality checks as part of metadata enrichment.

Service One of the IBM Knowledge Catalog editions must be installed. The data quality feature must be enabled.

The following types of data quality checks can be automatically created in metadata enrichment based on profiling results, assigned business terms, or existing key relationships. You can also create any of those checks manually for your data. The checks can then be run as a separate metadata enrichment step.

When these checks are run, they validate individual values (value-based checks) within a column, then generate individual quality scores and contribute to the overall quality score of a data asset or a column. Also, each data quality check is associated with a data quality dimension.

The results of these checks are shown as part of the data quality information for a data asset or a column. For each column with findings, more details about the check are provided. If data quality issues are recorded in an output table, users with the Drill down into quality issues permission have access to the records with findings.

A project administrator can restrict the origin and the types of data quality checks that can be automatically created in the project settings for metadata enrichment.

Only one check per type can be active. The priority order is as follows:

  • User-defined check
  • Check generated from a business term constraint
  • Check generated from profiling results

Capitalization style check

This check examines whether the capitalization of values in a column is consistent. In columns of data type String, values can have any case, title case, sentence case, or can be uppercase or lowercase.

For checks that are generated based on profiling results, values are checked against the inferred capitalization style. For checks that are generated based on business term constraints or that are manually added, values are checked against the prescribed capitalization style.

Origins

  • Automatically generated based on profiling results

    More than 90% of the values must have the same specific capitalization style. If no predominant capitalization style is detected, the check is not generated.

  • Automatically generated based on a business term constraint

  • Manually added on a column

Options

You can set a specific capitalization style when you define a data quality check on a business term, add a check manually on a column, or edit an existing check. Choose from these options:

  • Lowercase
  • Sentence case
  • Title case
  • Uppercase

Dimension

Consistency

Completeness check

This check looks for missing values in columns. The null threshold determines the allowed percentage of null or empty values. This threshold is set in the metadata enrichment settings. The default setting is 5%, which means that missing values in a column are allowed in 5% or less of the rows.

Origins

  • Automatically generated based on profiling results

    A column must contain null and empty values. The percentage of such values must exceed the threshold.

  • Automatically generated based on a business term constraint

  • Automatically generated based on existing key relationships

    In this case, the check is generated for the primary-key column of the relationship.

  • Manually added on a column

Options

None.

Dimension

Completeness

Data class check

A data class is the kind of data that is detected for a particular column. Examples of data class might include postal code, country, or credit card number. This check counts the number of values in a column that do not match the inferred or selected data class of that column. Each value that violates the class is identified. The quality score is based on the percentage of values identified subtracted from a percentage of 100.

Only data classes that work at value level can be used to identify violations. Such data classes work with the matching methods Match to a list of valid values, Match to reference data, Match to criteria in a regular expression, and, depending on the logic specified in the Java class, Match to criteria in deployed Java class. Data classes that work at column level only compute a confidence that the column as a whole is of a given class, but cannot test any individual value. Data classes with the scope column are ignored by this check.

For example, a column has a data class 'credit card number' assigned. The expected value for that data class is a numeric string of 16 characters. If that column contains a value of 'MA,' then that value is identified as a violation of the data class.

Origins

  • Automatically generated based on profiling results
  • Automatically generated based on a business term constraint
  • Manually added on a column

If multiple data class checks are found for a column, the priority order is as follows:

  1. Manually added check
  2. Check generated from profiling results for a user-assigned data class
  3. Check generated from a business term constraint
  4. Check generated for a data class that is related to an assigned business term
  5. Check generated from profiling results for an inferred data class

Options

You can select the data class for which you want to check when you define a data quality check on a business term, add a check manually on a column, or edit an existing check.

Dimension

Validity

Data type check

A data type defines the valid format for data in a particular column. Examples of data type might include text, numeric, or date. This check counts the number of values in a column that do not match the detected or assigned data type of a column. Each value that does not match the inferred or selected data type in length, precision, or scale, or violates the specified data type is identified.

Origins

  • Automatically generated based on profiling results

    More than 50% of the values must have the same data type. If no predominant data type is detected, the check is not generated. For data from relational databases, this check is created only if the data type that is defined in the database is String but a different type is found.

  • Automatically generated based on a business term constraint

  • Manually added on a column

Options

You can select the data type for which you want to check when you define a data quality check on a business term, add a check manually on a column, or edit an existing check. Choose from these options:

  • TinyInt
  • SmallInt
  • Integer
  • BigInt
  • Date
  • String
  • Time
  • Timestamp
  • Decimal

Dimension

Validity

Format check

This check validates the pattern for data in columns that have a standard general format, for example, telephone numbers or social security numbers. This type of check is useful when the format of the data is critical to automated processing or to the display of the data. A general format is inferred based on the profiling results, defined in a business term constraint, or defined in a manually created check.

The quality score is based on the percentage of values identified as violating the format requirements subtracted from a percentage of 100.

Origins

  • Automatically generated based on profiling results

    More than 50% of the values must have the same general format. If no predominant format is detected, the check is not generated.

  • Automatically generated based on a business term constraint

  • Manually added on a column

Options

You can build the pattern that defines what is acceptable in each specific character position when you define a data quality check on a business term, add a check manually on a column, or edit an existing check. Use 'A' for any uppercase letter, 'a' for any lowercase letter, and '9' for any 0-9 digit. Any other character in the pattern string indicates that you are explicitly looking for the exact character you entered.

Dimension

Validity

Historical stability check

This check examines the stability and consistency of your data over time. The result of the latest analysis is compared to the collected data from previous analyses. When the check is applied at asset level, it evaluates the row count against previous results. When the check is applied to a column, various value counts and the average value can be evaluated.

For this check to return useful results, collection of historical data must be enabled in the default enrichment settings. For checks at asset level, the connector to the data source must allow for retrieval of the total record count, whether approximated or accurate. Retrieval of accurate counts is disabled by default because it's a costly operation. A cluster or instance administrator must be explicitly enable it at the system level.

If historical stability checks are run while not enough historical data is available for comparison, the checks are listed on the Data quality page for an asset or a column and show a quality score of 100%. In that case, you have these options:

  • Ensure that enough data is collected by running metadata enrichment at least twice before you enable the check. The Collect historical data option must be turned on. Keep in mind that some data sources do not provide the corresponding information. For example, if the data source is file-based, collecting the record count might not be allowed.
  • Exclude the check result from the overall data quality score by setting Contributes to overall score to Off on the Data quality page until the check can provide useful results.

The check results include the following information:

  • The range in which the row count or the values are expected. The range is usually the value of the standard deviation below the mean and the value of the standard deviation above the mean.
  • The actual result from the last analysis.
  • The data quality score for the checked asset or column. The score is calculated based on the deviation from the threshold.
  • Statistics from past analyses: the mean, which equals the average of the gathered data points, and the standard deviation, which measures how data points are spread around the mean.
  • Statistics from the most recent analysis: the absolute deviation and the z-score, which indicates how far away the value is from the mean. The z-score is measured based on the standard deviation.

Origins

  • Automatically generated based on profiling results (asset-level checks only)

    Data from at least 10 earlier analyses must be available for the check to be automatically generated.

  • Manually added on a column or an asset

Options

When you add this check on an asset, only the row count is considered.

When you add this check on a column, you can select what you want to be evaluated:

  • Average value
  • Distinct values count
  • Missing values count
  • Unique values count

For each selection, a separate entry is created on the Data quality page.

Dimension

Homogeneity

Length check

The check evaluates whether the value length is within the allowed range.

Origins

  • Automatically generated based on a business term constraint
  • Manually added on a column

Options

When you define a data quality check on a business term, add a check manually on a column, or edit an existing check, you can set the minimum and the maximum length.

Dimension

Validity

Missing values representation check

It is common for data assets to contain varying representations of missing data. One column in a data asset might contain several values of NULL, several others that read NA, and still others where the field is empty. All of these values might suggest missing information, but they are interpreted differently and can lead to inaccurate analysis. This check evaluates whether missing values are represented in a consistent way. When you adjust a generated check, add a check manually, or create a business term constraint, you .

Origins

  • Automatically generated based on profiling results

    The column must be nullable and must have both null values and empty values.

  • Automatically generated based on a business term constraint

  • Manually added on a column

Options

When you define a data quality check on a business term, add a check manually on a column, or edit an existing check, you can can select whether missing values should be represented in a column as null or as empty values.

Dimension

Consistency

Possible values check

This check evaluates column values against values on a reference list of allowed values.

Origins

  • Automatically generated based on a business term constraint
  • Manually added on a column

Options

When you define a data quality check on a business term, add a check manually on a column, or edit an existing check, you can provide a list of possible values. Enter each value on a separate line.

Dimension

Validity

Range check

This check examines whether values are within the allowed range.

Origins

  • Automatically generated based on profiling results

    The check is generated only if the data type of the column is numeric and the minimum value is greater than 0. For generated checks, the minimum length is set to 0.

  • Automatically generated based on a business term constraint

  • Manually added on a column

Options

When you define a data quality check on a business term, add a check manually on a column, or edit an existing check, you can specify a minimum value, a maximum value, or both.

Dimension

Validity

Referential integrity check

The check can be applied to columns that have been identified as a foreign key in a key relationship and evaluates whether the relationship between the foreign key and the primary key is valid. Only assigned relationships are checked. Suggested relationships are not checked. Foreign key values that do not match the primary key value in the parent table are identified as violations.

Currently, the check is available for single-column keys only.

Origins

  • Automatically generated based on existing key relationships

    When a referential integrity check is automatically generated for a foreign-key column, also a completeness check and a uniqueness check are created for the corresponding primary-key column.

  • Manually added on a foreign-key column in an assigned relationship

    Also add completeness and uniqueness checks for the primary-key column.

Options

None.

Dimension

Consistency

Regex check

This check evaluates whether values match the pattern specified by the regular expression. The regular expression must use JavaScript format.

Origins

  • Automatically generated based on a business term constraint
  • Manually added on a column

Options

When you define a data quality check on a business term, add a check manually on a column, or edit an existing check, you can specify a regular expression in JavaScript format.

Dimension

Validity

Suspect values

This check looks for suspect values that do not seem to match the majority of the other values in the column because their characteristics are different. It identifies outliers in numeric columns or string columns with numeric data. For non-numeric data, a domain similarity threshold is calculated for the column. If the similarity score of a non-numeric value is below that threshold, the value is considered an outlier.

Origins

  • Automatically generated based on a business term constraint
  • Manually added on a column

Options

None.

Dimension

Consistency

Uniqueness check

This check identifies duplicated values in columns where most of the values are unique. All non-unique values are flagged as quality issues. The uniqueness threshold in the metadata enrichment settings determines the percentage of distinct values that must be unique. The default setting is 95%.

Origins

  • Automatically generated based on profiling results

  • Automatically generated based on a business term constraint

  • Automatically generated based on existing key relationships

    In this case, the check is generated for the primary-key column of the relationship.

  • Manually added on a column

Options

None.

Dimension

Uniqueness

Learn more

Parent topic: Managing data quality