Data quality dimensions violations (Watson Knowledge Catalog)

Data quality analysis identifies quality problems in your data by analyzing quality dimensions at both the data asset level and the column level.

Data quality dimensions evaluated

You find this information on the Data quality tab when you open a data asset in your data quality project, either for the entire data asset or at column level.

For any violation that is found, further information is provided:

At the data asset level, expand the dimension to see which columns are affected. Click a column name to see the dimension details for that column.
At the column level, click the dimension name for further details.

You can then decide whether each data quality violation is ignored or evaluated during the analysis. You can make this decision for the entire data asset or for individual columns. Go to the respective Data quality tab, click Edit, and modify the value in the Ignore column.

The Delta column in the results table shows how the number of violations changed between the last two analyses.

Results are provided for the following data quality violations:

Data class violations
Data type violations
Duplicated values
Format violations
Inconsistent capitalization
Inconsistent representation of missing values
Missing values
Suspect values
Suspect values in correlated columns
Values out of range
Rule violations

Data class violations

A data class is the kind of data detected for a particular column. Examples of data class might include postal code, country, or credit card number. This metric counts the number of values in a column that do not match the detected data class of that column. Each value that violates the class is identified. The quality score is based on the percentage of values identified subtracted from a percentage of 100.

For example, a column has a data class 'credit card number' assigned. The expected value for that data class is a numeric string of 16 characters. If that column contains a value of 'MA', then that value is identified as a violation of the data class. If that column has 100 values, 40 values do not match the class, and no other quality dimensions are identified, the column has a quality score of 60% because 40% of the values violate the column's data class.
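
The arithmetic in this example can be sketched in Python. The regular expression and the function name data_class_score are illustrative assumptions, not the classifier that the product uses. The sketch only assumes, as the example does, that a valid value is a string of exactly 16 digits.

```python
import re

# Assumption: a valid 'credit card number' is exactly 16 digits, as in the example.
CREDIT_CARD = re.compile(r"\d{16}")

def data_class_score(values, pattern=CREDIT_CARD):
    """Return the values that do not match the pattern and the score in percent."""
    violations = [v for v in values if not pattern.fullmatch(v)]
    score = 100.0 * (len(values) - len(violations)) / len(values)
    return violations, score

# 60 matching values and 40 values of 'MA' reproduce the 60% score.
column = ["4111111111111111"] * 60 + ["MA"] * 40
violations, score = data_class_score(column)
print(len(violations), score)
```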

Addressing data class violations: You can manually override the data class selected by the analysis by editing the analysis results.

Data type violations

A data type defines the valid format for data in a particular column. Examples of data type might include text, numeric, or date. This metric counts the number of values in a column that do not match the detected or assigned data type of a column. Each value that does not match the inferred data type in length, precision, or scale, or that violates the specified data type, is identified. The quality score is based on the percentage of values identified subtracted from a percentage of 100.

For example, a column has a data type DECIMAL(4,2) specified. That data type defines the format of the column as a numeric value with a total length of 4 digits, 2 of which follow the decimal point. If that column contains a numeric value with too many digits, then that value is identified as a violation of the data type. If that column has 100 values, 40 values do not match the type, and no other quality dimensions are identified, the column has a quality score of 60% because 40% of the values violate the column's data type.
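
A DECIMAL(4,2) check of this kind can be sketched with Python's decimal module. The function name fits_decimal is hypothetical, and the sketch assumes that values arrive as strings. It is not how the analysis itself is implemented.

```python
from decimal import Decimal, InvalidOperation

def fits_decimal(value, precision=4, scale=2):
    """True if the string is a number that fits DECIMAL(precision, scale)."""
    try:
        number = Decimal(value)
    except InvalidOperation:
        return False          # not a number at all
    if not number.is_finite():
        return False          # NaN and Infinity do not fit
    _, digits, exponent = number.as_tuple()
    fraction_digits = max(-exponent, 0)
    integer_digits = max(len(digits) + exponent, 0)
    return fraction_digits <= scale and integer_digits <= precision - scale

# 60 values that fit and 40 values with too many digits give a score of 60%.
column = ["12.34"] * 60 + ["123.456"] * 40
violations = [v for v in column if not fits_decimal(v)]
score = 100.0 * (len(column) - len(violations)) / len(column)
print(len(violations), score)
```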

Addressing data type violations: You can manually override the data type detected by the analysis by editing the analysis results.

Duplicated values

This dimension identifies duplicated values in columns where most of the values are unique. In a column where at least 95% of the values are identified as unique, each duplicate value is identified. The quality score is based on the percentage of values identified subtracted from a percentage of 100.

For example, a set of patient data contains a column with social security numbers. The majority of the values in the column appear only once because each patient is only associated with one SSN. Each duplicate value in this column is identified. If the column has 100 values, 3 values are duplicates, and no other quality dimensions are identified, the column has a quality score of 97% because 3% of the values are duplicates.
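
The following Python sketch reproduces this example. It rests on two assumptions that the text does not confirm: the uniqueness of a column is measured as the share of distinct values, and every occurrence of a value after the first counts as a duplicate. The function name duplicate_score is hypothetical.

```python
def duplicate_score(values, min_unique_ratio=0.95):
    """Return the duplicate occurrences and the score in percent."""
    seen = set()
    duplicates = []
    for value in values:
        if value in seen:
            duplicates.append(value)   # second or later occurrence
        else:
            seen.add(value)
    if len(seen) / len(values) < min_unique_ratio:
        return [], 100.0               # column is not mostly unique, nothing is flagged
    score = 100.0 * (len(values) - len(duplicates)) / len(values)
    return duplicates, score

# 97 distinct identifiers plus 3 repeated ones give a score of 97%.
ssns = [f"ssn-{i}" for i in range(97)] + ["ssn-0", "ssn-1", "ssn-2"]
duplicates, score = duplicate_score(ssns)
print(duplicates, score)
```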

Addressing duplicated values violations: You can change the uniqueness setting for a column to allow duplicate values by editing the analysis results.

Format violations

A format is a pattern for data in a particular column that correlates to the data class of that column. In some cases, certain formats might be specified as invalid for a particular column in an analysis. In those cases, this metric counts the values in the column that match the invalid format. You can set invalid formats in the column view, in the Formats tab. Each value that matches a format marked as invalid is identified. The quality score is based on the percentage of values identified subtracted from a percentage of 100.
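
As a rough illustration, matching values against invalid formats can be sketched in Python. The mask notation used here (A for an uppercase letter, a for a lowercase letter, 9 for a digit) and the names format_mask and format_violations are assumptions made for this sketch. The formats shown on the Formats tab might be expressed differently.

```python
def format_mask(value):
    """Map letters to 'A' or 'a' and digits to '9'; keep other characters."""
    mask = []
    for ch in value:
        if ch.isdigit():
            mask.append("9")
        elif ch.isalpha():
            mask.append("A" if ch.isupper() else "a")
        else:
            mask.append(ch)
    return "".join(mask)

def format_violations(values, invalid_formats):
    """Return the values whose mask is one of the formats marked as invalid."""
    return [v for v in values if format_mask(v) in invalid_formats]

codes = ["AB-123", "CD-456", "ef-789", "GH-012"]
flagged = format_violations(codes, invalid_formats={"aa-999"})
score = 100.0 * (len(codes) - len(flagged)) / len(codes)
print(flagged, score)
```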

Addressing format violations: You can override the format detected by the analysis by editing the analysis results.

Inconsistent capitalization

This dimension checks whether the use of uppercase and lowercase letters in the analyzed data asset is consistent.

For example, a column has values that are written in both lowercase and uppercase. If the column has 100 values, 90 of them are in lowercase, and 10 of them are in uppercase, and no other quality dimensions are identified, the column has a quality score of 90% because 10% of the values are in a different case than the majority.
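
A minimal Python sketch of this example follows. It assumes that each value is classified as lowercase, uppercase, or other, and that every value outside the majority style is flagged. The product's actual classification might be finer grained, and the function names are hypothetical.

```python
from collections import Counter

def case_style(value):
    """Classify a string as 'lower', 'upper', or 'other' (mixed or uncased)."""
    if value.islower():
        return "lower"
    if value.isupper():
        return "upper"
    return "other"

def capitalization_score(values):
    """Return the values that deviate from the majority style and the score."""
    styles = [case_style(v) for v in values]
    majority, _ = Counter(styles).most_common(1)[0]
    flagged = [v for v, style in zip(values, styles) if style != majority]
    score = 100.0 * (len(values) - len(flagged)) / len(values)
    return flagged, score

# 90 lowercase values and 10 uppercase values give a score of 90%.
cities = ["boston"] * 90 + ["BOSTON"] * 10
flagged, score = capitalization_score(cities)
print(len(flagged), score)
```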

Addressing inconsistent capitalization violations: You can investigate the identified column or columns to get more information and determine the best response. For example, in some cases, you might need to create a note to suggest standardization for a column. If the values are correct, edit the data asset and change the setting for this violation to be ignored.

Inconsistent representation of missing values

It is common for data assets to contain varying representations of missing data. One column in a data asset might contain several values of NULL, several others that read NA, and still others where the field is blank. All of these values might suggest missing information, but they are interpreted differently and can lead to inaccurate analysis. The inconsistent representation of missing values is detected by identifying columns with both null values and empty values. A column that contains both null values and empty values suggests that there is no standardized way to represent missing values. Often when a column contains null values, any empty values should also be represented as null.

Each value that matches these criteria in a column is identified. The quality score is based on the percentage of values identified subtracted from a percentage of 100.
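
The detection step described above, a column that contains both null values and empty values, can be sketched in Python. The sketch assumes that a null value is represented by None and an empty value by an empty string. It does not decide which of the values are counted in the score, because the text does not specify that.

```python
def missing_representations(values):
    """Count null (None) and empty-string values in a column."""
    nulls = sum(1 for v in values if v is None)
    empties = sum(1 for v in values if v == "")
    return nulls, empties

def is_inconsistent(values):
    """True if the column mixes null values and empty values."""
    nulls, empties = missing_representations(values)
    return nulls > 0 and empties > 0

print(is_inconsistent(["a", None, "", "b"]))   # mixes both representations
print(is_inconsistent(["a", None, None]))      # uses null values only
```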

Addressing representation of missing values violations: You can investigate the identified column or columns to get more information and determine the best response. For example, in some cases, you might need to create a note to suggest standardization for a column. If the values are correct, edit the data asset and change the setting for this violation to be ignored.

Missing values

This dimension looks for missing values in a column. Rows with missing values are deemed incomplete. The quality score is based on the percentage of rows in that column that are complete.

For example, if a column has 100 values, 40 of those are missing values, and no other quality dimensions are identified, the quality score is 60% because 60 out of 100 values are identified as complete.

Addressing missing values violations: You can change the missing values setting for a column to allow null values by editing the analysis results.
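
The score from the example, and the effect of allowing null values, can be sketched in Python. The sketch assumes that a missing value is represented by None and that allowing null values means they are no longer counted as violations. The function name and the allow_null parameter are hypothetical.

```python
def missing_values_score(values, allow_null=False):
    """Return the percentage of values that count as complete."""
    if allow_null:
        return 100.0                   # null values are accepted for this column
    missing = sum(1 for v in values if v is None)
    return 100.0 * (len(values) - missing) / len(values)

# 60 complete values and 40 missing values give a score of 60%.
column = ["x"] * 60 + [None] * 40
score = missing_values_score(column)
score_when_allowed = missing_values_score(column, allow_null=True)
print(score, score_when_allowed)
```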

Suspect values

When the data class of a column cannot be determined, this metric looks for suspect values that do not seem to match the majority of the other values in the column because their characteristics are different. Each suspect value that violates the domain is identified. The quality score is based on the percentage of values identified subtracted from a percentage of 100.

For example, if a column contains 100 values, and 98 of those values are numeric strings in the range 5 - 9 characters in length, but two are 30-45 character text strings, those two values are identified as suspect because they do not match the characteristics of the other values. If no other quality dimensions are identified, the column has a quality score of 98% because 2% of the values are suspect.
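
The text does not describe how the characteristics of values are compared. The following Python sketch uses a deliberately simple stand-in heuristic: each value is classified as numeric or text, and values that differ from the majority are treated as suspect. The function names are hypothetical.

```python
from collections import Counter

def value_kind(value):
    """Stand-in characteristic: 'numeric' for digit-only strings, else 'text'."""
    return "numeric" if value.isdigit() else "text"

def suspect_values(values):
    """Return the values whose kind differs from the majority kind."""
    kinds = [value_kind(v) for v in values]
    majority, _ = Counter(kinds).most_common(1)[0]
    return [v for v, kind in zip(values, kinds) if kind != majority]

# 98 short numeric strings and 2 long text strings give a score of 98%.
column = [str(10000 + i) for i in range(98)] + ["x" * 30, "y" * 45]
suspects = suspect_values(column)
score = 100.0 * (len(column) - len(suspects)) / len(column)
print(len(suspects), score)
```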

Addressing suspect values violations: You can investigate the identified column or columns to get more information and determine the best response. For example, in some cases, you might need to create a note to suggest standardization for a column. If the value is correct, edit the data asset and change the setting for this violation to be ignored.

Suspect values in correlated columns

This dimension identifies columns that are correlated with other columns and then uses that information to identify values that do not have the same correlation. Two columns are correlated if the value of one column can be predicted by the value of another column. If one column has a strong correlation with another column, but some values in the column do not show the same correlation, then the values are marked as suspect values.

The correlation violations also apply to categorical values.

In the following example, the correlation between Columns 1 and 2 is: Column 2 is 1 if Column 1 is A. Column 2 is 2 if Column 1 is B.

Column 1    Column 2
A           1
B           2
A           1
B           2
A           1
B           2
A           2
B           2

The row that contains A and 2 can be marked as a suspect value.
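
This categorical example can be sketched in Python. The sketch assumes that the expected Column 2 value for each Column 1 value is simply the most frequent one. The analysis might measure correlation differently, and the function name suspect_rows is hypothetical.

```python
from collections import Counter, defaultdict

# The rows of the example: (Column 1, Column 2).
rows = [("A", 1), ("B", 2), ("A", 1), ("B", 2),
        ("A", 1), ("B", 2), ("A", 2), ("B", 2)]

def suspect_rows(rows):
    """Flag rows whose second value is not the most common one for the first value."""
    counts_by_key = defaultdict(Counter)
    for key, value in rows:
        counts_by_key[key][value] += 1
    expected = {key: counts.most_common(1)[0][0]
                for key, counts in counts_by_key.items()}
    return [(index, key, value)
            for index, (key, value) in enumerate(rows)
            if value != expected[key]]

# Only the row that contains A and 2 (index 6) is flagged.
print(suspect_rows(rows))
```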

In the following example, the correlation between Column 1 and Column 2 is: Column 2 = 2 * Column 1.

Column 1    Column 2
1           2
2           4
3           6
4           1
5           10
6           12

The value 4 in Column 1 and the value 1 in Column 2 are suspect because they violate the correlation rule (Column 2 = 2 * Column 1).
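
The numeric example can be sketched in Python as well. The sketch assumes a purely proportional relationship and estimates the factor as the median of the row-wise ratios, which is a simplification chosen for this illustration rather than the method that the analysis uses. The function name suspect_by_ratio is hypothetical.

```python
import math
from statistics import median

# The rows of the example: (Column 1, Column 2).
rows = [(1, 2), (2, 4), (3, 6), (4, 1), (5, 10), (6, 12)]

def suspect_by_ratio(rows):
    """Estimate the factor in Column 2 = factor * Column 1 and flag deviating rows."""
    # Assumption: Column 1 contains no zero, so every ratio is defined.
    factor = median(y / x for x, y in rows)
    suspects = [(x, y) for x, y in rows if not math.isclose(y, factor * x)]
    return factor, suspects

factor, suspects = suspect_by_ratio(rows)
print(factor, suspects)
```

The median is used because a few deviating rows do not shift it, so the estimated factor stays at 2 even though the row with 4 and 1 is present.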

Addressing suspect values in correlated columns violations: Review all values that are flagged as suspect values to determine whether they are correct or invalid. If the value is invalid, correct the value or mark the row as invalid. If the value is correct, edit the data asset and change the setting for this violation to be ignored.

This dimension must be explicitly enabled in the data quality settings for your project.

Values out of range

This dimension identifies outliers in a column's data. A user can constrain the minimum and maximum values allowed for a column. Any value falling outside of that range is identified. The quality score is based on the percentage of values identified subtracted from a percentage of 100.

For example, a column consists of numbers that represent account balances. If the minimum value for these account balances is set to 0, then each value that falls below the minimum value of zero is identified. If that column has 100 values, 40 values fall below the minimum value, and no other quality dimensions are identified, the column has a quality score of 60% because 40% of the values fall outside of the column's allowed value range.
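
The range check in this example can be sketched in Python. The function name out_of_range and its parameters are hypothetical. A bound of None means that no constraint is set on that side.

```python
def out_of_range(values, minimum=None, maximum=None):
    """Return the values that fall below the minimum or above the maximum."""
    return [v for v in values
            if (minimum is not None and v < minimum)
            or (maximum is not None and v > maximum)]

# 60 non-negative balances and 40 negative balances give a score of 60%.
balances = [100.0] * 60 + [-25.0] * 40
flagged = out_of_range(balances, minimum=0)
score = 100.0 * (len(balances) - len(flagged)) / len(balances)
print(len(flagged), score)
```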

Addressing values out of range violations: You can set minimum and maximum value constraints for a column by editing the analysis results.

Rule violations

This dimension identifies values in a data asset that do not meet the conditions of an associated data rule, rule set, or quality rule. All quality rules are included in the data quality score calculation. Data rules and rule sets that are eligible to be included in the data quality score must meet the following criteria:

The rule must be valid and fully bound.
All of the rule variables must be bound to columns in the same analyzed data asset.
The data rule must not contain any columnar/aggregation functions, such as minimum, maximum, sum, or lookups to tables.

Any value that does not meet the condition of an eligible rule is identified.

For example, you want to confirm that all values in a column AGE are greater than 18. If the column has 100 values, and 3 values do not meet the conditions of that rule, and no other quality dimensions are identified, the column has a quality score of 97% because 3% of the values did not meet the conditions of the data rule. The data rule name listed in the analysis results is the name of the data rule definition, and the rule set name is the name of the rule set definition.
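
The AGE example can be sketched in Python by treating the rule as a condition that is checked for each value independently, which matches the requirement that eligible rules contain no aggregation functions. The function name rule_violations is hypothetical, and the lambda stands in for a data rule definition.

```python
def rule_violations(values, rule):
    """Return the values for which the rule condition is not met."""
    return [v for v in values if not rule(v)]

# 97 values above 18 and 3 values that are not give a score of 97%.
ages = [30] * 97 + [18, 12, 5]
failed = rule_violations(ages, lambda age: age > 18)
score = 100.0 * (len(ages) - len(failed)) / len(ages)
print(failed, score)
```

Note that the value 18 fails the rule because the condition is strictly greater than 18.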

Learn more

Data quality project settings
