Data quality assets

Design data quality assets for analyzing and monitoring the data quality in a project.

You can have the following data quality assets in your project:

Before you start creating data quality definitions and rules, give the following topics some thought:

  • What do you want to analyze and monitor?
  • Which elements do you need to evaluate?
  • What's the goal of the analysis, such as checking for completeness, validity, and so on?
Required permissions
To create, edit, or delete data quality assets, you must have the Manage data quality assets user permission and the Admin or the Editor role in the project.

Data quality definitions

A data quality definition represents a generic form of a data quality rule. It describes the rule evaluation or condition by using logical variables that are not tied to any actual data. Thus, it can be used in any number of data quality rules. If you change the data quality definition, you also change the validation logic for all rules derived from the definition.

You create and manage data quality definitions in projects. To make a data quality definition available for re-use in other projects, you can publish it to a catalog.

Properties for data quality definitions

In addition to the common asset properties, data quality definitions have the following properties in projects. Users with the appropriate permissions can edit all properties.

For more information about the common properties, see Common properties for assets.

Rule expression
The defined rule logic. Changes to the rule expression affect all rules derived from this data quality definition.
 
Data quality dimension
Optional. The primary data quality metric for the rule logic. The selected dimension can be used as report category, for filtering, or for visualizing selected data.
 
Governance artifacts
Optional. The business terms and governance rules that users assigned to the asset.
 
Custom properties
Optional. Any custom properties that are defined for this asset type.

Data quality rules

A data quality rule links or binds logical variables to actual data for evaluation. A rule is run against physical data to assess the quality of your data by evaluating and validating specific conditions. Each rule run provides statistics and information about potential exceptions as defined for the rule's output table.

You create, manage, and run data quality rules in projects. To make a data quality rule available for re-use in other projects, you can publish it to a catalog.

You can't publish the following data quality rules to catalogs:

  • Parameterized data quality rules
  • Data quality rules with subflows

Also, when you copy a data quality rule or a data quality definition from a catalog to the same project multiple times, multiple data quality rules or data quality definitions with the same name might get created depending on your project configuration for duplicate asset handling.

You can create rules from one or more data quality definitions or you can create data quality rules with SQL statements. Rules that are built from data quality definitions capture which columns comply with the rule conditions and which don't. SQL-based rules are better suited to check for noncompliant records.

For example, you want to validate tax identifiers. So your concepts could be TaxID exists and Validate TaxID.

Now, you have these options:

  • Create rules from data quality definitions. For either concept, you can create a data quality definition with evaluation logic for the logical variable tax_id. The first condition is that the tax identifier (or TaxID) must exist, and the second condition is that the tax identifier must meet a defined format.

    Data quality definition TaxID exists: tax_id exists
    Data quality definition Validate TaxID: tax_id matches_format 'AA99-A999-9999'

    Then, select one of these options:

    • For each column that contains a tax identifier to be validated, define two data quality rules. The first rule binds the logical variable tax_id of the definition TaxID exists to the column. The second rule binds the logical variable tax_id of the definition Validate TaxID to the column.
    • For each column that contains a tax identifier to be validated, define one data quality rule and use both data quality definitions in that rule. Bind the logical variable tax_id in either definition TaxID exists and Validate TaxID to the column.
    • Define one data quality rule and use both data quality definitions in that rule. Bind the logical variable tax_id in either definition TaxID exists and Validate TaxID to a parameter set of the type Parameter from column. Add all columns that contain a tax identifier to be validated to that parameter set.
  • Create an SQL-based rule: select tax_id from taxschema.taxtable where tax_id is null or not regexp_like(tax_id, '^[a-zA-Z]{2}[0-9]{2}-[a-zA-Z][0-9]{3}-[0-9]{4}$')

Properties for data quality rules

In addition to the common asset properties, data quality rules have the following properties in projects. Users with the appropriate permissions can edit all properties.

For more information about the common properties, see Common properties for assets.

Bound expressions
The list of rule expressions. You can see information about a binding target by hovering over the tag in the bound expression.
Rule type: Simple rule
 
Rule expressions
The list of rule expressions.
Rule type: Rule with external bindings
 
SQL connection
The connection to the data source where the rule will be applied.
Rule type: SQL-based rule
 
SQL statements
The SQL query that makes up the rule.
Rule type: SQL-based rule
 
Data quality dimension
Optional. The primary data quality metric to which the data quality rule contributes.
Rule type: Simple rule, rule with external bindings
 
Related DataStage flow
The DataStage flow that provides the bindings for the rule.
Rule type: Rule with external bindings
 
Data quality definitions
The data quality definitions that provide the rule expressions.
Rule type: Simple rule, rule with external bindings
 
Governance artifacts
Optional. The business terms and governance rules that users assigned to the asset.
Rule type: Simple rule, rule with external bindings, SQL-based rule
 
Custom properties
Optional. Any custom properties that are defined for this asset type.
Rule type: Simple rule, rule with external bindings, SQL-based rule
 
Relationships
Relationships appear in the Related objects section and can be between the data quality rule and assets or columns in the same project, or between the rule and an artifact.
The following relationships are automatically created:
  • For all rule types, an Is implemented by relationship with the associated DataStage flow
  • For simple rules:
    • An Is used by relationship with the associated DataStage flow.
    • A Validates data quality of relationship with each bound column.
    • A Uses binding relationship with each parent asset of a bound column. Do not edit such relationships.
  • For rules with externally managed bindings, Is implemented by and Is used by relationships with the associated DataStage subflow
  • For simple rules and rules with externally managed bindings, an Implements relationship for each of the referenced data quality definitions
Add related objects as required.
The Data quality page shows aggregated information for each column that is listed here with a Validates the data quality of relationship. For rules with external bindings, quality scores are reported for columns with this relationship only if no columns for quality score reporting are configured in the rule subflow stage.
 
Selected output
Optional. If configured, the columns of the rule output table.
Rule type: Simple rule, rule with external bindings, SQL-based rule

Porting data quality assets between projects

You can export data quality assets from a project and import them to another one as described in Exporting project assets. In addition to the data quality definitions and rules, you can select the following items associated with data quality rules to be included in the project export:

  • Connections

    Important: If a data quality rule uses a connection with personal credentials, unlock that connection after the import before you run the rule.
  • Data assets that are used in bindings

  • DataStage flows and subflows

  • Jobs

  • Data assets that were created for rule output tables

The following items are not exported:

  • Rule run history
  • Any governance artifacts that are associated with a data quality asset
  • Project-level output settings
  • Rule-generated data quality information

If you defined any custom properties on the data quality definitions and rules that you export, make sure that the same custom properties with the exact unique identifiers are defined in the target system before you import the project.

Learn more