Ensuring data quality through data contracts
A data contract is a formal agreement between a data producer and a data consumer that defines, among other things, the expected structure, schema, and quality of data. It ensures that data meets the requirements and aligns with business definitions.
Data contracts for use in IBM Knowledge Catalog must be set up in YAML or JSON format and conform to the Open Data Contract Standard (ODCS).
Tech preview: This is a technology preview and is not yet supported for use in production environments.
- Required permissions
- You must have the Admin or the Editor role in the project and the Manage data quality assets and Measure data quality user permissions.
Data contracts can be fully managed and enforced through the Data Product Hub UI. However, you can also manage and version contract files in YAML or JSON format in an external source control system such as Git and then run data quality validations directly by using the Data Contract Enforcement API.
When you create a contract in YAML or JSON format, make sure to include the following properties in addition to the ones that are required by ODCS:
- Provide a name for the data contract with the name property.
- Include a servers section and set the schema property for each server that you define.
- Include a schema section and set the physicalName property for each schema object that you define.
- For each quality object, provide a name with the name property.
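The additional properties listed above can be checked before a contract is uploaded. The following is a minimal sketch, assuming the contract has already been parsed into a Python dictionary and that quality objects appear as lists under a quality key at any nesting level. The function names and the sample contract are illustrative only and are not part of the product or of ODCS.

```python
from typing import Any


def find_quality_objects(node: Any) -> list[dict]:
    """Collect every dict listed under a 'quality' key, at any nesting depth."""
    found: list[dict] = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "quality" and isinstance(value, list):
                found.extend(q for q in value if isinstance(q, dict))
            found.extend(find_quality_objects(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_quality_objects(item))
    return found


def missing_properties(contract: dict) -> list[str]:
    """Return a list of the additional properties that are missing."""
    problems: list[str] = []
    if not contract.get("name"):
        problems.append("contract: name")
    servers = contract.get("servers") or []
    if not servers:
        problems.append("contract: servers")
    for i, server in enumerate(servers):
        if not server.get("schema"):
            problems.append(f"servers[{i}]: schema")
    schema = contract.get("schema") or []
    if not schema:
        problems.append("contract: schema")
    for i, obj in enumerate(schema):
        if not obj.get("physicalName"):
            problems.append(f"schema[{i}]: physicalName")
    for i, quality in enumerate(find_quality_objects(contract)):
        if not quality.get("name"):
            problems.append(f"quality[{i}]: name")
    return problems


# Illustrative contract fragment; the properties required by ODCS itself are omitted.
sample = {
    "name": "orders_contract",
    "servers": [{"server": "prod", "schema": "SALES"}],
    "schema": [
        {
            "name": "orders",
            "physicalName": "ORDERS",
            "quality": [{"name": "orders_row_count"}],
        }
    ],
}
print(missing_properties(sample))
```

An empty list means that all four additional properties are present. The check does not replace validation against ODCS itself.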
The Data Contract Enforcement API provides methods that you can use for these tasks:
- Upload a new data contract or update an existing one.
- Run data quality validation tests against data based on that contract.
- Retrieve the test results.
The API calls require one or more of these parameters:
- project_id
- The ID of the project that you want to use as workspace for your validations.
- data_contract_id
- The ID of the data contract against which you want to validate your data. You can retrieve the ID from the id field in the response when you create a data contract. Alternatively, you can submit a GET /data_quality/v4/projects/{project_id}/data_contracts call to list all data contracts within a project.
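As an illustration of how the two parameters fit into the endpoint paths, the following sketch builds the paths that are used in this topic and reads the contract ID from a create response. The helper names are hypothetical. The only response detail that is taken from this topic is the id field.

```python
from urllib.parse import quote

API_ROOT = "/data_quality/v4/projects"


def contracts_path(project_id: str) -> str:
    """Path for creating contracts or listing all contracts in a project."""
    return f"{API_ROOT}/{quote(project_id, safe='')}/data_contracts"


def contract_path(project_id: str, data_contract_id: str, suffix: str = "") -> str:
    """Path for one contract, optionally with a suffix such as 'test' or 'test_results'."""
    path = f"{contracts_path(project_id)}/{quote(data_contract_id, safe='')}"
    return f"{path}/{suffix}" if suffix else path


def contract_id_from_create_response(response_body: dict) -> str:
    """Read the contract ID from the parsed response of a create call."""
    return response_body["id"]


print(contracts_path("my-project"))
print(contract_path("my-project", "my-contract", "test_results"))
```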
The following types of data quality checks are validated for data contracts:
- Rules in SQL format
- These column-level library metrics:
- Null values
- Missing values
- Invalid values
- Duplicate values
- These schema-level library metrics:
- Row count
- Duplicate values
For each check, you can provide scheduling information in the data contract by using a cron expression. However, the test is run only once in IBM Knowledge Catalog even if the cron expression in the data contract defines recurring runs.
For more information about these data quality checks and how to set up scheduling, see Open Data Contract Standard: Data Quality.
Enforcing a new data contract
You want to enforce a data contract that does not yet exist in the project:
- Create the contract in YAML or JSON format, for example, in Git.
- Optional: Validate the contract against the syntax that is defined in the ODCS standard before you create the contract in the project:
  POST /data_quality/v4/projects/{project_id}/data_contracts_validation
- Create the contract in the project:
  POST /data_quality/v4/projects/{project_id}/data_contracts
- Run the test:
  POST /data_quality/v4/projects/{project_id}/data_contracts/{data_contract_id}/test
  The data assets and the SQL rules that are defined in the data contract are created in the project and the rules are run, either directly or as scheduled. Even if a recurring schedule is defined, the test runs only once.
  If you want to remove the data quality rules from the project after the test is complete, set the retain_dq_objects parameter of the call to false.
- Retrieve the test results:
  GET /data_quality/v4/projects/{project_id}/data_contracts/{data_contract_id}/test_results
- Review the results to determine whether the data met the defined quality standards.
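The order of calls for a new contract can be sketched as follows. This is not a client for the API: the send callable is a hypothetical transport that you supply yourself, because this topic does not describe the host, authentication, or request encoding. The sketch also assumes that the contract text is passed as the request body, that send raises an error on a failed call, and that results are available as soon as the test call returns.

```python
from typing import Any, Callable, Optional

# Hypothetical transport: send(method, path, body) returns the parsed response.
Send = Callable[[str, str, Optional[str]], Any]


def enforce_new_contract(
    send: Send,
    project_id: str,
    contract_text: str,
    validate_first: bool = True,
) -> Any:
    """Validate (optional), create, test, and fetch results for a new contract."""
    base = f"/data_quality/v4/projects/{project_id}/data_contracts"

    if validate_first:
        # Assumption: the contract text is sent as the request body.
        send("POST", f"{base}_validation", contract_text)

    created = send("POST", base, contract_text)
    contract_id = created["id"]  # The create response carries the ID in 'id'.

    send("POST", f"{base}/{contract_id}/test", None)
    # Assumption: results can be read right away; polling might be needed.
    return send("GET", f"{base}/{contract_id}/test_results", None)
```

Injecting the transport keeps the sequence of calls separate from connection details, so the same function can be exercised with a stub before it is pointed at a real environment.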
Enforcing an updated data contract
You updated the data contract and need to retest your data. For example, the data that is subject to the data contract or the data quality requirements changed.
- Update the contract in the source repository, for example, in Git.
- Update the contract in the project:
  PUT /data_quality/v4/projects/{project_id}/data_contracts/{data_contract_id}
- Run the test:
  POST /data_quality/v4/projects/{project_id}/data_contracts/{data_contract_id}/test
  Depending on the changes in the data contract, data assets and SQL rules are updated or added, and the rules are run, either directly or as scheduled. Even if a recurring schedule is defined, the test runs only once.
- Retrieve the test results:
  GET /data_quality/v4/projects/{project_id}/data_contracts/{data_contract_id}/test_results
- Review the results to determine whether the data met the defined quality standards.
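The update flow differs from the flow for a new contract mainly in that the contract ID is already known and a PUT call replaces the create call. A minimal sketch follows, again with a hypothetical send transport that you provide, and assuming that the updated contract text is passed as the request body of the PUT call.

```python
from typing import Any, Callable, Optional

# Hypothetical transport: send(method, path, body) returns the parsed response.
Send = Callable[[str, str, Optional[str]], Any]


def enforce_updated_contract(
    send: Send,
    project_id: str,
    data_contract_id: str,
    contract_text: str,
) -> Any:
    """Update an existing contract, rerun the test, and fetch the results."""
    contract = (
        f"/data_quality/v4/projects/{project_id}"
        f"/data_contracts/{data_contract_id}"
    )
    # Assumption: the updated contract text is sent as the request body.
    send("PUT", contract, contract_text)
    send("POST", f"{contract}/test", None)
    return send("GET", f"{contract}/test_results", None)
```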
Finding a specific data contract, and testing or retesting as required
You want to find out whether a specific contract exists in your project and was enforced, so that you can decide whether to rerun the data quality tests or create a new contract.
- Identify the contract that you want to look for in the source repository. Note the name, ID, or any metadata that you can match against what’s in the project.
- List the data contracts that exist in the project:
  GET /data_quality/v4/projects/{project_id}/data_contracts
- If the contract that you are looking for exists in the project, check for existing results:
  GET /data_quality/v4/projects/{project_id}/data_contracts/{data_contract_id}/test_results
  - If results are available and are still current, no further action is needed.
  - If no results are available or if the existing results are outdated, for example, because they stem from an older contract or older data, run the test:
    POST /data_quality/v4/projects/{project_id}/data_contracts/{data_contract_id}/test
  - Retrieve the results:
    GET /data_quality/v4/projects/{project_id}/data_contracts/{data_contract_id}/test_results
- If the contract that you are looking for does not exist in the project, follow the steps in Enforcing a new data contract.
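The decision that these steps describe can be written down as a small function. The names are hypothetical, and the sketch leaves it to the caller to determine whether the contract exists and whether existing results are still current, because this topic does not define the response formats.

```python
from enum import Enum


class Action(Enum):
    CREATE_CONTRACT = "follow the steps for enforcing a new data contract"
    RUN_TEST = "run the test, then retrieve the results"
    NOTHING = "no further action is needed"


def next_action(
    contract_in_project: bool,
    has_results: bool,
    results_current: bool,
) -> Action:
    """Map the outcome of the list and test_results calls to the next step."""
    if not contract_in_project:
        return Action.CREATE_CONTRACT
    if has_results and results_current:
        return Action.NOTHING
    # No results, or results that stem from an older contract or older data.
    return Action.RUN_TEST


print(next_action(contract_in_project=True, has_results=True, results_current=False).name)
```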
Deleting contracts, rules, or results
You can delete data contracts and any test results from a project:
- Delete one or more contracts:
  DELETE /data_quality/v4/projects/{project_id}/data_contracts
  Provide the contract IDs as a comma-separated list.
- Delete the test results for a specific contract:
  DELETE /data_quality/v4/projects/{project_id}/data_contracts/{data_contract_id}/test_results