Creating SQL-based data quality rules
You can create SQL-based data quality rules for assessing the quality of the data in your project. Use such rules for measuring errors rather than for evaluating compliance with specific quality criteria.
To have an SQL-based data quality rule contribute to the data quality scores of a column and the containing asset, add that column as a related item to the corresponding rule. Use the Validates the data quality of relationship type. The same score and issues are reported for all columns that are linked with this relationship type.
- Requirement: For AI-generated rule descriptions and expression explanations, the following project settings must be enabled:
- Data intelligence settings: use of generative AI capabilities in the project
- Data quality settings: the option to explain data quality rules with AI
To create an SQL-based data quality rule:
- Open a project, and click New asset > Measure and monitor data quality.
- Define details:
- Specify a name for the data quality rule.
- Optional: Provide a description. If AI-generated descriptions are enabled, an expression explanation and a description that is based on that explanation are generated when you save the rule. This description is then automatically updated whenever you change the SQL statement, unless you edited the description manually.
- Optional: Select a data quality dimension to describe the primary data quality metric for the rule logic in this asset. The selected dimension can be used as a report category, for filtering, or for visualizing selected data.
- Select the SQL query rule logic.
- Specify the source of the data to check by selecting an existing connection or creating a new one. The connection must support SQL queries. For supported data sources, see Supported data sources for curation and data quality.
If you select an existing connection and this connection was created with personal credentials, you will need to unlock the connection.
- Create your SQL query. You can craft the SQL query yourself, or enter a plain text query and have it converted to SQL by using one of the available models.
The option to work with plain text queries is technology preview functionality. Technology preview functionality is available for use in development and testing, but is not meant for use in production.
To work with plain text queries, the project must be enabled for natural language queries; this is a one-time operation. For details, see Data intelligence tools settings.
You can provide additional context to the models for more accurate and context-aware SQL output by uploading query samples. For more information, see Providing additional context for text-to-SQL conversions.
Remember: Generated queries are not guaranteed to be identical from run to run, so you might see variations.
- In the free-form text area of the SQL tab, enter your text query, select one of the available models, and generate the SQL statement. Because of how SQL-based rules work, make sure to enter a query that returns the exception records, for example, “Order date is in the future” or “Customer age is below 18”.
Tip: The meta-llama/llama-3-3-70b model is better at following prompt instructions for converting text into SQL and thus might provide more accurate results compared to Granite models. When you enter your text query, you can provide additional information that serves as a custom prompt to the model. Precede that additional information with the disclaimer Note:.
Project assets from the selected connection and database are used as reference assets for translating your text query into an SQL query. By default, the large language model selects up to 40 of the most relevant assets from these reference assets for the translation. To select reference assets manually, click Reference assets. Do not select any locally uploaded CSV files; such files are not considered for generating SQL queries.
- Copy the generated SQL statement to the SQL area. You can still update the SQL query as required.
Whether you craft the query yourself or use a generated query, make sure that the SELECT statement meets the following conditions:
- The statement returns only columns with unique names. Columns with duplicate names cause validation errors when the rule is run, even if the rule test passed.
- The statement returns the number of records that do not meet your data quality condition. SQL-based rules work in a different way than rules that are created from data quality definitions: they report the records that the SELECT statement returns as failed or Not met. In addition, the total number of records that is reported equals the number of returned records, not the number of checked records.
For example, assume you have a table db2admin.credit_card with 31 rows and you want to check how many records with card type AMEX are in the table. The difference is as follows:
Data quality rule from a data quality definition
You check for records where the card type is AMEX.
Expression in the data quality definition: Col = 'AMEX'
Bound expression in the data quality rule: credit_card.card_type = 'AMEX'
Sample result: Total: 31 | Met: 4 (12.9%) | Not met: 27 (87.1%)
SQL-based data quality rule
You check for records where the card type is not AMEX.
SELECT statement: select card_type from db2admin.credit_card where card_type <> 'AMEX'
Sample result: Total: 27 | Met: 0 (0%) | Not met: 27 (100%)
Also check the set of Sample SQL statements for data quality rules. These samples demonstrate how you can write SQL rules to return records that do not meet your quality criteria. You can copy the provided statements into your own data quality rules and adjust them as needed.
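The counting behavior of an SQL-based rule can be sketched with a small, self-contained example. The table contents below are made up to mirror the sample above, and SQLite stands in for the actual data source:

```python
import sqlite3

# Hypothetical sample data: a credit_card table with 31 rows,
# 4 of which have card type AMEX.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE credit_card (card_type TEXT)")
rows = [("AMEX",)] * 4 + [("VISA",)] * 27
conn.executemany("INSERT INTO credit_card VALUES (?)", rows)

# An SQL-based rule selects the exception records, that is,
# the records that do NOT meet the quality condition.
exceptions = conn.execute(
    "SELECT card_type FROM credit_card WHERE card_type <> 'AMEX'"
).fetchall()

# All returned records are reported as Not met, and the reported
# total equals the number of returned records (27), not the
# number of checked records (31).
total = len(exceptions)
print(f"Total: {total} | Met: 0 (0%) | Not met: {total} (100%)")
# prints: Total: 27 | Met: 0 (0%) | Not met: 27 (100%)
```

This is why the sample result for the SQL-based rule shows a total of 27 rather than 31: the rule never sees the records that pass the condition.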
- Consider these conventions for specifying column, table, and schema names in your SELECT statement:
- Table and schema names in PostgreSQL data sources are case-sensitive. You might need to enclose the names in double quotation marks like in this example: "schema"."table_name"
- Try to avoid SELECT * queries. Such queries might cause validation errors when column names change. Narrow down the column selection.
- If the column name does not start with an alphabetic character, or contains characters other than alphabetic characters, numeric characters, or underscores, use an alias for the column name.
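The aliasing convention can be illustrated with a minimal sketch. The table and column names are hypothetical, and SQLite stands in for the actual data source; the quoting style for identifiers may differ on other databases:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table with a column name that starts with a digit
# and contains a hyphen - such a name needs quoting and an alias.
conn.execute('CREATE TABLE orders ("2nd-contact" TEXT, status TEXT)')
conn.execute("INSERT INTO orders VALUES ('n/a', 'open')")

# Avoid SELECT *: name the columns explicitly, and alias the
# problematic column so that the result set exposes a name made up
# of alphabetic characters, digits, and underscores only.
cur = conn.execute(
    'SELECT "2nd-contact" AS second_contact, status FROM orders'
)
print([col[0] for col in cur.description])
# prints: ['second_contact', 'status']
```

Listing columns explicitly also keeps the rule stable if unrelated columns are later added to or removed from the table.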
At any time, you can test the SQL statements. Note that the test returns only the names of the columns selected by your query. No actual processing is done. A validity check is done when you click Next. You can't proceed unless your query passes this check.
In projects where AI-generated content is enabled, an explanation for the SQL statement is generated when you save the rule. The explanation is updated for every change to the SQL statement.
- Configure output settings and content.
Select whether you want rule output to be written to a database. If not, only some statistical information is provided in the rule's run history. For more information, see Configuring output settings for data quality rules.
- Review your configuration. To make sure that your rule is configured properly, you can test it before you actually save it to the project.
The output of the rule test is displayed directly. In general, a maximum of 100 exception records is shown. If you configured any output settings, the output matches your configuration. If you didn't configure any output settings, the columns from the SELECT statement are shown.
To change the configuration, click the Edit icon on the tile and update the settings. When your review is complete, click Create or Create rule & define job. For both create options, the rule and its related DataStage flow are added to the project. The default name of the DataStage flow is DataStage flow of data rule <rulename>. If you select to create only the rule, a job with default settings is created when you run the rule directly, or you can create a job with customized settings anytime later. If you select to also define a job, you can directly configure the job settings. For more information about the settings, see Creating jobs for data quality rules.
If your rule is configured properly without any information missing, it has the status Ready, which means that the rule can be run. The status Not ready indicates that the rule can't be run because of SQL syntax errors, modified dependencies, or other rule definition issues, for example, because the password for accessing the data source changed. This status is more likely to be seen for data quality rules that were created by using the IBM Knowledge Catalog API: Create data quality rule. When you create data quality rules by using the API, make sure to also test and validate the rule.
To confirm that a rule is still valid before you run it manually, you can check the rule status by selecting Validate from the overflow menu.