Adding a data quality validation rule
The following use case shows an end-to-end process of creating a data quality validation rule, including a sample SQL query, its testing, the query results, and the results analysis.
You can create queries only for the warehouses that you integrated with Databand, which are monitored by a data at rest monitor (Snowflake and BigQuery). Regardless of the selected warehouse, you add and run a query in the same way.
In the use case, we are going to count the number of customers in the CUSTOMERS table who meet the following two conditions:
- Their
phone_number
field does not have exactly 10 characters. - The
last_update_time
field is between a specific date (which is one day before theexecution_date
) and theexecution_date
.
To create the data quality validation rule:
-
Go to the Datasets > Data quality page and click Add data quality query.
-
In the opened window select the warehouse.
-
Provide an SQL query. When you create your query, follow the guidelines that are provided in the wizard. Make sure that your SQL includes time-based partitioning by using the {{ execution_date }} Jinja template. It is replaced with the actual execution time during the query run. Because the system provides the result of the query as a numerical value, you need to create a query that returns a number. For example, you can do this by using the
count()
function.SQL expression example for a Snowflake warehouse
SELECT COUNT(*) FROM CUSTOMERS WHERE LENGTH(phone_number) <> 10 AND last_update_time BETWEEN dateadd(day, -1, '{{ execution_date }}' ) and '{{ execution_date }}'
SQL expression example for a BigQuery warehouse
SELECT COUNT(*) FROM CUSTOMERS WHERE LENGTH(phone_number) <> 10 AND last_update_time BETWEEN DATE_SUB('{{ execution_date }}', INTERVAL 1 DAY) AND '{{ execution_date }}'
-
Click Test query to make sure that your query is valid.
Each time that you run the query you are billed by your database provider. -
After you test the query, you can check:
- The query result - in our case it was completed successfully.
- The value returned - in our case the number of customers who meet the conditions that are specified in the query.
- The query duration.
If your query failed, you need to change it and retest it. The positive outcome of the test is a prerequisite to proceed to the next steps.
- If the run was successful, you can continue to the next step.
- Optional: Schedule runs by using CRON expressions. We want to schedule our query to run at the top of every hour, hence we create the following expression: 0 *
* * *.
The CRON expression is based on the UTC time zone. - Optional: You can also make sure that the query will stop automatically after a time frame specified by you. To do so, select and complete the checkbox.
You can set the time frame up to 60 minutes.
- Click Next.
- You can provide the name of the query. Click Save query.
- The Data quality page displays the details of the saved query and the results of its run, such as:
- Name
- Data warehouse on which the query was performed
- Returned values (presented in a form of a bar chart)
- Schedule of the query runs
- Time of the last run
- Query run metrics (with the deviation from the average returned value)