Adding a data quality validation rule

The following use case shows an end-to-end process of creating a data quality validation rule, including a sample SQL query, its testing, the query results, and the results analysis.

You can create queries only for the warehouses that you integrated with Databand, which are monitored by a data at rest monitor (Snowflake and BigQuery). Regardless of the selected warehouse, you add and run a query in the same way.

In the use case, we are going to count the number of customers in the CUSTOMERS table who meet the following two conditions:

  • Their phone_number field does not have exactly 10 characters.
  • The last_update_time field is between a specific date (which is one day before the execution_date) and the execution_date.

To create the data quality validation rule:

  1. Go to the Datasets > Data quality page and click Add data quality query.

  2. In the opened window select the warehouse.

  3. Provide an SQL query. When you create your query, follow the guidelines that are provided in the wizard. Make sure that your SQL includes time-based partitioning by using the {{ execution_date }} Jinja template. It is replaced with the actual execution time during the query run. Because the system provides the result of the query as a numerical value, you need to create a query that returns a number. For example, you can do this by using the count() function.

    SQL expression example for a Snowflake warehouse
    SELECT
      COUNT(*)
    FROM
      CUSTOMERS
    WHERE
      LENGTH(phone_number) <> 10
    AND
      last_update_time BETWEEN dateadd(day, -1, '{{ execution_date }}' ) and '{{ execution_date }}'
    

    SQL expression example for a BigQuery warehouse
    SELECT
      COUNT(*)
    FROM
      CUSTOMERS
    WHERE
      LENGTH(phone_number) <> 10
    AND
      last_update_time BETWEEN DATE_SUB('{{ execution_date }}', INTERVAL 1 DAY) AND '{{ execution_date }}'
    

  4. Click Test query to make sure that your query is valid.
    Each time that you run the query you are billed by your database provider.

  5. After you test the query, you can check:

  • The query result - in our case it was completed successfully.
  • The value returned - in our case the number of customers who meet the conditions that are specified in the query.
  • The query duration.

If your query failed, you need to change it and retest it. The positive outcome of the test is a prerequisite to proceed to the next steps.

  1. If the run was successful, you can continue to the next step.
  2. Optional: Schedule runs by using CRON expressions. We want to schedule our query to run at the top of every hour, hence we create the following expression: 0 * * * *.
    The CRON expression is based on the UTC time zone.
  3. Optional: You can also make sure that the query will stop automatically after a time frame specified by you. To do so, select and complete the checkbox.

You can set the time frame up to 60 minutes.

  1. Click Next.
  2. You can provide the name of the query. Click Save query.
  3. The Data quality page displays the details of the saved query and the results of its run, such as:
  • Name
  • Data warehouse on which the query was performed
  • Returned values (presented in a form of a bar chart)
  • Schedule of the query runs
  • Time of the last run
  • Query run metrics (with the deviation from the average returned value)