Improve performance by using informational constraints in Db2 Big SQL

Introduction

Constraints are a very powerful feature that can greatly improve the performance of SQL queries. In Db2 Big SQL, all constraints that are associated with external tables (Hadoop or HBase tables) are considered to be informational only (that is, they are defined as NOT ENFORCED). If the data is guaranteed to adhere to the defined constraints, usage of informational constraints is highly recommended, because the performance benefits of using constraints can be quite significant.

Types of constraints

Constraints are associated with tables, and are either defined as part of the table creation process (using the CREATE TABLE statement) or are added to a table’s definition after the table has been created (using the ALTER TABLE statement). You can also use the ALTER TABLE statement to modify constraints. In addition to the constraint definition, attributes are associated with the constraint to specify if the constraint is enforced and whether the constraint is to be used for query optimization or not. Constraints can be dropped at any time; this action does not affect the table’s structure or the data stored in it.

There are several types of constraints:

A NOT NULL constraint is a rule that prevents null values from being entered into one or more columns within a table.
A unique constraint (also referred to as a unique key constraint) is a rule that forbids duplicate values in one or more columns within a table. For example, a unique constraint can be defined on the supplier identifier in the SUPPLIER table to ensure that the same supplier identifier is not given to two suppliers.
A primary key constraint is a column or combination of columns that has the same properties as a unique constraint. You can use a primary key and foreign key constraints to define relationships between tables.
A foreign key constraint (also referred to as a referential constraint or a referential integrity constraint) is a logical rule about values in one or more columns in one or more tables. For example, given a PARTS table and a SUPPLIER table, where each part in the PARTS table contains a column identifying the supplier ID of that part, a foreign key constraint can be defined that links the supplier ID in the PARTS table to the supplier ID in the SUPPLIER table.
A (table) check constraint (simply called a check constraint) sets restrictions on data added to a specific table. For example, a table check constraint can ensure that the salary level for an employee is at least $20,000 whenever salary data is added or updated in a table containing personnel information.
A functional dependency constraint defines a functional dependency between columns and can be defined by using the following clause:
```
     (column-name,...) DETERMINED BY (column-name,...)    
```
A functional dependency is not enforced by the database manager during normal operations such as insert, update, delete, or set integrity. The functional dependency might be used during query compilation to optimize the query.

Constraints on external tables (Hadoop and HBase)

As stated earlier, all constraints that are associated with external tables (Hadoop or HBase tables) are considered to be informational (NOT ENFORCED) only. An informational constraint is a constraint attribute that can be used by the SQL compiler to improve access to data. Informational constraints are neither enforced nor used for additional verification of data; rather, they are used to improve query performance. In some cases, the performance benefits can be quite significant.

IMPORTANT: Because these constraints are not enforced, if the table data violates the constraints, incorrect results can occur. Therefore, it is critical that data adheres to the constraint definition.

The following example illustrates the use of informational constraints and shows how they work. A simple table contains information about applicants’ age and gender:

  create hadoop table myschema.applicants (
      ap_no integer not null,       
      gender char(1) not null,
      constraint genderok check (gender in ('M', 'F')) not enforced enable query optimization,
      age integer not null,         
      constraint ageok check (age between 1 and 65) not enforced enable query optimization
    );

This example contains two clauses that change the behavior of the column constraints. The first clause is NOT ENFORCED, which instructs the database manager not to enforce the checking of this column when data is inserted or updated. The second option is ENABLE QUERY OPTIMIZATION, which is used by the database manager when optimizing the SQL and can greatly improve the performance of queries that reference this table.

If the table is defined with the NOT ENFORCED option, the behavior of INSERT statements might appear odd. The following example will not return errors when it runs against the APPLICANTS table:

  insert into myschema.applicants values
    (1, 'M', 54), (2, 'F', 38),
    (3, 'M', 21), (4, 'F', 89),
    (5, 'C', 10), (6, 'S', 70);

Applicant number 5 has gender C (for child), and applicant number 6 has both gender S (for senior) and a value for age that exceeds the defined constraint. In both cases, the database manager will allow the insert to occur, because the constraints are not enforced. A SELECT statement against the table returns 0 rows:

  select *
    from myschema.applicants
    where gender = 'C';

This is an unexpected result! Even though the value ‘C’ is found within the table, the constraint on this column tells the database manager that the only valid values are ‘M’ and ‘F’. The ENABLE QUERY OPTIMIZATION clause allows the database manager to use this constraint when optimizing the statement.

In this situation, either the data should be changed to match the defined constraint or the constraint should be updated to ensure that the data does not violate the constraint.

For Db2 Big SQL users, data validation is normally done by an ETL tool. Or the data might be sourced from another RDBMS that has the same constraints defined. If the data is not guaranteed to be consistent with the defined constraints, there is a risk of unexpected results.

If a query is returning incorrect results because of an informational constraint, you can use an ALTER TABLE statement to change the constraint by disabling the QUERY OPTIMIZATION attribute, as shown in the following example:

  alter table myschema.applicants
    alter check genderok disable query optimization;

If the query is run again, it returns correct results.

Note: You also have the option to drop the constraint. Dropping the constraint is the only option for primary key and unique constraints.

Another commonly used constraint is a unique or primary key constraint. For example:

  create hadoop table myschema.keys (
      key integer not null,
      constraint key_constraint unique(key) not enforced enable query optimization,
      value varchar(10)
    );

If a query requires only distinct values from a column, and a primary key or unique constraint is defined on that column, the Db2 Big SQL optimizer knows that the column contains only unique values and that no operations are needed to ensure this uniqueness. However, if a unique constraint was not defined on the column, the optimizer must sort and aggregate the data during query execution to guarantee uniqueness. The performance difference between these two scenarios can be quite remarkable.

The best scenario for using informational constraints occurs when you can guarantee that the data conforms to the defined constraints. If your application already checks the table data beforehand, using informational constraints can result in significantly improved query performance.

You can use the database catalog to identify the constraints that are defined on tables. Given the previous examples, the following query shows that for the two referenced tables, there are three defined constraints: two check constraints on the APPLICANTS table, and one unique constraint on the KEYS table.

  select
      varchar(constname,20) as constraint_name,
      varchar(tabname,20) as table_name,
      type
    from syscat.tabconst
    where tabschema='MYSCHEMA';

The following example shows how to use a SELECT statement to get additional information about the check constraints definitions:

  select
      varchar(constname,20),
      varchar(tabname,20),
      varchar(text,25)
    from syscat.checks
    where tabschema='MYSCHEMA';

The following table lists some of the database catalog views that contain information about constraints.

Catalog view	Description	Query example
SYSCAT.CHECKS	Contains a row for each table check constraint	select constname, tabname, text from syscat.checks
SYSCAT.COLCHECKS	Contains a row for each column that is referenced by a table check constraint	select constname, tabname, colname, usage from syscat.colchecks
SYSCAT.KEYCOLUSE	Contains a row for each column that participates in a key defined by a unique, primary key, or foreign key constraint	select constname, tabname, colname, colseq from syscat.keycoluse
SYSCAT.REFERENCES	Contains a row for each referential constraint	select constname, tabname, refkeyname, reftabname, colcount, deleterule, updaterule from syscat.references
SYSCAT.TABCONST	Contains a row for each unique (U), primary key (P), foreign key (F), or table check (K) constraint	select constname, tabname, type from syscat.tabconst

Summary

Constraints are a very powerful feature that can greatly improve the performance of queries, but extreme care must be taken when using informational constraints to ensure that the data adheres to the constraint definitions.

Tips

Improve performance by using informational constraints in Db2 Big SQL

Technical Blog Post

Abstract

Body

Introduction

Types of constraints

Constraints on external tables (Hadoop and HBase)

Summary

UID

Share your feedback

Need support?