Using distribution statistics with the DB2 optimizer

Create efficient access plans for faster SQL

To execute a query or a DML statement (INSERT, UPDATE, DELETE), DB2 has to create an access plan. An access plan defines the order in which tables are accessed, which indexes are used, and by which join methods data is correlated. A good access plan is key to the fast execution of an SQL statement. Access plans are created by the DB2 optimizer, a cost-based optimizer that makes its decisions based on the statistics available for the tables and indexes involved. When generating statistics, DB2 can collect so-called distribution statistics in addition to the basic statistics. An understanding of distribution statistics is relevant not only for database administrators but also for application developers, because the use of host variables or parameter markers (java.sql.PreparedStatement in Java) may prevent the optimizer from making optimal use of distribution statistics. This article explains what distribution statistics are, when they are important, and what application developers should consider so that the DB2 optimizer can create efficient access plans.

Basic statistics and distribution statistics

Before examining distribution statistics, let's have a look at the basic statistics that are always collected for a table when RUNSTATS is executed.

Statistics for the table:

  • The number of pages in use
  • The number of pages that contain rows
  • The number of rows that overflow
  • The number of rows in the table (cardinality)
  • For MDC tables, the number of blocks that contain data

Statistics for each column in the table:

  • The cardinality of the column
  • The average length of the column
  • The second highest value in the column
  • The second lowest value in the column
  • The number of NULLs in the column

In general, when executing RUNSTATS, statistics are not only collected for a table, but are also collected for the corresponding indexes. For the statistics gathered for indexes, see the DB2 Administration Guide: Performance - Statistical information that is collected.
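
The basic statistics end up in the system catalog, where you can inspect them directly. The following queries are a minimal sketch against the catalog views SYSCAT.TABLES and SYSCAT.COLUMNS; MYSCHEMA and MYTABLE are placeholder names, and a value of -1 indicates that no statistics have been collected yet:

SELECT CARD, NPAGES, FPAGES, OVERFLOW
  FROM SYSCAT.TABLES
  WHERE TABSCHEMA = 'MYSCHEMA' AND TABNAME = 'MYTABLE'

SELECT COLNAME, COLCARD, AVGCOLLEN, HIGH2KEY, LOW2KEY, NUMNULLS
  FROM SYSCAT.COLUMNS
  WHERE TABSCHEMA = 'MYSCHEMA' AND TABNAME = 'MYTABLE'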

Looking at the basic statistics for a table, you can see that the DB2 optimizer knows how many rows a table consists of (table cardinality) and how many distinct values a column contains (column cardinality). But there is also information that the basic statistics cannot provide. For example, they cannot tell the optimizer the frequency of certain values in a column. Assume a table TABLE_X with about 1,000,000 rows, on which the following query is executed:

SELECT * FROM TABLE_X WHERE COLUMN_Y = 'VALUE_Z'

Wouldn't it be important for the DB2 optimizer to know how many rows of TABLE_X satisfy the condition COLUMN_Y = 'VALUE_Z'? In other words: Wouldn't it be good to know if the query returns 1, 100, 1,000, or 10,000 rows?

Indeed, from the basic statistics, the DB2 optimizer can only estimate the frequency of 'VALUE_Z' in COLUMN_Y. In this case, the optimizer assumes an even distribution of all values in COLUMN_Y, which means it assumes that all values have the same frequency. The estimate is fine if this is approximately true. But if some values appear far more often than others (for example, if 'VALUE_Z' appears 900,000 times, that is, in 90% of all rows), the optimizer cannot take this into account and may produce a sub-optimal access plan. This is where distribution statistics come into play. Distribution statistics provide the optimizer with information about the frequency and the distribution of data, and they complement the basic statistics in an important way if many duplicates are stored in a database and data is not evenly distributed across the tables.
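
Whether a column's data is skewed can be checked with a simple GROUP BY query before deciding on distribution statistics. A minimal sketch, using the placeholder names TABLE_X and COLUMN_Y from above:

SELECT COLUMN_Y, COUNT(*) AS FREQUENCY
  FROM TABLE_X
  GROUP BY COLUMN_Y
  ORDER BY FREQUENCY DESC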

Types of distribution statistics - Frequency and quantile statistics

There are two types of distribution statistics: frequency statistics and quantile statistics. Let's examine the two types by looking at a sample table.

A sample table, "CARS," representative of an automobile manufacturer, contains a row for each produced car. A car can be identified by its ID, so that "ID" is the primary key (PK) of the table "CARS." In addition, there is a column "STATE" that indicates the step in the manufacturing process in which a car is currently. The manufacturing process of a car starts with step 1, followed by the steps 2, 3, ..., 49, 50, 51, ..., 98, 99, and ends with step 100 -- step 100 means that a car is completely finished. The rows of the cars completely finished also remain in the table and are used by follow-up processes like complaint management, quality assurance, and so on. The automobile manufacturer produces 10 different car models (column "TYPE"). To simplify things, the models are named A, B, C, D, ..., J in the sample table. In addition to the primary key index (on column "ID"), there is an index on column "STATE" ("I_STATE") and an index on column "TYPE" ("I_TYPE"). In reality, a table "CARS" would comprise many more columns than "ID," "STATE," and "TYPE." In the sample table, these other columns are missing for clarity reasons.

Frequency statistics

Let's assume the table CARS currently has about 1,000,000 entries and the different models appear with the following frequency:

Table 1. Frequency statistics for column TYPE of table CARS
TYPE    COUNT(TYPE)
A       506135
B       301985
C       104105
D       52492
E       19584
F       10123
G       4876
H       4589
I       4403
J       3727

Model A is the car that buyers choose most often, therefore 50% of all produced cars are of this type. Model A is followed by the models B and C with 30% and 10% of all cars respectively. The rest of the models make up only 10% altogether.

The table shows the exact frequency statistics for column "TYPE." From the basic statistics, the DB2 optimizer only knows that the table contains 1,000,000 rows (table cardinality) and that the column contains 10 different values (= models), namely A to J. Without distribution statistics, the optimizer assumes that each value appears with the same frequency, that is, about 100,000 times. With distribution statistics generated for the column "TYPE," the optimizer knows the real frequency of each model.

Frequency statistics are used by the optimizer to calculate filter factors for predicates that check for equality or inequality. For example:

SELECT * FROM CARS WHERE TYPE = 'H'
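
To put numbers on this: with the frequency statistics from Table 1, the optimizer estimates that TYPE = 'H' qualifies 4,589 of the 1,012,019 rows, a filter factor of roughly 0.0045. Without them, it would assume one tenth of all rows (about 101,000), overestimating the result size by a factor of more than 20.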

Quantile statistics

In contrast to frequency statistics, quantile statistics are not about the frequency of individual values but about how many rows of a table lie below or above a certain value (or between two values). Quantile statistics thus tell the optimizer whether the values of a column are clustered in certain ranges. To gather this information, DB2 considers a column's values in ascending sort order and records the value found at regular row intervals.

Let's look at the column "STATE" of the table CARS in ascending sort order. In regular row intervals, the corresponding values of "STATE" are determined.

Table 2. Quantile statistics for column STATE of table CARS
COUNT(row)    STATE ASC
5479          1
54948         10
109990        21
159885        31
215050        42
265251        52
320167        63
370057        73
424872        84
475087        94
504298        100
...           100
1012019       100

Because completely finished cars are not deleted from the table, the number of cars in state 100 (= completely finished) is higher than the number of cars in all other states. The completely finished cars make up 50% of all entries in the table CARS.

Note: In reality, the number of completely finished cars would be even higher (for example, over 99%). We'll look at this case later in a concrete sample.
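
You can approximate this quantile view yourself with a cumulative row count in ascending STATE order. A minimal sketch, assuming DB2's OLAP (window) functions are available:

SELECT STATE,
       COUNT(*) AS ROWS_IN_STATE,
       SUM(COUNT(*)) OVER (ORDER BY STATE) AS ROWS_UP_TO_STATE
  FROM CARS
  GROUP BY STATE
  ORDER BY STATE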

The table shows the quantile statistics for the column "STATE." With the information about how many rows lie below and above the recorded values, the optimizer can calculate filter factors for predicates that test for less than (or equal), greater than (or equal), or BETWEEN. For example:

SELECT * FROM CARS WHERE STATE < 100

SELECT * FROM CARS WHERE STATE BETWEEN 50 AND 70

The filter factors calculated from the available quantile statistics are not exact, but the estimation error stays below 5%, even when only 20 values are collected.

Usage of distribution statistics by the DB2 optimizer - A sample

Let's look at a complete sample where the DB2 optimizer can use distribution statistics to better estimate filter factors that lead to a better access plan.

The sample query reads data from the table CARS defined above. For the data in the table CARS, the following assumptions apply:

  • The table cardinality is 1,000,000, which means the table consists of 1,000,000 rows
  • 99.9% of all cars in the table (that is, 999,000) are completely finished (column "STATE" = 100), as the information about these cars has to be retained for follow-on processes (complaint management, quality assurance, and so on). The remaining 1,000 cars are currently going through the manufacturing process.
  • The 10 different car models (column "TYPE") A to J offered by the manufacturer appear with nearly the same frequency in the table.

Note: The script create_table_cars.sql, which you can download from this article, creates such a table CARS, including indexes on the columns "STATE" and "TYPE." The results of this sample can be reproduced with this table. The sample table was created within the DB2 SAMPLE database (command db2sampl) using default settings for DBM CFG and DB CFG.

The sample query selects all cars of model A that are currently going through the manufacturing process:

SELECT * FROM CARS WHERE STATE < 100 AND TYPE = 'A'

First, analyze the access plan that the optimizer chooses when no distribution statistics are available, that is, when only basic statistics exist for the table CARS and its indexes.
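
To reproduce the access plans shown in this article, you can use Visual Explain or the explain facilities from the command line. A minimal sketch, assuming the explain tables have already been created (for example, with the EXPLAIN.DDL script shipped in sqllib/misc):

EXPLAIN PLAN FOR SELECT * FROM SAMPLE.CARS WHERE STATE < 100 AND TYPE = 'A'

db2exfmt -d SAMPLE -1 -o plan.txt

The db2exfmt call formats the most recently explained statement with default options into plan.txt.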

Figure 1. Access plan for sample query without distribution statistics

As the optimizer is not aware of the uneven distribution of the STATE values, it decides to use the index I_TYPE. Using this index is costly because about 100,000 RIDs (record IDs) have to be read from the index before the table CARS is accessed. In addition, the estimate of the number of rows returned by the query is wrong: because the optimizer assumes that all manufacturing steps (1 to 100) have the same frequency, it does not expect the predicate STATE < 100 to filter out a significant number of rows. But as you know, it does, because only 1,000 of the 1,000,000 cars are currently going through the manufacturing process.

When the query is executed without distribution statistics, a snapshot for dynamic SQL returns the following monitor values (given that the required monitor switches are active):
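
A minimal sketch of the CLP commands that produce such a snapshot; the switches can also be turned on instance-wide through the DFT_MON_* configuration parameters of the database manager:

db2 update monitor switches using bufferpool on sort on statement on timestamp on
db2 get snapshot for dynamic sql on sample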

Listing 1. Snapshot for sample query without distribution statistics
Number of executions                       = 1
Number of compilations                     = 1
Worst preparation time (ms)                = 9
Best preparation time (ms)                 = 9
Internal rows deleted                      = 0
Internal rows inserted                     = 0
Rows read                                  = 99336
Internal rows updated                      = 0
Rows written                               = 0
Statement sorts                            = 0
Statement sort overflows                   = 0
Total sort time                            = 0
Buffer pool data logical reads             = 8701
Buffer pool data physical reads            = 8131
Buffer pool temporary data logical reads   = 0
Buffer pool temporary data physical reads  = 0
Buffer pool index logical reads            = 165
Buffer pool index physical reads           = 155
Buffer pool temporary index logical reads  = 0
Buffer pool temporary index physical reads = 0
Total execution time (sec.ms)              = 0.530903
Total user cpu time (sec.ms)               = 0.280403
Total system cpu time (sec.ms)             = 0.230332
Statement text                             = SELECT ID, TYPE, STATE FROM SAMPLE.CARS
                                             WHERE STATE < 100 AND TYPE = 'A'

We will not analyze these values further at the moment; remember them for comparison with the monitor values for the same query once distribution statistics are available.

Next, generate distribution statistics for the table CARS and execute the query again. In this case, the optimizer chooses the following access plan:

Figure 2. Access plan for sample query with distribution statistics

The costs for this access plan are clearly below the costs of the access plan without distribution statistics: 203.809 timerons instead of 3242.63. This is because the optimizer now knows that the predicate STATE < 100 is highly selective and returns only the roughly 1,000 cars currently going through the manufacturing process. So, in this case, the table CARS is accessed not through the index I_TYPE but through the index I_STATE. Besides, the total number of rows in the result set is now estimated correctly: 1,000 cars are not finished yet, and the different models appear with the same frequency, which leads to a result set of about 100 rows.

The access plan with distribution statistics is better than the access plan without distribution statistics. But does this have an impact on the query's execution time, too? Listing 2 contains the corresponding snapshot monitor data:

Listing 2. Snapshot for sample query with distribution statistics
Number of executions                       = 1
Number of compilations                     = 1
Worst preparation time (ms)                = 9
Best preparation time (ms)                 = 9
Internal rows deleted                      = 0
Internal rows inserted                     = 0
Rows read                                  = 1000
Internal rows updated                      = 0
Rows written                               = 0
Statement sorts                            = 1
Statement sort overflows                   = 0
Total sort time                            = 5
Buffer pool data logical reads             = 11
Buffer pool data physical reads            = 10
Buffer pool temporary data logical reads   = 0
Buffer pool temporary data physical reads  = 0
Buffer pool index logical reads            = 12
Buffer pool index physical reads           = 9
Buffer pool temporary index logical reads  = 0
Buffer pool temporary index physical reads = 0
Total execution time (sec.ms)              = 0.014597
Total user cpu time (sec.ms)               = 0.000000
Total system cpu time (sec.ms)             = 0.010014
Statement text                             = SELECT ID, TYPE, STATE FROM SAMPLE.CARS
                                             WHERE STATE < 100 AND TYPE = 'A'

Table 3 compares the snapshot monitor values for the query execution with and without distribution statistics:

Table 3. Comparison of snapshot monitor values
Snapshot value                    Without distribution statistics    With distribution statistics
Rows read                         99,336                             1,000
Buffer pool data logical reads    8,701                              11
Buffer pool index logical reads   165                                12
Total execution time (sec.ms)     0.530903                           0.014597

You can see that DB2 has to evaluate far fewer rows to execute the query when distribution statistics are available. This has a positive effect on the CPU costs as well as on the I/O costs. Most important, however, is the total execution time, because it determines the response time of the application: 0.014597 seconds with distribution statistics versus 0.530903 seconds without them is a difference of a factor of 36.

At 0.014597 versus 0.530903 seconds in our sample, a user cannot perceive the difference, as both values lie in the sub-second range. Nevertheless, the difference should not be ignored. If more complex queries are executed, or if several queries are executed in sequence, the difference in execution time is not sub-seconds but seconds or even minutes.

Generation of distribution statistics

As already mentioned, distribution statistics are not always collected when statistics are generated using the RUNSTATS command. This makes sense because distribution statistics are important in the case of many duplicates or unevenly distributed data, but they do not provide much benefit otherwise.

The following RUNSTATS command collects only the basic statistics for the table CARS (in schema SAMPLE) and the corresponding indexes:

RUNSTATS ON TABLE SAMPLE.CARS AND INDEXES ALL

If, in addition, distribution statistics (frequency and quantile statistics) should be collected for all columns of the table CARS, execute the following command:

RUNSTATS ON TABLE SAMPLE.CARS WITH DISTRIBUTION AND INDEXES ALL

Generating distribution statistics means significant additional work for DB2 and, therefore, increases the execution time of the RUNSTATS command. So restrict the generation of distribution statistics to those columns that actually require them:

RUNSTATS ON TABLE SAMPLE.CARS WITH DISTRIBUTION ON COLUMNS (TYPE, STATE) AND INDEXES ALL

Distribution statistics should be collected for columns satisfying the following conditions:

  • The column has many duplicates (frequency statistics), or the values of the column are not evenly distributed, which means they are clustered in certain areas (quantile statistics)
  • The column is used in predicates that check for equality or inequality (frequency statistics), or the column is used in predicates that check for less than (or equal), greater than (or equal), or BETWEEN (quantile statistics)

In the case of frequency statistics, it is important to define for how many values the number of duplicates should be gathered; it would be too costly to do so for all values of a column. If no explicit number is defined when executing RUNSTATS, DB2 uses the default provided by the database configuration parameter NUM_FREQVALUES. Because the default for NUM_FREQVALUES is 10, DB2 collects the number of duplicates for the 10 values that appear most often in the column, given that RUNSTATS is executed without an explicit number and NUM_FREQVALUES has not been altered.

Similarly, a number controlling precision has to be defined for the quantile statistics: the number of "measurements" to be used. The values of a column are considered in an ascending sort order, and at regular row intervals the corresponding values are recorded. The more measurements used, the more accurate the optimizer's estimates of the filter factors for predicates that check for ranges (<, >, <=, >=, BETWEEN). If no explicit number is defined when executing RUNSTATS, DB2 uses the default provided by the database parameter NUM_QUANTILES. The default for NUM_QUANTILES is 20, which means 20 measurements are used. This is already a good value because it guarantees that the optimizer, using the quantile statistics, has a maximum estimation error of only about 5% for the determined filter factors.
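
Both defaults can be inspected and changed in the database configuration. A minimal sketch of the corresponding CLP commands for the SAMPLE database:

db2 get db cfg for sample
db2 update db cfg for sample using num_freqvalues 10 num_quantiles 20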

If the values for NUM_FREQVALUES and NUM_QUANTILES are not to be taken from the database configuration (DB CFG), they can be defined explicitly when executing RUNSTATS:

RUNSTATS ON TABLE SAMPLE.CARS WITH DISTRIBUTION ON COLUMNS (TYPE NUM_FREQVALUES 10 NUM_QUANTILES 20, STATE NUM_FREQVALUES 15 NUM_QUANTILES 30) AND INDEXES ALL
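
If RUNSTATS has to be triggered from within an application (for example, over JDBC), newer DB2 releases offer the SYSPROC.ADMIN_CMD procedure. As a hedged sketch, assuming the procedure is available in your release:

CALL SYSPROC.ADMIN_CMD('RUNSTATS ON TABLE SAMPLE.CARS WITH DISTRIBUTION ON COLUMNS (TYPE, STATE) AND INDEXES ALL')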

How to check if distribution statistics exist

To check whether distribution statistics have been collected or not for a certain table, look at the contents of the catalog view SYSCAT.COLDIST:

SELECT * FROM SYSCAT.COLDIST WHERE TABSCHEMA = 'SAMPLE' AND TABNAME = 'CARS'

The view SYSCAT.COLDIST has the following structure:

Table 4. Structure of SYSCAT.COLDIST
Column name   Data type      Nullable   Description
TABSCHEMA     VARCHAR(128)   No         Qualifier of the table to which this entry applies
TABNAME       VARCHAR(128)   No         Name of the table to which this entry applies
COLNAME       VARCHAR(128)   No         Name of the column to which this entry applies
TYPE          CHAR(1)        No         F = frequency (most frequent value); Q = quantile value
SEQNO         SMALLINT       No         If TYPE = F, N identifies the Nth most frequent value; if TYPE = Q, N identifies the Nth quantile value
COLVALUE      VARCHAR(254)   Yes        The data value, as a character literal or a NULL value
VALCOUNT      BIGINT         No         If TYPE = F, the number of occurrences of COLVALUE in the column; if TYPE = Q, the number of rows whose value is less than or equal to COLVALUE
DISTCOUNT     BIGINT         Yes        If TYPE = Q, the number of distinct values that are less than or equal to COLVALUE (NULL if unavailable)

SYSCAT.COLDIST contains entries for a certain table only if distribution statistics have been collected for at least one of its columns. If RUNSTATS is executed later again without WITH DISTRIBUTION, the corresponding entries of the table in SYSCAT.COLDIST are deleted.
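
To read the collected values in a convenient order, restrict and sort the catalog query; a minimal sketch for the two columns of the sample:

SELECT COLNAME, TYPE, SEQNO, COLVALUE, VALCOUNT
  FROM SYSCAT.COLDIST
  WHERE TABSCHEMA = 'SAMPLE' AND TABNAME = 'CARS'
    AND COLNAME IN ('TYPE', 'STATE')
  ORDER BY COLNAME, TYPE, SEQNO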

Distribution statistics and parameter markers / host variables

JDBC offers two possibilities to execute a dynamic SQL statement and, therefore, provides two different interfaces:

  • java.sql.Statement
  • java.sql.PreparedStatement

PreparedStatement is a sub-interface of Statement and, in contrast to Statement, allows the use of parameter markers (placeholders; in other programming languages, such placeholders are called host variables). In the case of a PreparedStatement, the SQL statement to execute -- parameter markers included -- is first compiled, then values are bound to the parameter markers, and finally the SQL statement is executed.

The following code snippets show the differences when using a Statement and a PreparedStatement.

Listing 3. Executing dynamic SQL statements using the JDBC Statement interface
java.sql.Connection con = ...;

java.sql.Statement stmt1 = con.createStatement();
String insert1 = "INSERT INTO TABLE_X (COL_Y) VALUES ('ABC')";
stmt1.executeUpdate(insert1);

java.sql.Statement stmt2 = con.createStatement();
String insert2 = "INSERT INTO TABLE_X (COL_Y) VALUES ('XYZ')";
stmt2.executeUpdate(insert2);

con.commit();
Listing 4. Executing dynamic SQL statements using the JDBC PreparedStatement interface
java.sql.Connection con = ...;

String insert = "INSERT INTO TABLE_X (COL_Y) VALUES (?)";
java.sql.PreparedStatement pstmt = con.prepareStatement(insert);

pstmt.setString(1, "ABC");
pstmt.executeUpdate();

pstmt.setString(1, "XYZ");
pstmt.executeUpdate();

con.commit();

If a simple SQL statement is executed multiple times (like the INSERT statement in the sample), a PreparedStatement has the advantage that it has to be compiled by the database only once; it can then be executed repeatedly without re-compilation. If, as suggested in the sample, several thousand rows are to be inserted, using a PreparedStatement may provide a shorter execution time because the prepare/compile step is required only once instead of thousands of times.

However, Java developers often use PreparedStatements because coding is more elegant when values/filter criteria are provided at run time. Look at the following code snippets:

Listing 5. Filling in filter criteria using string concatenation
int state = 100;
String type = "A";
...
java.sql.Connection con = ...;

java.sql.Statement stmt = con.createStatement();
String select = "SELECT * FROM CARS WHERE 
STATE < " + state + " AND TYPE = '" + type + "'";

java.sql.ResultSet rs = stmt.executeQuery(select);
while (rs.next()) {
    ...
}
Listing 6. Filling in filter criteria using parameter markers
int state = 100;
String type = "A";
...
java.sql.Connection con = ...;

String select = "SELECT * FROM CARS WHERE STATE < ? AND TYPE = ?";

java.sql.PreparedStatement pstmt = con.prepareStatement(select);
pstmt.setInt(1, state);
pstmt.setString(2, type);

java.sql.ResultSet rs = pstmt.executeQuery();
while (rs.next()) {
    ...
}

The variant with the PreparedStatement is more elegant, as no string concatenation is required to put the values for STATE and TYPE into the SQL statement. The disadvantage in this case is that the compilation of the SELECT (the creation of the access plan) occurs before the values for the predicates in the WHERE clause are bound. But predicates with concrete values are exactly what the optimizer needs to make use of available distribution statistics.

Let's look again at our sample query for the table CARS, but with parameter markers instead of concrete values:

SELECT * FROM CARS WHERE STATE < ? AND TYPE = ?

Distribution statistics for the table CARS are still available, as they have been collected in the meantime. Nevertheless, another access plan is generated when parameter markers are used.

Figure 3. Access plan for sample query with parameter markers and distribution statistics

Although the access plan is better than the access plan with concrete values without distribution statistics, it is worse than the access plan with concrete values with distribution statistics. This can also be seen in the monitor output from a snapshot for dynamic SQL, as shown in Listing 7:

Listing 7. Snapshot for sample query with parameter markers and distribution statistics
Number of executions                       = 1
Number of compilations                     = 1
Worst preparation time (ms)                = 10
Best preparation time (ms)                 = 10
Internal rows deleted                      = 0
Internal rows inserted                     = 0
Rows read                                  = 99
Internal rows updated                      = 0
Rows written                               = 0
Statement sorts                            = 1
Statement sort overflows                   = 0
Total sort time                            = 0
Buffer pool data logical reads             = 11
Buffer pool data physical reads            = 10
Buffer pool temporary data logical reads   = 0
Buffer pool temporary data physical reads  = 0
Buffer pool index logical reads            = 268
Buffer pool index physical reads           = 125
Buffer pool temporary index logical reads  = 0
Buffer pool temporary index physical reads = 0
Total execution time (sec.ms)              = 0.081908
Total user cpu time (sec.ms)               = 0.060086
Total system cpu time (sec.ms)             = 0.010014
Statement text                             = SELECT ID, TYPE, STATE FROM SAMPLE.CARS
                                             WHERE STATE < ? AND TYPE = ?

When comparing these values with the ones gathered before, the following table results:

Table 5. Comparison of snapshot monitor values continued
Snapshot value                    Without distribution statistics    With distribution statistics    With distribution statistics and parameter markers
Rows read                         99,336                             1,000                           99
Buffer pool data logical reads    8,701                              11                              11
Buffer pool index logical reads   165                                12                              268
Total execution time (sec.ms)     0.530903                           0.014597                        0.081908

The value for "Rows read" is the lowest with parameter markers, but this is not really important, as the "Buffer pool index logical reads" go up in this case. Important is the fact that the "Total execution time" is worse than in the case with distribution statistics and concrete values, by a factor of eight in the sample. As already said, in the sample, look at a very simple query that operates on a small amount of data. Therefore, the differences regarding the execution time seem to be irrelevant. But in the case of more complex queries with execution times of several seconds/minutes, the differences become obvious.

The sample shows that distribution statistics are of limited use in conjunction with parameter markers. In the case of many duplicates or uneven data distribution and complex queries, this can lead to worse execution times than without parameter markers. This does not mean that using PreparedStatements is always bad. On the contrary: a PreparedStatement is recommended when a simple statement is executed many times with different values, because the statement has to be compiled only once. In the case of complex queries, as they are common in reporting systems and DSS (Decision Support Systems), the use of parameter markers is likely to bring disadvantages: as you have seen, distribution statistics cannot be fully exploited, and the compile-once advantage does not pay off because such queries are typically executed only once.

Summary

In this article, you have taken a look at the use of distribution statistics by the DB2 optimizer. If a database contains many duplicate values or data is not evenly distributed, the DB2 optimizer can produce better access plans for SQL statements when distribution statistics are generated in addition to the basic statistics. The article explains the different kinds of distribution statistics, shows how these statistics can be generated using the RUNSTATS command, and contains a sample scenario that illustrates the importance of distribution statistics in certain cases. You have also examined how the use of parameter markers/host variables may limit the consideration of distribution statistics by the DB2 optimizer.




Related topics

  • Administration Guide: Performance: This DB2 UDB documentation provides additional information concerning database statistics (specifically "Chapter 5. System catalog statistics").
  • DB2 Information Center: Find the complete DB2 UDB documentation in HTML format.