## Contents

- Introduction
- Basic statistics and distribution statistics
- Types of distribution statistics - Frequency and quantile statistics
- Usage of distribution statistics by the DB2 optimizer - A sample
- Generation of distribution statistics
- How to check if distribution statistics exist
- Distribution statistics and parameter markers / host variables
- Summary
- Downloadable resources
- Related topics

# Distribution statistics uses with the DB2 optimizer

Create efficient access plans for faster SQL

To execute a query or a DML statement (`INSERT`, `UPDATE`, `DELETE`), DB2 has to create an *access plan*. An access plan defines the order in which tables are accessed, which indexes are used, and by what kind of join methods data is correlated. A good access plan is key to the fast execution of an SQL statement. The access plans are created by the DB2 optimizer. The DB2 optimizer is a cost-based optimizer, which means it makes its decisions based on statistics that are available for the tables and indexes. When generating statistics, DB2 offers the possibility to create so-called distribution statistics in addition to the basic statistics. An understanding of distribution statistics is relevant not only for database administrators, but also for application developers. Application developers have to be careful, as in certain situations distribution statistics are important for the DB2 optimizer: the use of host variables or parameter markers (`java.sql.PreparedStatement` in Java) may prevent the optimizer from making optimal use of distribution statistics. This article explains distribution statistics, when they are important, and what application developers should consider so that the DB2 optimizer can create efficient access plans.

## Basic statistics and distribution statistics

Before examining distribution statistics, let's have a look at the basic statistics that are always collected for a table when `RUNSTATS` is executed.

Statistics for the table:

- The number of pages in use
- The number of pages that contain rows
- The number of rows that overflow
- The number of rows in the table (cardinality)
- For MDC tables, the number of blocks that contain data

Statistics for each column in the table:

- The cardinality of the column
- The average length of the column
- The second highest value in the column
- The second lowest value in the column
- The number of NULLs in the column

In general, when executing `RUNSTATS`, statistics are collected not only for a table, but also for the corresponding indexes. For the statistics gathered for indexes, see the DB2 Administration Guide: Performance - Statistical information that is collected.

Looking at the basic statistics for a table, you can see that the DB2 optimizer knows how many rows a table contains (table cardinality) and how many distinct values a column contains (column cardinality). But there is also information that the basic statistics cannot provide. For example, the basic statistics cannot tell the optimizer about the frequency of certain values in a column. Assume a table TABLE_X with about 1,000,000 rows, and the following query executed against it:

`SELECT * FROM TABLE_X WHERE COLUMN_Y = 'VALUE_Z'`

Wouldn't it be important for the DB2 optimizer to know how many rows of TABLE_X satisfy the condition `COLUMN_Y = 'VALUE_Z'`? In other words: Wouldn't it be good to know whether the query returns 1, 100, 1,000, or 10,000 rows?

Indeed, from the basic statistics, the DB2 optimizer can only estimate the frequency of `'VALUE_Z'` in `COLUMN_Y`. In this case, the optimizer assumes an even distribution of all values in `COLUMN_Y`, which means it assumes that all values have the same frequency. The estimate is okay if this is approximately true. But if some values appear much more often than others (for example, if `'VALUE_Z'` appears 900,000 times, that is, in 90% of all rows), the optimizer cannot take this into account and may choose a sub-optimal access plan. This is where distribution statistics come into play. Distribution statistics provide the optimizer with information about the frequency and the distribution of data, and they complement the basic statistics in an important way if many duplicates are stored in a database and data is not evenly distributed in the tables.

## Types of distribution statistics - Frequency and quantile statistics

There are different types of distribution statistics -- frequency and quantile statistics. Let's examine the two different types by looking at a sample table.

A sample table, "CARS," representative of an automobile manufacturer, contains a row for each produced car. A car can be identified by its ID, so "ID" is the primary key (PK) of the table "CARS." In addition, there is a column "STATE" that indicates the step of the manufacturing process a car is currently in. The manufacturing process of a car starts with step 1, followed by the steps 2, 3, ..., 49, 50, 51, ..., 98, 99, and ends with step 100 -- step 100 means that a car is completely finished. The rows of completely finished cars also remain in the table and are used by follow-up processes like complaint management, quality assurance, and so on. The automobile manufacturer produces 10 different car models (column "TYPE"). To simplify things, the models are named A, B, C, D, ..., J in the sample table. In addition to the primary key index (on column "ID"), there is an index on column "STATE" ("I_STATE") and an index on column "TYPE" ("I_TYPE"). In reality, a table "CARS" would comprise many more columns than "ID," "STATE," and "TYPE." In the sample table, these other columns are omitted for clarity.

### Frequency statistics

Let's assume the table CARS currently has about 1,000,000 entries and the different models appear with the following frequency:

##### Table 1. Frequency statistics for column TYPE of table CARS

| TYPE | COUNT(TYPE) |
|---|---|
| A | 506135 |
| B | 301985 |
| C | 104105 |
| D | 52492 |
| E | 19584 |
| F | 10123 |
| G | 4876 |
| H | 4589 |
| I | 4403 |
| J | 3727 |

Model A is the car that buyers choose most often; therefore, 50% of all produced cars are of this type. Model A is followed by the models B and C with 30% and 10% of all cars, respectively. The remaining models make up only 10% altogether.

The table shows the exact frequency statistics for column "TYPE." From the basic statistics, the DB2 optimizer only knows that the table contains 1,000,000 rows (table cardinality) and 10 different values (= models), namely A to J. Without distribution statistics, the optimizer assumes that each value appears with the same frequency, which is about 100,000 times. When distribution statistics are generated for the column "TYPE," the optimizer knows the real frequency of each model. Therefore the optimizer is aware of the different frequencies of the available models.

Frequency statistics are used by the optimizer to calculate filter factors for predicates that check for equality or inequality. For example:

`SELECT * FROM CARS WHERE TYPE = 'H'`
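To see how frequency statistics change the estimate for such a predicate, here is a minimal sketch in plain Java. The class and method names are invented for illustration (this is not a DB2 API); it contrasts the filter factor derived from the Table 1 frequencies with the uniform-distribution assumption that applies when only basic statistics exist:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: how a cost-based optimizer could derive the
// filter factor for "TYPE = 'H'" with and without frequency statistics.
public class FrequencyFilterFactor {

    // With frequency statistics: exact frequency of the value / table cardinality.
    static double withFrequencyStats(Map<String, Long> freq, long tableCard, String value) {
        return freq.getOrDefault(value, 0L) / (double) tableCard;
    }

    // Without distribution statistics: assume an even distribution,
    // so every distinct value gets 1 / column cardinality.
    static double uniformAssumption(long columnCard) {
        return 1.0 / columnCard;
    }

    public static void main(String[] args) {
        // Frequencies from Table 1.
        Map<String, Long> typeFreq = new LinkedHashMap<>();
        typeFreq.put("A", 506135L); typeFreq.put("B", 301985L);
        typeFreq.put("C", 104105L); typeFreq.put("D", 52492L);
        typeFreq.put("E", 19584L);  typeFreq.put("F", 10123L);
        typeFreq.put("G", 4876L);   typeFreq.put("H", 4589L);
        typeFreq.put("I", 4403L);   typeFreq.put("J", 3727L);
        long tableCard = typeFreq.values().stream().mapToLong(Long::longValue).sum();

        System.out.printf("uniform assumption: %.5f%n", uniformAssumption(10));
        System.out.printf("frequency stats:    %.5f%n",
                withFrequencyStats(typeFreq, tableCard, "H"));
    }
}
```

With the Table 1 frequencies, the optimizer would estimate a selectivity of roughly 0.45% for model H instead of the 10% implied by a uniform distribution -- a difference of more than a factor of 20.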

### Quantile statistics

In contrast to frequency statistics, quantile statistics are not about the frequency of individual values but about how many rows of a table lie below or above a certain value (or how many rows lie between two values). Quantile statistics tell the optimizer whether the values of a column are clustered or not. To get this information, DB2 assumes an ascending sort order of a column's values and determines the corresponding values at regular row intervals.

Let's look at the column "STATE" of the table CARS in ascending sort order. In regular row intervals, the corresponding values of "STATE" are determined.

##### Table 2. Quantile statistics for column STATE of table CARS

| COUNT(row) | STATE ASC |
|---|---|
| 5479 | 1 |
| 54948 | 10 |
| 109990 | 21 |
| 159885 | 31 |
| 215050 | 42 |
| 265251 | 52 |
| 320167 | 63 |
| 370057 | 73 |
| 424872 | 84 |
| 475087 | 94 |
| 504298 | 100 |
| ... | 100 |
| 1012019 | 100 |

Because completely finished cars are not deleted from the table, the number of cars in state 100 (= completely finished) is higher than the number of cars in all other states. The completely finished cars make up 50% of all entries in the table CARS.

**Note:** In reality, the number of completely finished cars would be
even higher (for example, over 99%). We'll look at this case later in a
concrete sample.

The table shows the quantile statistics for the column "STATE." With the information about how many rows lie below and above the determined values, the optimizer can calculate filter factors for predicates that test for less than (or equal), greater than (or equal), or between. For example:

`SELECT * FROM CARS WHERE STATE < 100`

`SELECT * FROM CARS WHERE STATE BETWEEN 50 AND 70`

The filter factors calculated based on the available quantile statistics are not exact, but the estimate error is below 5%, even when only 20 values are collected.
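The way quantile statistics feed such a range estimate can be sketched as follows. This is a deliberately simplified model with invented names (DB2's actual estimation also interpolates within an interval): given the (value, cumulative row count) pairs from Table 2, the number of rows below some limit is approximated from the nearest recorded quantile:

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: estimating the row count for "STATE < limit" from
// quantile statistics. Each map entry means: "count" rows have a value
// less than or equal to the key (the VALCOUNT semantics for quantiles).
public class QuantileEstimate {

    static long estimateRowsBelow(TreeMap<Integer, Long> quantiles, int limit) {
        // Take the cumulative count of the largest recorded value strictly
        // below the limit; 0 if the limit is below all recorded values.
        Map.Entry<Integer, Long> entry = quantiles.lowerEntry(limit);
        return entry == null ? 0L : entry.getValue();
    }

    public static void main(String[] args) {
        // Quantile pairs from Table 2.
        TreeMap<Integer, Long> state = new TreeMap<>();
        state.put(1, 5479L);    state.put(10, 54948L);  state.put(21, 109990L);
        state.put(31, 159885L); state.put(42, 215050L); state.put(52, 265251L);
        state.put(63, 320167L); state.put(73, 370057L); state.put(84, 424872L);
        state.put(94, 475087L); state.put(100, 1012019L);

        // Estimate for the predicate STATE < 100.
        System.out.println(estimateRowsBelow(state, 100)); // 475087
    }
}
```

Under a uniform assumption over the states 1 to 100, the optimizer would expect about 99% of the rows to satisfy `STATE < 100`; the quantile statistics reveal that it is only about half of the table (475,087 of 1,012,019 rows).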

## Usage of distribution statistics by the DB2 optimizer - A sample

Let's look at a complete sample where the DB2 optimizer can use distribution statistics to better estimate filter factors that lead to a better access plan.

The sample query reads data from the table CARS that has already been defined. For the automobiles' data in the table CARS, the following assumptions apply:

- The table cardinality is 1,000,000, which means the table consists of 1,000,000 rows
- 99.9% of all cars (that is, 999,000) in the table are completely finished (column "STATE" = 100), as the information about these cars has to be retained for follow-on processes (complaint management, quality assurance, and so on). The remaining 1,000 cars are currently going through the manufacturing process.
- The 10 different car models (column "TYPE") A to J offered by the manufacturer appear with nearly the same frequency in the table.

**Note:** The script create_table_cars.sql, which you can download from this article, creates such a table CARS, including indexes on the columns "STATE" and "TYPE." The results of this sample can be reproduced with this table. The sample table was created within the DB2 SAMPLE database (command `db2sampl`) using default settings for DBM CFG and DB CFG.

The sample query selects all cars of model A that are currently going through the manufacturing process:

`SELECT * FROM CARS WHERE STATE < 100 AND TYPE = 'A'`

First, analyze the access plan that the optimizer chooses when no distribution statistics are available, that is, when only basic statistics exist for the table CARS and its indexes.

##### Figure 1. Access plan for sample query without distribution statistics

As the optimizer is not aware of the uneven distribution of the STATE values, it decides to use the index I_TYPE. Usage of this index is costly because about 100,000 RIDs (record IDs) have to be read from the index before the table CARS is accessed. In addition, the assumption regarding the number of rows returned by the query is wrong. Because the optimizer assumes that all manufacturing steps (1 to 100) have the same frequency, it does not expect the predicate `STATE < 100` to filter out a significant number of rows. But as you know, it does, because only 1,000 of all the 1,000,000 cars are currently going through the manufacturing process.

When the query is executed without distribution statistics, a snapshot for dynamic SQL returns the following monitor values (given that the required monitor switches are active):

##### Listing 1. Snapshot for sample query without distribution statistics

```
Number of executions               = 1
Number of compilations             = 1
Worst preparation time (ms)        = 9
Best preparation time (ms)         = 9
Internal rows deleted              = 0
Internal rows inserted             = 0
Rows read                          = 99336
Internal rows updated              = 0
Rows written                       = 0
Statement sorts                    = 0
Statement sort overflows           = 0
Total sort time                    = 0
Buffer pool data logical reads     = 8701
Buffer pool data physical reads    = 8131
Buffer pool temporary data logical reads  = 0
Buffer pool temporary data physical reads = 0
Buffer pool index logical reads    = 165
Buffer pool index physical reads   = 155
Buffer pool temporary index logical reads  = 0
Buffer pool temporary index physical reads = 0
Total execution time (sec.ms)      = 0.530903
Total user cpu time (sec.ms)       = 0.280403
Total system cpu time (sec.ms)     = 0.230332
Statement text                     = SELECT ID, TYPE, STATE FROM SAMPLE.CARS
                                     WHERE STATE < 100 AND TYPE = 'A'
```

At the moment, we will not further analyze these values, but remember them to compare them later with the monitor values for the same query executed when distribution statistics are available.

Next, generate distribution statistics for the table CARS and execute the query again. In this case, the optimizer chooses the following access plan:

##### Figure 2. Access plan for sample query with distribution statistics

The costs for this access plan are clearly below the costs of the access plan without distribution statistics: 203.809 instead of 3242.63. This is because the optimizer now knows that the predicate `STATE < 100` has a high filter factor and returns only the roughly 1,000 cars currently going through the manufacturing process. So, in this case, the table CARS is accessed not through the index `I_TYPE` but through the index `I_STATE`. Besides, the total number of rows in the result set is now estimated correctly: 1,000 cars have not been finished yet, and the different models appear with the same frequency. This leads to a result set that consists of about 100 rows.

The access plan with distribution statistics is better than the access plan without distribution statistics. But does this have an impact on the query's execution time, too? Listing 2 contains the corresponding snapshot monitor data:

##### Listing 2. Snapshot for sample query with distribution statistics

```
Number of executions               = 1
Number of compilations             = 1
Worst preparation time (ms)        = 9
Best preparation time (ms)         = 9
Internal rows deleted              = 0
Internal rows inserted             = 0
Rows read                          = 1000
Internal rows updated              = 0
Rows written                       = 0
Statement sorts                    = 1
Statement sort overflows           = 0
Total sort time                    = 5
Buffer pool data logical reads     = 11
Buffer pool data physical reads    = 10
Buffer pool temporary data logical reads  = 0
Buffer pool temporary data physical reads = 0
Buffer pool index logical reads    = 12
Buffer pool index physical reads   = 9
Buffer pool temporary index logical reads  = 0
Buffer pool temporary index physical reads = 0
Total execution time (sec.ms)      = 0.014597
Total user cpu time (sec.ms)       = 0.000000
Total system cpu time (sec.ms)     = 0.010014
Statement text                     = SELECT ID, TYPE, STATE FROM SAMPLE.CARS
                                     WHERE STATE < 100 AND TYPE = 'A'
```

Table 3 compares the snapshot monitor values for the query execution with and without distribution statistics:

##### Table 3. Comparison of snapshot monitor values

| Snapshot value | Without distribution statistics | With distribution statistics |
|---|---|---|
| Rows read | 99,336 | 1,000 |
| Buffer pool data logical reads | 8,701 | 11 |
| Buffer pool index logical reads | 165 | 12 |
| Total execution time (sec.ms) | 0.530903 | 0.014597 |

You can see that DB2 has to evaluate far fewer rows to execute the query when distribution statistics are available. This has a positive effect on the CPU costs as well as on the I/O costs. Most important, however, is the total execution time, because it determines the response time of the application: 0.014597 seconds with distribution statistics versus 0.530903 seconds without them is a difference of a factor of 36.

At 0.014597 versus 0.530903 seconds in our sample, a user cannot perceive the difference, as both values lie in the sub-second range. Nevertheless, the difference should not be ignored. If more complex queries are executed, or if several queries are executed in sequence, the difference in execution time is not just sub-seconds but seconds or even minutes.

## Generation of distribution statistics

As already mentioned, distribution statistics are *not* always collected when statistics are generated using the `RUNSTATS` command. This makes sense because distribution statistics are important in the case of many duplicates or unevenly distributed data, but they do not provide much benefit otherwise.

The following `RUNSTATS` command collects only the basic statistics for the table CARS (in schema SAMPLE) and the corresponding indexes:

`RUNSTATS ON TABLE SAMPLE.CARS AND INDEXES ALL`

If, in addition, distribution statistics (frequency and quantile statistics) should be collected for all columns of the table CARS, execute the following command:

`RUNSTATS ON TABLE SAMPLE.CARS WITH DISTRIBUTION AND INDEXES ALL`

Generating distribution statistics means significant additional work for DB2 and, therefore, influences the execution time of the `RUNSTATS` command. So restrict generating distribution statistics to those columns that require them:

`RUNSTATS ON TABLE SAMPLE.CARS WITH DISTRIBUTION ON COLUMNS (TYPE, STATE) AND INDEXES ALL`

Distribution statistics should be collected for columns satisfying the following conditions:

- The column has many duplicates (frequency statistics), or the values of the column are not evenly distributed, which means they are clustered in certain areas (quantile statistics)
- The column is used in predicates that check for equality or inequality (frequency statistics), or the column is used in predicates that check for less than (or equal), greater than (or equal), or between (quantile statistics)

In the case of frequency statistics, it is important to define for how many values the number of duplicates should be gathered; it would be too costly to do so for all values of a column. If no explicit number is defined when executing `RUNSTATS`, DB2 uses the default provided by the database parameter `NUM_FREQVALUES`. Because the default for `NUM_FREQVALUES` is 10, DB2 collects the number of duplicates for the 10 values that appear most often in the column, given that `RUNSTATS` is executed without an explicit number and the database parameter `NUM_FREQVALUES` has not been altered.

Similar to the frequency statistics, a number defining the exactness has to be specified for the quantile statistics. In this case, it defines how many "measurements" should be used: the values of a column are considered to be in ascending sort order, and the corresponding values are determined at regular row intervals. The more measurements are used, the more accurate the optimizer's estimates of the filter factors for predicates that check for ranges (`<`, `>`, `<=`, `>=`, `BETWEEN`). If no explicit number is defined when executing `RUNSTATS`, DB2 uses the default provided by the database parameter `NUM_QUANTILES`. The default for `NUM_QUANTILES` is 20, which means 20 measurements are used. This is already a good value because it guarantees that the optimizer, using the quantile statistics, has a maximum estimation error of 5% for the determined filter factors.

If the values for `NUM_FREQVALUES` and `NUM_QUANTILES` should not be taken from the database configuration (`DB CFG`), they can be explicitly defined when executing `RUNSTATS`:

`RUNSTATS ON TABLE SAMPLE.CARS WITH DISTRIBUTION ON COLUMNS (TYPE NUM_FREQVALUES 10 NUM_QUANTILES 20, STATE NUM_FREQVALUES 15 NUM_QUANTILES 30) AND INDEXES ALL`

## How to check if distribution statistics exist

To check whether distribution statistics have been collected for a certain table, look at the contents of the catalog view `SYSCAT.COLDIST`:

`SELECT * FROM SYSCAT.COLDIST WHERE TABSCHEMA = 'SAMPLE' AND TABNAME = 'CARS'`

The view `SYSCAT.COLDIST` has the following structure:

##### Table 4. Structure of SYSCAT.COLDIST

| Column name | Data type | Nullable | Description |
|---|---|---|---|
| TABSCHEMA | VARCHAR(128) | No | Qualifier of the table to which this entry applies |
| TABNAME | VARCHAR(128) | No | Name of the table to which this entry applies |
| COLNAME | VARCHAR(128) | No | Name of the column to which this entry applies |
| TYPE | CHAR(1) | No | F = frequency (most frequent value); Q = quantile value |
| SEQNO | SMALLINT | No | If TYPE = F, N in this column identifies the Nth most frequent value; if TYPE = Q, N identifies the Nth quantile value |
| COLVALUE | VARCHAR(254) | Yes | The data value, as a character literal or a NULL value |
| VALCOUNT | BIGINT | No | If TYPE = F, VALCOUNT is the number of occurrences of COLVALUE in the column; if TYPE = Q, VALCOUNT is the number of rows whose value is less than or equal to COLVALUE |
| DISTCOUNT | BIGINT | Yes | If TYPE = Q, this column records the number of distinct values that are less than or equal to COLVALUE (NULL if unavailable) |

`SYSCAT.COLDIST` contains entries for a certain table only if distribution statistics have been collected for at least one of its columns. If `RUNSTATS` is later executed again without `WITH DISTRIBUTION`, the corresponding entries of the table in `SYSCAT.COLDIST` are deleted.

## Distribution statistics and parameter markers / host variables

JDBC offers two possibilities to execute a dynamic SQL statement and, therefore, provides two different interfaces:

- `java.sql.Statement`
- `java.sql.PreparedStatement`

`PreparedStatement` is a sub-interface of `Statement` and, in contrast to `Statement`, allows the usage of parameter markers (= placeholders; in other programming languages, such placeholders are called host variables). In the case of a `PreparedStatement`, the SQL statement to execute -- parameter markers included -- is first compiled, then values are bound to the parameter markers, and finally the SQL statement is executed.

The following code snippets show the differences when using a `Statement` and a `PreparedStatement`.

##### Listing 3. Executing dynamic SQL statements using the JDBC Statement interface

```java
java.sql.Connection con = ...;
java.sql.Statement stmt1 = con.createStatement();
String insert1 = "INSERT INTO TABLE_X (COL_Y) VALUES ('ABC')";
stmt1.executeUpdate(insert1);
java.sql.Statement stmt2 = con.createStatement();
String insert2 = "INSERT INTO TABLE_X (COL_Y) VALUES ('XYZ')";
stmt2.executeUpdate(insert2);
con.commit();
```

##### Listing 4. Executing dynamic SQL statements using the JDBC PreparedStatement interface

```java
java.sql.Connection con = ...;
String insert = "INSERT INTO TABLE_X (COL_Y) VALUES (?)";
java.sql.PreparedStatement pstmt = con.prepareStatement(insert);
pstmt.setString(1, "ABC");
pstmt.executeUpdate();
pstmt.setString(1, "XYZ");
pstmt.executeUpdate();
con.commit();
```

If a simple SQL statement is executed multiple times (like the `INSERT` statement in the sample), a `PreparedStatement` has the advantage that it has to be compiled by the database only once; it can then be executed multiple times without re-compilation. Given that in the sample several thousand rows should be inserted, using a `PreparedStatement` may provide a shorter execution time because the prepare/compile time is spent only once, instead of thousands of times.

However, Java developers often use `PreparedStatement`s because coding is more elegant when values/filter criteria are provided at run time. Look at the following code snippets:

##### Listing 5. Filling in filter criteria using string concatenation

```java
int state = 100;
String type = "A";
...
java.sql.Connection con = ...;
java.sql.Statement stmt = con.createStatement();
String select = "SELECT * FROM CARS WHERE STATE < " + state
        + " AND TYPE = '" + type + "'";
java.sql.ResultSet rs = stmt.executeQuery(select);
while (rs.next()) {
    ...
}
```

##### Listing 6. Filling in filter criteria using parameter markers

```java
int state = 100;
String type = "A";
...
java.sql.Connection con = ...;
String select = "SELECT * FROM CARS WHERE STATE < ? AND TYPE = ?";
java.sql.PreparedStatement pstmt = con.prepareStatement(select);
pstmt.setInt(1, state);
pstmt.setString(2, type);
java.sql.ResultSet rs = pstmt.executeQuery();
while (rs.next()) {
    ...
}
```

The variant with the `PreparedStatement` is more elegant regarding the coding, as no string concatenation is required to put the values for `STATE` and `TYPE` into the SQL statement. But a disadvantage in this case is that the compilation of the `SELECT` (the creation of the access plan) occurs before the values for the predicates in the `WHERE` clause are bound. Predicates with concrete values are important for the optimizer to be able to use available distribution statistics.

Let's look again at our sample query for the table CARS, but with parameter markers instead of concrete values:

`SELECT * FROM CARS WHERE STATE < ? AND TYPE = ?`

Distribution statistics for the table CARS are still available, as they have been collected in the meantime. Nevertheless, another access plan is generated when parameter markers are used.

##### Figure 3. Access plan for sample query with parameter markers and distribution statistics

Although the access plan is better than the access plan with concrete values without distribution statistics, it is worse than the access plan with concrete values with distribution statistics. This can also be seen in the monitor output from a snapshot for dynamic SQL, as shown in Listing 7:

##### Listing 7. Snapshot for sample query with parameter markers and distribution statistics

```
Number of executions               = 1
Number of compilations             = 1
Worst preparation time (ms)        = 10
Best preparation time (ms)         = 10
Internal rows deleted              = 0
Internal rows inserted             = 0
Rows read                          = 99
Internal rows updated              = 0
Rows written                       = 0
Statement sorts                    = 1
Statement sort overflows           = 0
Total sort time                    = 0
Buffer pool data logical reads     = 11
Buffer pool data physical reads    = 10
Buffer pool temporary data logical reads  = 0
Buffer pool temporary data physical reads = 0
Buffer pool index logical reads    = 268
Buffer pool index physical reads   = 125
Buffer pool temporary index logical reads  = 0
Buffer pool temporary index physical reads = 0
Total execution time (sec.ms)      = 0.081908
Total user cpu time (sec.ms)       = 0.060086
Total system cpu time (sec.ms)     = 0.010014
Statement text                     = SELECT ID, TYPE, STATE FROM SAMPLE.CARS
                                     WHERE STATE < ? AND TYPE = ?
```

When comparing these values with the ones gathered before, the following table results:

##### Table 5. Comparison of snapshot monitor values continued

| Snapshot value | Without distribution statistics | With distribution statistics | With distribution statistics and parameter markers |
|---|---|---|---|
| Rows read | 99,336 | 1,000 | 99 |
| Buffer pool data logical reads | 8,701 | 11 | 11 |
| Buffer pool index logical reads | 165 | 12 | 268 |
| Total execution time (sec.ms) | 0.530903 | 0.014597 | 0.081908 |

The value for "Rows read" is the lowest with parameter markers, but this is not really important, as the "Buffer pool index logical reads" go up in this case. What matters is that the "Total execution time" is worse than in the case with distribution statistics and concrete values -- by a factor of about six in the sample (0.081908 versus 0.014597 seconds). Keep in mind that the sample is a very simple query operating on a small amount of data, so the differences in execution time seem irrelevant. But in the case of more complex queries with execution times of several seconds or minutes, the differences become obvious.

The sample shows that distribution statistics are of limited use in conjunction with parameter markers. In the case of many duplicates or uneven data distribution and complex queries, this can lead to worse execution times than without parameter markers. This does not mean that using `PreparedStatement`s is always bad. On the contrary: a `PreparedStatement` is recommended when a simple statement is executed many times with different values, because the statement has to be compiled only once. In the case of complex queries, as they are common in reporting systems and DSS (decision support systems), the usage of parameter markers is likely to come with disadvantages: as you have seen, distribution statistics cannot be fully exploited, and the compile-once advantage does not pay off because such queries are typically not executed more than once.

## Summary

In this article, you have taken a look at the use of distribution statistics by the DB2 optimizer. If a database contains many duplicate values or data is not evenly distributed, the DB2 optimizer can produce better access plans for SQL statements when distribution statistics are generated in addition to the basic statistics. The article explains the different kinds of distribution statistics, shows how these statistics can be generated using the `RUNSTATS` command, and contains a sample scenario that illustrates the importance of distribution statistics in certain cases. You have also examined how the use of parameter markers/host variables may limit the consideration of distribution statistics by the DB2 optimizer.

#### Downloadable resources

- PDF of this content
- DDL for sample table CARS (create_table_cars.zip | 1KB)

#### Related topics

- Administration Guide: Performance: This DB2 UDB documentation provides additional information concerning database statistics (specifically "Chapter 5. System catalog statistics").
- DB2 Information Center: Find the complete DB2 UDB documentation in HTML format.
- Build your next development project with IBM trial software, available for download directly from developerWorks.