Column group statistics
If your query has more than one join predicate joining two tables, the Db2® optimizer calculates how selective each of the predicates is before choosing a plan for executing the query.
SELECT PRODUCT.NAME, RAWMATERIAL.QUALITY
FROM PRODUCT, RAWMATERIAL
WHERE
PRODUCT.COLOR = RAWMATERIAL.COLOR AND
PRODUCT.ELASTICITY = RAWMATERIAL.ELASTICITY
This query
returns the names and raw material quality of all products. There
are two join predicates:PRODUCT.COLOR = RAWMATERIAL.COLOR
PRODUCT.ELASTICITY = RAWMATERIAL.ELASTICITY
The optimizer
assumes that the two predicates are independent, which means that
all variations of elasticity occur for each color. It then estimates
the overall selectivity of the pair of predicates by using catalog
statistics information for each table based on the number of levels
of elasticity and the number of different colors. Based on this estimate,
it might choose, for example, a nested loop join in preference to
a merge join, or the reverse.However, these two predicates might not be independent. For example, highly elastic materials might be available in only a few colors, and the very inelastic materials might be available in a few other colors that are different from the elastic ones. In that case, the combined selectivity of the predicates eliminates fewer rows and the query returns more rows. Without this information, the optimizer might no longer choose the best plan.
RUNSTATS ON TABLE PRODUCT ON COLUMNS ((COLOR, ELASTICITY))
The
optimizer uses these statistics to detect cases of correlation and
to dynamically adjust the combined selectivities of correlated predicates,
thus obtaining a more accurate estimate of the join size and cost.When a query groups data by using keywords such as GROUP BY or DISTINCT, column group statistics also enable the optimizer to compute the number of distinct groupings.
SELECT DEPTNO, YEARS, AVG(SALARY)
FROM EMPLOYEE
GROUP BY DEPTNO, MGR, YEAR_HIRED
Without any index or
column group statistics, the optimizer estimates the number of groupings
(and, in this case, the number of rows returned) as the product of
the number of distinct values in DEPTNO, MGR, and YEAR_HIRED. This
estimate assumes that the grouping key columns are independent. However,
this assumption could be incorrect if each manager manages exactly
one department. Moreover, it is unlikely that each department hires
employees every year. Thus, the product of distinct values of DEPTNO,
MGR, and YEAR_HIRED could be an overestimate of the actual number
of distinct groups.RUNSTATS ON TABLE EMPLOYEE ON COLUMNS ((DEPTNO, MGR, YEAR_HIRED))
In
addition to JOIN predicate correlation, the optimizer manages correlation
with simple equality predicates, such as:DEPTNO = 'Sales' AND MGR = 'John'
In
this example, predicates on the DEPTNO column in the EMPLOYEE table
are likely to be independent of predicates on the YEAR column. However,
the predicates on DEPTNO and MGR are not independent, because each
department would usually be managed by one manager at a time. The
optimizer uses statistical information about columns to determine
the combined number of distinct values and then adjusts the cardinality
estimate to account for correlation between columns.Column group statistics can also be used on statistical views. The column group statistics help adjust the skewed statistics in the statistical view when there is more than one strong correlation in the queries. The optimizer can use these statistics to obtain better cardinality estimates which might result in better access plans.