IBM Support

Distribution statistics for Federated tables.

Technical Blog Post


Abstract

Distribution statistics for Federated tables.

Body

Statistical data for DB2 Federated tables or Nicknames is gathered using the SYSPROC.NNSTATS procedure, documented here for V10.5 of DB2 :

http://www.ibm.com/support/knowledgecenter/en/SS2K5T_10.5.0/com.ibm.swg.im.iis.fed.query.doc/topics/tfpsts03.html

The statistics that are gathered for nicknames is not the same as for regular DB2 tables.

See also : http://www.ibm.com/support/knowledgecenter/en/SS2K5T_10.5.0/com.ibm.swg.im.iis.fed.query.doc/topics/cfpsts01.html

But the statistics most notably do not contain distribution information ( quantiles and frequency values ).

These distribution statistics can be valuable for the DB2 Query Compiler when it needs to make estimates on the filtering factor for a predicate.
e.g. If the query compiler knows that value 10 for column A occurs 100 times in a table TAB1 then it knows that predicate "A = 10" will match exactly 10 rows.


If these frequency statistics are not available, then the filtering effects of a predicate are estimated based on the number of unique values that exist for a particular column.
This is called the "colcard"  value for a column.

The estimate would then be

 

#rows in the table  ( 1/colcard for that column )


If the distribution of the data is fairly even, then this estimate is good, but if this is not the case, then the estimate can be incorrect by a significant amount.
An incorrect estimate can then lead to suboptimal execution plan choices and bad query performance.
Hence, there are many scenarios where there is a need for distribution statistics.


The current functionality of the NNSTATS procedure does not collect distribution statistics, but we can exploit the use of statistical views over nicknames to work around this limitation.
There is a restriction here as well in that the statistical view requires a predicate to be applied which also appears in the SQL statement using the statistical view.

Let's illustrate this behaviour using a small example.

First let's set up the objects, creating a nickname for a table with just 1 unique value for column col4.

 

connect to source;
create table source ( col1 int , col2 int , col3 int, col4 int );
!perl -e 'for($i=0; $i<1000; $i++) { printf("%d,%d,%d,10\n", $i % 10 ,$i % 100 ,$i )  }' > ins.del ;
import from ins.del of del insert into source;
connect reset;

connect to target;
CREATE NICKNAME NN1 FOR "FEDSERV"."BENNYV"."SOURCE";
explain plan for select * from NN1 where col4 = 10;

db2exfmt -d target -1 -o step1.exfmt

 

But I am using a shortcut SQL statement here to see what the estimated number of rows are at the top of the plan.

 

with latest as ( select max( EXPLAIN_TIME ) max_explain_time

                        from bennyv.explain_instance )

      select o.operator_type,
             o.OPERATOR_ID,
             s.stream_count ,
             substr(t.statement_text,1,30)

from

  bennyv.explain_operator o ,

  bennyv.explain_stream s,
     bennyv.explain_statement t,
     latest
where
s.EXPLAIN_TIME = latest.max_explain_time
and s.source_type = 'O'
and s.source_id = o.OPERATOR_ID
and s.EXPLAIN_TIME = o.EXPLAIN_TIME
and s.target_type = 'O'
and s.target_id = 1
and t.EXPLAIN_TIME = o.EXPLAIN_TIME
and t.explain_level = 'O'
;

This yields :

OPERATOR_TYPE OPERATOR_ID STREAM_COUNT             4                             
------------- ----------- ------------------------ ------------------------------
SHIP                    2   +4.00000000000000E+001 select * from NN1 where col4 =

 

So the estimate here is 40. No distribution information was available so it took an estimate of a 0.04 filter factor.
Note that in this case, I also did not run nnstats. In this simplified case, the nnstats data collection would have seen the colcard of 1 and
would have made a good estimate regardless.

If I now create a statistical view for this as follows :

 

create or replace view sv1 as ( SELECT * FROM BENNYV.NN1 );
alter view sv1 enable query optimization;
runstats on view sv1 with distribution;

 

$ db2 "select substr(colvalue,1,10) value, substr(valcount,1,10) number  from sysstat.coldist where tabname = 'SV1' and type = 'F' and colname = 'COL4' and colvalue <> -1"

VALUE      NUMBER    
---------- ----------
10         1000      

  1 record(s) selected.

 

With this select we know that the system catalog holds the right information.

However when repeating the explain SQL, we still get the same estimate of 40 rows.


The reason is that the SQL statement is not matching the statistical view in this case.
For Nickname statistical view matching, at least one predicate in the SQL statement needs to match a predicate in the statistical view.

For that purpose we can introduce a predicate in the statistical view :

 

create or replace view sv2 as ( SELECT * FROM BENNYV.NN1 where col1 < col1 + 1  );

 

The predicate will always be true, so the result set is not altered.
After enabling it for query optimization and collecting statistics, we have the same info as the first statistical view.

If we now add the superfluous predicate to the statement :

explain plan for select * from NN1 where col4 = 10 and col1 < col1 + 1 ;

 

We do get a match for the statistical view and the result set estimate is adjusted.

 

OPERATOR_TYPE OPERATOR_ID STREAM_COUNT             4                             
------------- ----------- ------------------------ ------------------------------
SHIP                    2   +1.00000000000000E+003 select * from NN1 where col4 =

 

The alternative here would have been to keep the original query, but add the col4 = 10 predicate to the statistical view.
This would also work, but I wanted to keep the statistical view as generic as possible so it could be applied to other statements.

Another alternative is to create a new view on top of the nickname which includes the superfluous predicate e.g.

 

create or replace view NN1_VIEW as ( SELECT * FROM BENNYV.NN1 where col1 < col1 + 1  );


Then DB2 would also pick up the information from SV2 if we run :

 

explain plan for select * from NN1_VIEW where col4 = 10 ;

 

In fact, in order to really avoid rewriting any SQL statement: renaming NN1 to  NN1_TABLE

and NN1_VIEW to NN1 would still pick up the benefits of the distribution statistics from the statistical view.

 

 

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm13285879