Choosing partitioning keys in DB2 Database Partitioning Feature environments

Choosing proper partitioning keys is important for optimal query performance in IBM® DB2® Enterprise Server Edition for Linux®, UNIX®, and Windows® environments with the Database Partitioning Feature (DPF). To help with this task, this article provides new routines to estimate data skews for existing and new partitioning keys. The article also details best practices and shows how to change a partitioning key while keeping the table accessible.


Maksym Petrenko (maksymp@ca.ibm.com), DB2 Warehouse Integration Specialist, IBM

Maksym Petrenko joined the IBM Toronto Lab in 2001 and has since worked exclusively with DB2 software. He has been a developer, technical support analyst, lab services consultant, beta enabler, and data warehouse integrator. His experience includes supporting clients with installation, configuration, application development, and performance issues related to DB2 databases on Windows, Linux, and UNIX platforms. Maksym is a certified DB2 Advanced Database Administrator and DB2 Application Developer.



Paul McInerney (paulmci@ca.ibm.com), User-Centered Design Specialist, DB2 Development, IBM

Paul McInerney has conducted extensive fact-finding interviews with data warehouse customers to learn about their use of DB2 features, including features covered in this article.



Enda McCallig (mccallen@ie.ibm.com), DB2 QA Specialist, IBM

Enda McCallig is a DB2 Data Warehouse QA Specialist in Information Management at the IBM Dublin Lab. He joined IBM in 2009, having previously worked as a database consultant in the banking sector. Since 1997, Enda has worked on database development, design, and administration on a variety of platforms. His work currently focuses on testing warehouse features in very large data warehouse environments.



04 February 2013 (First published 13 May 2010)

Also available in Russian

Introduction

This article assumes the following:

  • You have a DB2 DPF environment and are familiar with DB2 DPF concepts.
  • You are either designing a new table that will be hash-partitioned, or you have an existing hash-partitioned table that might have a data skew problem.

This article helps you to accomplish the following tasks:

  • Choose the right initial partitioning key (PK) prior to defining and populating a table
  • Evaluate the quality of the existing PK on a table
  • Evaluate the quality of candidate replacement PKs on an existing table
  • Change the PK while keeping the table online

This article provides the following types of help:

  • Review of concepts and considerations
  • Design guidelines
  • New routines to estimate data skews for existing and new partitioning keys

Quick review of hash partitioning

In DPF environments, large tables are partitioned across multiple database partitions. There are several ways to partition a table, but the focus of this article is distribution by hash. For other ways to partition the table, refer to the article in the Resources section.

Distribution by hash is based on partitioning keys. A partitioning key consists of one or more columns defined at table creation. For each newly inserted record, the partitioning key determines on which database partition this record should be stored. The placement is determined by an internal hashing function that takes the values in the column or columns defined as a partitioning key and returns the database partition number. A hashing function is a deterministic function, which means that for the same partitioning key values it always generates the same partitioning placement, assuming that there are no changes to the database partition group definition.
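The determinism of hash placement can be sketched in a few lines of Python. This is purely illustrative: DB2's actual hashing function and distribution map are internal, and `partition_for` is a hypothetical stand-in.

```python
# Illustrative sketch only: DB2's real hashing function and distribution map
# are internal. The point is determinism: the same partitioning key values
# always map to the same database partition.
def partition_for(key_values, partitions):
    return partitions[hash(key_values) % len(partitions)]

partitions = [1, 2, 3, 4]  # four database partitions, as in a typical DPF group
p1 = partition_for((10001,), partitions)
p2 = partition_for((10001,), partitions)
assert p1 == p2  # identical key values, identical placement
```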

The following syntax examples demonstrate the steps necessary to create a hash-partitioned table:

  1. Create a database partition group that specifies the database partitions that will participate in the partitioning. The following example illustrates how to create a database partition group pdpg1 on database partitions 1, 2, 3, and 4:

    CREATE DATABASE PARTITION GROUP pdpg1 ON DBPARTITIONNUMS(1 to 4)

    According to IBM Smart Analytics System and IBM InfoSphere™ Balanced Warehouse best practices, hash-partitioned tables should not be created on the coordinator or administration partition (database partition 0). Database partition 0 is typically used for storing small, non-partitioned lookup tables.

  2. Create the table space in the database partition group. All objects created in this table space will be partitioned across the database partitions specified in the database partition group definition:

    CREATE TABLESPACE tbsp1 IN pdpg1 ...
  3. Create the table in the table space. At this point, the definition of the table is tied to the definition of the database partition group. The only way to change this relationship is to drop the table and recreate it in a different table space that is tied to a different database partition group.

    In the following example, Table1 is created on database partitions 1, 2, 3, and 4, and is distributed based on a partitioning key on column COL1:


    CREATE TABLE table1(col1 INTEGER NOT NULL, col2 SMALLINT NOT NULL, col3 CHAR(10),
      PRIMARY KEY (col1,col2) ) IN tbsp1 DISTRIBUTE BY HASH (col1)

Keep in mind that the database partition group definition can change. For example, new database partitions can be added. If this happens, a hash-partitioned table defined prior to the modification will not take advantage of the new partitions until the data is redistributed using the REDISTRIBUTE DATABASE PARTITION GROUP command.


Define the partitioning key

The partitioning key is defined using the DISTRIBUTE BY HASH clause of the CREATE TABLE statement. After the partitioning key is defined, it cannot be altered. The only way to change it is to recreate the table.

The following rules and recommendations apply to the partitioning key definition:

  • The primary key and any unique index of the table must be a superset of the associated partitioning key. In other words, all columns that are part of the partitioning key must be present in the primary key or unique index definition. The order of the columns does not matter.
  • A partitioning key should include one to three columns. Typically, the fewer the columns, the better.
  • An integer partitioning key is more efficient than a character key, which is more efficient than a decimal key.
  • If there is no partitioning key provided explicitly in the CREATE TABLE command, the following defaults are used:
    • If a primary key is specified in the CREATE TABLE statement, the first column of the primary key is used as the distribution key.
    • If there is no primary key, the first column that is not a long field is used.

Why choosing the right partitioning key is important

Choosing the right partitioning key is critical for two reasons:

  • It improves the performance of queries against hash-partitioned tables, particularly by enabling collocated joins
  • It balances data volumes, and therefore storage requirements, across all database partitions

Data balancing

Data balancing refers to the relative number of records stored on each individual database partition. Ideally, each database partition in a hash-partitioned table should hold the same number of records. If records are stored unequally across the database partitions, it can result in disproportionate storage requirements and performance problems. The performance problems in this scenario result from the fact that the query work is done independently on each database partition, but the results are consolidated by the coordinating agent, which must wait until all database partitions return a result set. In other words, the total performance is tied to the performance of the slowest database partition.
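Because the coordinating agent waits for every partition, elapsed query time tracks the maximum, not the average, of the per-partition times. A small sketch with hypothetical timings:

```python
# Hypothetical per-partition elapsed times (seconds) for one query.
partition_times = {1: 12.0, 2: 11.5, 3: 19.8, 4: 12.2}

# The partitions work in parallel, but the coordinator must wait for all of
# them, so the query's elapsed time tracks the slowest partition.
elapsed = max(partition_times.values())                          # 19.8
average = sum(partition_times.values()) / len(partition_times)   # 13.875
```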

Table data skew refers to a difference between the number of records in a table on particular database partitions and the average number of records across all database partitions for this table. So, for example, if the table data skew on database partition 1 is 60% for a particular table, it means that this database partition contains 60% more rows from this table than the average database partition.

From the best practices perspective, the table data skew on every individual database partition should be no more than 10%. To achieve this goal, the partitioning key should be selected on the columns that have high cardinality, or in other words, that contain a large number of distinct values.
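The skew definition above is straightforward to compute once you have per-partition row counts. A Python sketch with hypothetical counts, applying the 10% best-practice threshold:

```python
# Hypothetical per-partition row counts for one table.
counts = {1: 950_000, 2: 1_000_000, 3: 1_400_000, 4: 650_000}

avg = sum(counts.values()) / len(counts)  # 1,000,000

# Skew per partition: deviation from the average, as a percentage.
skew = {p: (n - avg) / avg * 100 for p, n in counts.items()}

# Partitions exceeding the 10% best-practice threshold.
flagged = sorted(p for p, s in skew.items() if abs(s) > 10)
```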

If your table statistics are up to date, you can quickly and inexpensively check the cardinality of the columns in your existing table by issuing the following statement:

Listing 1. Checking the cardinality of the columns in an existing table
SELECT colname, colcard FROM syscat.columns
WHERE tabname='CUSTOMER' AND tabschema = 'BCULINUX' ORDER BY colno


COLNAME                                                                  COLCARD
------------------------------------------------------------------------ ---------------
C_CUSTOMER_SK                                                                     100272
C_CUSTOMER_ID                                                                      25068
C_CURRENT_CDEMO_SK                                                                 25068
C_CURRENT_HDEMO_SK                                                                  6912
C_CURRENT_ADDR_SK                                                                  19456
C_FIRST_SHIPTO_DATE_SK                                                              3651
C_FIRST_SALES_DATE_SK                                                               3584
... [remainder of the output omitted from this example]

Collocation

Collocation between two joined tables in a query means that the matching rows of the two tables always reside in the same database partition. If the join is not collocated, the database manager must ship the records from one database partition to another over the network, which results in sub-optimal performance. There are certain requirements that must be met for the database manager to use the collocation join:

  • The joined tables must be defined in the same database partition group.
  • The partitioning key for each of the joined tables must match. In other words, they must contain the same number and sequence of columns.
  • For each column in the partitioning key of the joined tables, an equijoin predicate must exist.
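The three requirements above can be illustrated with a small Python simulation. This is not DB2's real hashing, and the tables and key values are hypothetical; the point is that when two tables share a partitioning key definition, rows with equal key values land on the same partition, so an equijoin on that key ships no data.

```python
# Illustrative stand-in for DB2's internal hashing function.
def partition_for(key, n_partitions=4):
    return hash(key) % n_partitions

# Hypothetical rows: orders and customers both partitioned on custkey.
orders = [(101, "O-1"), (202, "O-2"), (303, "O-3")]     # (custkey, order_id)
customers = [(101, "Ann"), (202, "Bob"), (303, "Cho")]  # (custkey, name)

# Every pair matched by the equijoin predicate on the partitioning key
# resides on the same partition, so the join is collocated.
collocated = all(
    partition_for(o[0]) == partition_for(c[0])
    for o in orders for c in customers
    if o[0] == c[0]  # the equijoin predicate on the partitioning key
)
```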

If you choose a partitioning key based on your query workload, the partitioning key should typically consist of either a joined column or a set of columns that is frequently used in many queries.

Although collocated tables typically achieve the strongest performance, it is not possible in practice to collocate all tables. In addition, it is not a good idea to select partitioning keys based on a handful of SQL statements. In decision-support environments, queries can often be unpredictable. In this kind of environment, you should examine your data model to determine the best choice for partitioning keys. The data model and the business relationship between tables can provide a more stable way of selecting a partitioning key than specific SQL statements.

When choosing partitioning keys, draw a data model that shows the relationships among the tables that are in your database. Identify frequent joins and high-use tables. Based on your data model, you should select partitioning keys that favor frequent joins and that are based on a primary key. Ideally, you should collocate frequently joined tables. Another strategy to improve the collocation of the join is to replicate smaller dimensional tables on each database partition.

Collocation compared to data balancing

In some cases, you might find that guidelines for choosing the proper partitioning key based on collocation and data balancing contradict one another. In such cases, it is recommended that you choose a partitioning key based on even data balancing.


Validate the partitioning keys on existing tables

If you want to validate how good your partitioning keys are, you should check to see if queries in your workload are collocated and if the data is balanced properly. It is also possible that over time, as your data changes, old partitioning keys become less optimal than they were previously. You can check the collocation in the query joins by looking at the access plan generated by DB2 Explain. If the query is not collocated, you typically will see the TQUEUE (table queue) operator feeding the join, as shown in Figure 1:

Figure 1. Explain graph that includes a TQUEUE operator
Explain output shows the query going through a TQUEUE operator prior to NLJOIN

To check if the data in the table is balanced properly across the database partitions, you can run a simple count on your table grouped by the database partition ID with the help of the DBPARTITIONNUM function.

You can also use the custom stored procedure ESTIMATE_EXISTING_DATA_SKEW routine (available in the Download section), which provides more user-friendly output, including a list of database partitions, the skew percentage as compared to the average, and more. This routine can be run on a sample of the original data for faster performance. (See the Appendix for a full routine description.)

If you are planning to run this routine in a production environment, consider running it during a maintenance window or when the system is under a light load. You may also want to try it on one of the smaller tables with the sample value of 1% to get an estimate of how long it takes to return results. The total execution time is included at the bottom of the report.

Example 1

This example measures the data skew on a table whose partitioning key is S_NATIONKEY, using only 25% of the data in the sampling. As you can see from the output, the data is heavily skewed: row counts on some database partitions deviate from the average by as much as 60%.

Listing 2. Measuring the existing data skew for a single table
$ db2 "set serveroutput on"
$ db2 "CALL estimate_existing_data_skew('TPCD', 'SUPPLIER', 25)"
CALL estimate_existing_data_skew('TPCD', 'SUPPLIER', 25)

  Return Status = 0

DATA SKEW ESTIMATION REPORT FOR TABLE: TPCD.SUPPLIER
Accuracy is based on 25% sample of data
------------------------------------------------------------------------
TPCD.SUPPLIER
Estimated total number of records in the table: : 19,994,960
Estimated average number of records per partition : 2,499,368

Row count at partition 1 : 1,599,376 (Skew: -36.00%)
Row count at partition 2 : 2,402,472 (Skew: 3.87%)
Row count at partition 3 : 4,001,716 (Skew: 60.10%)
Row count at partition 4 : 2,394,468 (Skew: -4.19%)
Row count at partition 5 : 1,600,028 (Skew: -35.98%)
Row count at partition 6 : 1,599,296 (Skew: -36.01%)
Row count at partition 7 : 2,397,116 (Skew: -4.09%)
Row count at partition 8 : 4,000,488 (Skew: 60.05%)

Number of partitions: 8 (1, 2, 3, 4, 5, 6, 7, 8)
------------------------------------------------------------------------

Total execution time: 20 seconds

Example 2

This example demonstrates the usage of the wildcard character in the ESTIMATE_EXISTING_DATA_SKEW routine. Listing 3 shows a report for the existing data skew on all tables that have schema TPCD and a table name that starts with "PART." Since the tables are relatively large, the sample is built on 1% of the data to reduce the performance cost.

Listing 3. Measuring the existing data skew for multiple tables
$ db2 "set serveroutput on"
$ db2 "CALL estimate_existing_data_skew('TPCD', 'PART%', 1)"
CALL estimate_existing_data_skew('TPCD', 'PART%', 1)

  Return Status = 0

DATA SKEW ESTIMATION REPORT FOR TABLE: TPCD.PART%
This report is based on the existing partitioning key
Accuracy is based on 1% sample of data
------------------------------------------------------------------------
TPCD.PART
Estimated total number of records in the table: : 399,799,400
Estimated average number of records per partition : 49,974,900

Row count at partition 1 : 50,051,800 (Skew: 0.15%)
Row count at partition 2 : 49,951,200 (Skew: -0.04%)
Row count at partition 3 : 49,862,500 (Skew: -0.22%)
Row count at partition 4 : 49,986,500 (Skew: -0.02%)
Row count at partition 5 : 50,096,400 (Skew: 0.24%)
Row count at partition 6 : 49,993,900 (Skew: -0.03%)
Row count at partition 7 : 49,955,900 (Skew: -0.03%)
Row count at partition 8 : 49,901,200 (Skew: -0.14%)

Number of partitions: 8 (1, 2, 3, 4, 5, 6, 7, 8)
------------------------------------------------------------------------
TPCD.PARTSUPP
Estimated total number of records in the table: : 1,600,374,100
Estimated average number of records per partition : 200,046,700

Row count at partition 1 : 200,298,100 (Skew: 0.12%)
Row count at partition 2 : 200,154,900 (Skew: 0.05%)
Row count at partition 3 : 200,006,700 (Skew: 0.01%)
Row count at partition 4 : 199,831,600 (Skew: -0.10%)
Row count at partition 5 : 199,962,200 (Skew: -0.04%)
Row count at partition 6 : 200,083,900 (Skew: 0.01%)
Row count at partition 7 : 199,910,300 (Skew: -0.06%)
Row count at partition 8 : 200,126,400 (Skew: 0.03%)

Number of partitions: 8 (1, 2, 3, 4, 5, 6, 7, 8)
------------------------------------------------------------------------
TPCD.SUPPLIER
Estimated total number of records in the table: : 20,000,000
Estimated average number of records per partition : 2,500,000

Row count at partition 1 : 2,498,411 (Skew: -0.06%)
Row count at partition 2 : 2,498,837 (Skew: -0.04%)
Row count at partition 3 : 2,500,996 (Skew: 0.03%)
Row count at partition 4 : 2,500,170 (Skew: 0.00%)
Row count at partition 5 : 2,501,254 (Skew: 0.05%)
Row count at partition 6 : 2,499,654 (Skew: -0.01%)
Row count at partition 7 : 2,501,429 (Skew: 0.05%)
Row count at partition 8 : 2,499,249 (Skew: -0.03%)

Number of partitions: 8 (1, 2, 3, 4, 5, 6, 7, 8)
------------------------------------------------------------------------

Total execution time: 51 seconds

Evaluate the quality of candidate replacement PKs on the existing table

If you decide to change an existing partitioning key, it is important to determine if the new partitioning key that you are considering will result in good query collocation and evenly distributed data.

To check for query collocation, it is recommended that you collect the queries that characterize your workload, place them in a file, and then run a db2advis report to get recommendations on the new partitioning keys:

db2advis -d <database name> -i <workload file> -m P

You can also run a report based on the recently executed queries that still reside in the package cache using the following form of the db2advis utility:

db2advis -d <database name> -g -m P

Listing 4 provides an example db2advis output:

Listing 4. db2advis output
bculinux> db2advis -d tpcds -g -m P
Using user id as default schema name. Use -n option to specify schema
execution started at timestamp 2010-04-06-11.33.04.271678
Recommending partitionings...
Cost of workload with all recommendations included [1761.000000] timerons
1 partitionings in current solution
 [1761.0000] timerons  (without recommendations)
 [1736.0000] timerons  (with current solution)
 [1.42%] improvement


--
--
-- LIST OF MODIFIED CREATE-TABLE STATEMENTS WITH RECOMMENDED PARTITIONING KEYS AND
TABLESPACES AND/OR RECOMMENDED MULTI-DIMENSIONAL CLUSTERINGS
-- ===========================
-- CREATE TABLE "BCULINUX"."ITEM" ( "I_ITEM_SK" INTEGER NOT NULL ,
--       "I_ITEM_ID" CHAR(16) NOT NULL ,
--       "I_REC_START_DATE" DATE ,
--       "I_REC_END_DATE" DATE ,
--       "I_ITEM_DESC" VARCHAR(200) ,
--       "I_CURRENT_PRICE" DECIMAL(7,2) ,
--       "I_WHOLESALE_COST" DECIMAL(7,2) ,
--       "I_BRAND_ID" INTEGER ,
--       "I_BRAND" CHAR(50) ,
--       "I_CLASS_ID" INTEGER ,
--       "I_CLASS" CHAR(50) ,
--       "I_CATEGORY_ID" INTEGER ,
--       "I_CATEGORY" CHAR(50) ,
--       "I_MANUFACT_ID" INTEGER ,
--       "I_MANUFACT" CHAR(50) ,
--       "I_SIZE" CHAR(20) ,
--       "I_FORMULATION" CHAR(20) ,
--       "I_COLOR" CHAR(20) ,
--       "I_UNITS" CHAR(10) ,
--       "I_CONTAINER" CHAR(10) ,
--       "I_MANAGER_ID" INTEGER ,
--       "I_PRODUCT_NAME" CHAR(50) )
-- ---- DISTRIBUTE BY HASH("I_ITEM_SK")
-- ---- IN "HASHTS"
-- DISTRIBUTE BY HASH (I_ITEM_SK)
-- IN USERSPACE1
-- ;
-- COMMIT WORK ;

-- ===========================

To check if the data would be properly balanced using the new partitioning key, you can use the routine ESTIMATE_NEW_DATA_SKEW that is also provided in the Download section. This routine creates a copy of your existing table with the new partitioning key and loads it partially or fully with the data from the original table. It then runs the same data skew estimation report and, at the end, drops the copy table. Note that the table space containing the original table must have enough free space to hold at least 1% of the data from the original table, since the copy is created in the same table space.
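Conceptually, the routine re-hashes a (possibly sampled) copy of the rows on the candidate key and counts rows per partition. The following Python sketch of that idea uses hypothetical rows and Python's `hash` as a stand-in for DB2's internal function, contrasting a high-cardinality key with a low-cardinality one:

```python
from collections import Counter

def estimate_skew(rows, key_index, n_partitions):
    """Re-hash rows on the candidate key column; return % skew per partition."""
    counts = Counter(hash(r[key_index]) % n_partitions for r in rows)
    avg = len(rows) / n_partitions
    return {p: (counts.get(p, 0) - avg) / avg * 100 for p in range(n_partitions)}

# Hypothetical rows: (s_id, s_nationkey) with only 3 distinct nation keys.
rows = [(i, i % 3) for i in range(8000)]

skew_by_id = estimate_skew(rows, 0, 4)      # high cardinality: near-zero skew
skew_by_nation = estimate_skew(rows, 1, 4)  # 3 values cannot fill 4 partitions
```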

Example 3

This example tests the data skew in a scenario in which the partitioning key was changed from S_NATIONKEY to S_ID. This example uses 100% of the data in the sampling. As this example demonstrates, the new partitioning key causes minimal data skew and is a much better choice than the original S_NATIONKEY key from Example 1.

Listing 5. Estimating the data skew for a new partitioning key
$ db2 "set serveroutput on"
$ db2 "CALL estimate_new_data_skew('TPCD', 'SUPPLIER', 'S_ID', 100)"
CALL estimate_new_data_skew('TPCD', 'SUPPLIER', 'S_ID', 100)

  Return Status = 0

DATA SKEW ESTIMATION REPORT FOR TABLE: TPCD.SUPPLIER
This report is based on the new partitioning key: S_ID
Accuracy is based on 100% sample of data
------------------------------------------------------------------------
TPCD.SUPPLIER
Estimated total number of records in the table: : 20,000,000
Estimated average number of records per partition : 2,500,000

Row count at partition 1 : 2,498,411 (Skew: 0.06%)
Row count at partition 2 : 2,498,837 (Skew: 0.04%)
Row count at partition 3 : 2,500,996 (Skew: 0.03%)
Row count at partition 4 : 2,500,170 (Skew: 0.00%)
Row count at partition 5 : 2,501,254 (Skew: 0.05%)
Row count at partition 6 : 2,499,654 (Skew: 0.01%)
Row count at partition 7 : 2,501,429 (Skew: 0.05%)
Row count at partition 8 : 2,499,249 (Skew: 0.03%)

Number of partitions: 8 (1, 2, 3, 4, 5, 6, 7, 8)
------------------------------------------------------------------------

Total execution time: 20 seconds

Change the PK while keeping the table online

A new routine in DB2 9.7 named ADMIN_MOVE_TABLE allows you to automatically change the partitioning key of a table while keeping the table fully accessible for reads and writes. In addition to the partitioning key change, this procedure can move the table to a different table space, change column definitions, and more.

Example 4

This example changes the partitioning key of the TPCD.PART table from COL1 to (COL2, COL3). It uses the LOAD option to improve the performance of the ADMIN_MOVE_TABLE routine.

Listing 6. Changing partitioning keys
CALL SYSPROC.ADMIN_MOVE_TABLE
      ('TPCD', 'PART', '', '', '',
       '', 'COL2, COL3', '', '',
       'COPY_USE_LOAD, FORCE', 'MOVE')

Result set 1
  --------------

  KEY                              VALUE
  -------------------------------- ----------------------------
  AUTHID                           TPCD
  CLEANUP_END                      2010-03-12-12.40.17.360000
  CLEANUP_START                    2010-03-12-12.37.43.297000
  COPY_END                         2010-03-12-12.37.42.704000
  COPY_OPTS                        OVER_INDEX,LOAD,WITH_INDEXES
  COPY_START                       2010-03-12-11.18.40.563000
  COPY_TOTAL_ROWS                  400000000
  INDEX_CREATION_TOTAL_TIME        0
  INDEXNAME                        PROD_ID_PK
  INDEXSCHEMA                      TPCD
  INIT_END                         2010-03-12-12.59.40.266000
  INIT_START                       2010-03-12-12.40.39.172000
  REPLAY_END                       2010-03-12-11.18.43.125000
  REPLAY_START                     2010-03-12-11.18.42.704000
  REPLAY_TOTAL_ROWS                0
  REPLAY_TOTAL_TIME                0
  STATUS                           COMPLETE
  SWAP_END                         2010-03-12-11.18.43.250000
  SWAP_RETRIES                     0
  SWAP_START                       2010-03-12-11.18.43.125000
  VERSION                          09.07.0000

21 record(s) selected.

While the ADMIN_MOVE_TABLE procedure is running, the TPCD.PART table is fully accessible and the change to the partitioning key is transparent to the end users.


Conclusion

Choosing appropriate partitioning keys is essential for optimizing database performance in a partitioned environment based on DB2 software. This article provided guidance and tooling for choosing the best partitioning keys based on your needs.

This article described:

  • The concepts related to partitioning keys, and the rules and recommendations for creating partitioning keys
  • Routines that can help you estimate the data skew for new and existing partitioning keys
  • How to change partitioning keys while keeping the table fully accessible

Appendix: Routine reference documentation

Prerequisites

Both the ESTIMATE_EXISTING_DATA_SKEW and the ESTIMATE_NEW_DATA_SKEW procedures are supported in DB2 9.7 or later. The routine ADMIN_MOVE_TABLE that is used for the actual movement of the table is shipped with the core DB2 9.7 product or later. For the ESTIMATE_NEW_DATA_SKEW routine, there must be enough free space in the table space that contains the original table to store the sampling data.

Deployment instructions

  1. Download and save the estimate_data_skew.sql file found in the Download section.
  2. Connect to the database from the command line and deploy the routines using the following command:
    $ db2 -td@ -vf estimate_data_skew.sql

ESTIMATE_NEW_DATA_SKEW procedure

The ESTIMATE_NEW_DATA_SKEW routine estimates the data skew of individual database partitions on an existing table with a new partitioning key. To improve the performance and lower the storage requirements of this routine, the estimation can be based on a subset of the data using extremely fast sampling on the page level.

Syntax
>>-ESTIMATE_NEW_DATA_SKEW--(--in_tabschema--,--in_tabname--,------>
>--new_partitioning_keys--,--sampling_percentage--+---------------+--)---->
                                                  +,--csv_format--+

Procedure parameters

csv_format (optional)
This optional input parameter controls the format in which the data is returned. A value of 'Y' requests CSV output under the headings SCHEMA, TABLE, PARTITION, SAMPLE%, TABLEROWCOUNT, PARTAVG, PARTROWCOUNT, and SKEW. The parameter defaults to 'N', which produces the regular report format.
in_tabschema
This input parameter specifies the name of the schema that contains the table to be estimated for data skews. This parameter is case-sensitive and has a data type of VARCHAR(128). This parameter does not support wildcards.
in_tabname
This input parameter specifies the name of the table to be estimated for data skews. This parameter is not case-sensitive and has a data type of VARCHAR(128). This parameter does not support wildcards.
new_partitioning_keys
This input parameter specifies the new partitioning keys to be used in the estimation of data skews.
sampling_percentage
This input parameter specifies the percentage of data to be used in the data skew estimation. Valid values are 1 to 100, where 100 means that the stored procedure will use all records in the table for the estimation. The purpose of this parameter is to improve performance and minimize the space usage when estimating data skew with new partitioning keys. If performance and disk space are not an issue, specify 100 for this value.
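The extrapolation behind sampling_percentage is simple proportional scaling. A sketch with hypothetical numbers:

```python
# Scaling a sampled row count up to an estimated total (hypothetical numbers).
sampling_percentage = 25
sampled_rows = 4_998_740  # rows counted in the 25% page-level sample

estimated_total = round(sampled_rows * 100 / sampling_percentage)
```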

ESTIMATE_EXISTING_DATA_SKEW procedure

The ESTIMATE_EXISTING_DATA_SKEW stored procedure estimates the data skew of individual database partitions in one or more tables based on the existing partitioning keys. To improve the performance of this procedure, the estimation can be based on a subset of the data using extremely fast sampling on the page level.

Syntax
>>-ESTIMATE_EXISTING_DATA_SKEW--(--in_tabschema--,--in_tabname--,->
>--sampling_percentage--+---------------+--)---->
                        +,--csv_format--+

Procedure parameters

csv_format (optional)
This optional input parameter controls the format in which the data is returned. A value of 'Y' requests CSV output under the headings SCHEMA, TABLE, PARTITION, SAMPLE%, TABLEROWCOUNT, PARTAVG, PARTROWCOUNT, and SKEW. The parameter defaults to 'N', which produces the regular report format.
in_tabschema
This input parameter specifies the name of the schema that contains the table to be estimated for data skews. This parameter is case-sensitive and has a data type of VARCHAR(128). This parameter supports % as a wildcard. If the NULL value is specified, a report will be run for all schemas defined in the database.
in_tabname
This input parameter specifies the name of the table to be estimated for data skews. This parameter is not case-sensitive and has a data type of VARCHAR(128). This parameter supports % as a wildcard.
sampling_percentage
This input parameter specifies the percentage of data to be used in the data skew estimation. Valid values are 1 to 100, where 100 means that the stored procedure will use all records in the table for the estimation.

Download

Description                          Name                     Size
Sample SQL script for this article   estimate_data_skew.zip   4 KB

Resources

Learn

Get products and technologies

  • DB2 9.7 for Linux, UNIX, and Windows: Download a free trial version of DB2 9.7 for Linux, UNIX, and Windows.
