Skip to main content

skip to main content

developerWorks  >  Information Management  >

Migrate DB2 applications to a partitioned database

developerWorks
Document options

Document options requiring JavaScript are not displayed


Rate this page

Help us improve this content


Level: Intermediate

Bill Wilkins (wilkins@ca.ibm.com), Partner Enablement, IBM Information Management, IBM Canada
Peter He (peterzh@ca.ibm.com), Database Consultant, Partner Enablement, IBM 

13 May 2004
Updated 24 Apr 2008

Using a partitioned database allows you to maximize the scalability and capacity of DB2 for Linux, UNIX, and Windows. This article discusses what DB2 database partitioning is, outlines the benefits and costs of using it, and helps you decide whether or not to use a partitioned database, as well as how to migrate to one. The focus is on application migration, but we also discuss design, configuration, and operational considerations for a partitioned database.

Introduction: What is a DB2 partitioned database?

The DB2 SQL Reference manual states that, "a partitioned relational database is a relational database whose data is managed across multiple partitions (also called nodes). This separation of data across partitions is transparent to users of most SQL statements."

The Database Partitioning Feature (DPF) is required to partition a DB2 Enterprise Server Edition (ESE) database, either within a single server or across multiple servers. The DPF is actually just a license, and does not require any additional products to be installed on top of DB2 ESE.

Note

Prior to DB2 UDB Version 8, database partitioning support was provided through DB2 Enterprise-Extended Edition (EEE), which was an installable product.

This article is based on DB2 V9.1 for Linux, UNIX, and Windows. Changes to DPF could occur in future.

If you aren't familiar with the basics of DPF, you should read the DB2 Information Center material on partitioning (see Resources); or the developerWorks article entitled "A Colorful Introduction to DPF - Overview of the Database Partitioning Feature."

DPF concepts and terminology

This article assumes that you understand the basic architecture of a DB2 partitioned database environment, but let's quickly review the key concepts behind DPF.

A DB2 installation is designated as partitioned by "adding systems" configured with DB2. The procedure of adding systems is different on the UNIX and Linux operating systems than on Windows, but the result on all is a file, db2nodes.cfg, that lists the systems in the configuration and the partition(s) on each. The term node is sometimes a source of confusion. It often means an entire system within a partitioned configuration, but it sometimes refers to a partition itself. For example, a single system might have multiple partitions defined on it. To avoid any ambiguity, we avoid using the term node and use partition or database partition instead.

When a database is created in a partitioned environment, the system catalog tables are created on the partition that the CREATE DATABASE command was executed on; that partition is then known as the catalog partition. A database partition group, is essentially a list of partitions and an associated partitioning map, which is an array of 4096 entries, each with a value corresponding to one of the partition numbers in the database partition group. Each database partition group can include anywhere from one to all of the partitions in the instance.

When a new database is created, three database partition groups are defined, the most significant of which, IBMDEFAULTGROUP, includes all of the partitions. The database partition group for the system catalog tables, IBMCATGROUP, contains only the catalog partition. Each table space is created in a particular database partition group. The default table space for user data, USERSPACE1, is created in IBMDEFAULTGROUP and therefore spans all partitions.

If a table is created in a table space whose database partition group has one partition, all of the rows in the table will be stored on that partition, like in a non-partitioned database configuration. In this article, we'll call such a table a non-partitioned table. If a table is created in a table space whose database partition group has multiple partitions, the rows in the table will be dispersed among the partitions. Each row will be stored entirely within one partition. We'll call such a table a partitioned table, and an application that uses such tables a partitioned application.

A partitioning key is a set of one or more columns that must be defined (explicitly or by default) for every partitioned table. For such tables, the determination of which partition a new row is inserted into is made by hashing the row's partitioning key value(s) to an entry in the partitioning map, which contains the partition number to use.

The key performance goal for database design in a partitioned environment is collocation. This is where the joining of tables is done within one or more individual partitions, in contrast with directed or broadcast joins, which involve rows being sent to specific or all partitions, respectively. We'll discuss these techniques later.

One way to achieve collocation is by defining a particular small table as being replicated, which means that a full copy of the table is maintained on each partition and therefore its rows never have to be sent between partitions to do a join. DPF is a shared-nothing environment, which means that resources are not shared among partitions. Each partition has its own subset of data, and the indexes on the partition have entries for only the rows in that partition. Each partition has its own buffer pool(s), database configuration parameter settings, log files, lock manager, etc.. If partitions are added to a populated partitioned database, the data in each database partition group can be redistributed to balance the number of rows per table as evenly as possible across all partitions.

When an application connects to a partitioned database, it is connected to one particular partition (which can be chosen by the application), and that partition is known as the coordinator partition. The coordinator agent for the application runs on that partition and is responsible for satisfying the application's requests, accessing any or all of the partitions in the database, as appropriate.



Back to top


Why use a partitioned database?

There are three primary reasons why you might want to use a partitioned database:

  • To increase data capacity, or improve throughput or response times, by exploiting the hardware resources of additional systems.
  • To fully exploit the memory on a 32-bit system.
  • To circumvent DB2 limits on the number of bytes and rows in a nonpartitioned table (and per partition, for a partitioned table) - not so much a factor anymore with DB2 V9.1 for Linux, UNIX, and Windows.
What partitioning primarily gives you is the ability to exploit hardware, and the features specific to DPF support and optimize that exploitation without adding unique functionality that would require application changes to use.

Exploit multiple systems

This is the most common reason for using partitioning. Some application workloads simply impose too much of a burden for large single systems to handle with acceptable performance. For example, the CPU or memory capacity of the single system may be inadequate, and upgrading the system might still be insufficient, or too expensive. Or perhaps the data volume is so great that the system's I/O capacity is exceeded during utility operations, and performance suffers.

Using database partitioning, adding new systems and partitions, and redistributing the data to utilize the new partitions, allows DB2 to exploit the extra hardware resources. With good database design and the DPF's shared-nothing approach, very good performance and scalability improvements can be achieved. These improvements can take various forms, but common ones of interest are increased throughput of OLTP applications, without negatively impacting response times, and reduced elapsed time of BI/DSS queries (or the ability to handle more queries without increasing elapsed times).

Let's look at an example that's a typical best-case scenario for DPF. Suppose table T1 has 100M rows in a single-partition database on system S1. We run SELECT COUNT(*) from T1, and it takes N seconds to complete. We then add three additional systems (exactly like S1) to the configuration, with one partition defined on each system, and redistribute T1's database partition group across all partitions. Running the same SELECT statement again would now be able to use the CPU, memory, and disk I/O resources of four systems instead of one, and should run in just over (N * 0.25) seconds.

Memory exploitation on a single 32-bit system

A 32-bit operating system can provide virtual addressability to at most 4GB of memory in a single process or thread. Since each DB2 agent (which runs as a process or thread) needs access to the Database Global Memory for the database that it's connected to, the Database Global Memory is limited by 32-bit addressability, and the maximum usable is about 1GB to 3.3GB, depending on the platform. The DB2 buffer pools come out of Database Global Memory, and their size is usually the biggest single configuration factor in determining the performance of a database, so the limit on Database Global Memory size can significantly restrict performance.

Keeping in mind that DB2 partitions are essentially independent databases with their own memory allocations, each partition has its own Database Global Memory and therefore uses additional memory. For example, if you have a system with 8GB of memory, adding a second partition (sometimes called a logical partition) on that system would allow you to exploit about twice as much memory and may provide better performance. Note that you can combine this reason for using DPF with the previous one above: i.e., you can exploit memory by having multiple partitions on each of several systems. With the shift towards 64 bit computing, the need for DPF to overcome memory limits will be reduced.

Circumvent limits

Prior to DB2 V9.1, there were practical limits on the amount of data a table could store in each partition. The limit is based on the table's (or, more accurately, its table space's) page size. These limits have been increased, as shown in Table 1.


Table 1. Limits on table size starting in DB2 V9.1
4K page size 8K page size 16K page size 32K page size
Maximum table size per database partition in a regular DMS table space(in GB):64128256512
Maximum table size per database partition in a large DMS table space(in GB):20484096819216384
Maximum regular DMS tablespace size per partition (in GB):64128256512
Maximum large DMS tablespace size per partition (in GB):20484096819216384

Another limitation that was eased was the 4 billion rows per table per partition. This limit is now is 1.1 trillion rows. With the relaxation of these limits, the need to move to a DPF configuration may not be as accelerated as it used to be. These and other DB2 limits are documented in Appendix A, "SQL Limits", of the SQL Reference manual.



Back to top


Is a partitioned database appropriate for your application?

Now that you understand the benefits of a partitioned database, let's discuss how to decide whether to use one or not. By the time you work through this section and do some analysis of your application's current and projected behavior, it may be fairly clear whether or not partitioning is worth pursuing for you. In some cases you may need to do some prototyping with a subset of your application, using the material in Do the migration.

Consider if any of the three reasons above (under Why use a partitioned database?) apply, now or in the foreseeable future. To do this requires a basic understanding of your application running against a nonpartitioned database. If your application currently doesn't run on DB2 at all, we suggest that you first get it running without partitioning and study its performance with a reasonable database size. The discussion is in terms of the present, but remember to consider possible future changes such as increased data volume.

  • Exploit multiple systems:
    To evaluate this possibility you need to study your current performance and determine what the constraints are. If the CPU, I/O, and memory usage are well under system capacity, additional systems will provide little benefit. Even if some resource is close to being exhausted, you may be able to upgrade your system instead of adding more systems. To evaluate future needs, understand the timings of your key tasks and try to evaluate how database expansion will affect them. The elapsed times for some operations, such as table scans and database backups, should scale almost linearly as the table or database grows (assuming no tuning), and you can assume that using multiple systems will reduce the elapsed times almost linearly, through the "divide and conquer" approach. At the other extreme, tasks like SELECT statements using primary keys are affected very little by table size, so for these tasks there's little need for multiple systems.
  • Memory exploitation on a single 32-bit system:
    There are a few requirements that must be met for DPF to help by supporting multiple partitions on a 32-bit system:
    • You must have a system with at least 2GB of memory, but usually 4GB or more is needed. The more memory, the higher the likelihood that DPF will help.
    • Performance must be significantly constrained on I/O or large sorts. By going to multiple partitions with their own buffer pools and sort heaps, more memory can be exploited. However, if the system already runs at nearly 100% CPU utilization and very rarely waits for I/O, using more buffer pool space will provide no advantage, as there will be an insignificant reduction in the amount of I/O; also, adding partitions to a CPU-bound system (without adding any CPU power) would be more likely to hurt performance than help it. If large sorts are a concern, having multiple partitions can help such tasks as index creation, which involves sorting in each partition in parallel. On the other hand, some queries do not lend themselves to parallel sorting on multiple partitions, and in such cases having one larger sort heap, not more of them spread across partitions, is all that will help.
    • Even if I/O is significant, if most of it is in the form of direct reads and writes of LOB or LONG data (which is not buffered in the buffer pool), having multiple buffer pools will not reduce I/O.
    • Having multiple partitions can help if log writing is a bottleneck and the partitions' log files are written to separate disks.
    Moving to a 64-bit database server will allow all system memory to be exploited. However, there are issues with this approach in the area of interoperability (Some third-party software may not yet support 64-bit DB2 server, so this approach may not be feasible for some customer environments. Still, you may want to consider 64-bit as an alternative to having a partitioned database on a single system or even a few separate systems.
  • Circumvent limits:
    It should be pretty obvious if this is relevant. Run "runstats on table <schema>.<table>" for every table (or "reorgchk update statistics on table all"). Then run a query such as:
    "SELECT tabname, tabschema, tbspaceid, card, npages FROM syscat.tables ORDER BY tbspaceid, tabschema, tabname"
    to see how many rows and pages are in use per table. (To find the minimal number of pages needed, do a REORG of tables before updating statistics.) You may need to check these figures over time, to study growth rates.
    If adding partitions to avoid these limits would be your only reason to use DPF, it's worth stopping to consider whether you may instead be able to use DB2 9's table partitioning, larger page sizes, or split tables horizontally or vertically into multiple physical tables and "combine" them via UNIONs or views. These approaches generally are less costly than using DPF.

We'll discuss how to design an application and database to use DPF in more detail shortly, but the key facts behind such design work are:

  • each table can have only one partitioning key;
  • a primary design goal for large tasks is to have the work done in parallel over all partitions;
  • a primary design goal for small tasks is to perform the work on as few partitions as possible, ideally one; and
  • a goal for all tasks is to minimize communication between partitions.

Those facts largely determine the types of database usage that are most and least appropriate for DPF.

  • An ideal scenario for partitioning is a statement like "SELECT COUNT(*) FROM big_table". Placing this table on all partitions allows each partition to count its rows and send the subtotal to the coordinator partition for computing the overall count, and the communication cost is negligible compared to the work done within each partition.
  • Another good scenario is where one large table is joined with several very small tables that are infrequently updated. The large table would be partitioned, and the small ones replicated on all partitions, enabling the joins to be collocated.
  • A good OLTP scenario is where multiple tables are involved in the key transactions, but the tables all have a common column that is used as partitioning key, and processing is based off a value of that column. When this occurs, all of the processing can occur within one partition. An example might be a set of tables, ORDER, STORE, and SHIPMENT, with STORE_ID as the partitioning key for each of them. When a customer makes a purchase, an insert could be done in ORDER, an update of the STORE row (column TOTAL_SALES), and an insert in SHIPMENT, all within the same partition.
  • Poorly suited for partitioning are ad-hoc query environments with many large tables and a wide variety of tables and columns involved in joins. In such situations, it can be difficult or impossible to choose the tables' partitioning keys such that all significant queries can be executed without heavy inter-partition communication.
  • Also bad for partitioning are transactions with multiple very light statements that can't be processed within a single partition. In this case, the overhead of communication between partitions is very high relative to the cost of executing the statements locally, and response times could be significantly worse with partitioning (especially across multiple physical systems). However, the extra throughput that could be achieved may be worth some degree of sacrifice in response times.
  • Most workloads and specific tasks fall between the two types of extremes just discussed, and those are where prototyping may be necessary to study the effects of partitioning.

Expanding upon the last point, be aware that the scalability of different tasks will vary greatly as you move from a non-partitioned to a partitioned database. At the positive extreme, queries such as: "SELECT COUNT(*) FROM big_partitioned_table" and tasks like CREATE INDEX and REORG will run very close to N times faster as you move from one to N partitions (assuming each partition has the same resources and the data volume remains constant). At the negative extreme are situations where either (a) partitioning introduces extra work, or (b) at least some of the work must be done serially. An example of both of these, covered in more detail later in Populate tables, is LOAD. For a load into two partitions, the data must be partitioned (hashed), sent to the appropriate partition, and then loaded. The first two steps represent additional work that doesn't occur if the same table was non-partitioned, and that work must be done before the actual load operation occurs. Some tuning is possible, but it's not unusual for a load operation to be slower with two partitions than with one. Another example is performing BACKUP and RESTORE operations on a partitioned database. These processes can run in parallel on every partition, with the exception that the catalog partition must always be backed up or restored before the others. It is beyond the scope of this article to cover every DB2 task, but you should consider which ones are most critical to your application and estimate, or prototype and measure, the effect of partitioning on them.

You also need to consider the costs of using DPF. The clearest cost is the (per processor) license charge for DPF, which is in addition to the charge for ESE. Because prices can vary, we won't get specific here, but you can assume that DPF adds about 30% per processor to the ESE charge. Other costs include hardware and software configuration, partitioning key and nodegroup design and prototyping, all forms of testing, and generally higher operational costs due to the added complexity of partitioned environments.

Before we provide more details about migrating an application to a partitioned environment, you may want to scan through Appendix A: DB2 features applicable to DPF. This has a comprehensive list of aspects of DB2 that are pertinent to DPF, and may provoke additional thoughts on whether DPF is or isn't appropriate for you.



Back to top


Do the migration

Let's assume you've decided that changing your application and its related database to use DPF seems like a good thing, or you at least want to investigate further. Now, what steps should you follow to do the migration?

The steps are described as though you were changing your entire application and database to support database partitioning, but it's almost always a good idea to first choose an important subset and prototype it. Most of the following steps are not really independent of the others: in most cases, developing a partitioning design will involve an iterative process of refinement. Keep things simple while you develop an understanding of partitioning and the behavior of your application with it.

As you complete your initial testing and move into production-like configurations, you will need to work through the steps again with the different environments in mind. Most of the content here applies to both test and production environments, but there are particular considerations given for one or the other.

  1. Decide what hardware to use and how many partitions to have. See Choose system and number of partitions.
  2. Create a partitioning environment, including database. See Create a partitioning environment.
  3. Design your database partitioning groups and decide which tables are to be partitioned. See Plan database partition groups and table partitioning.
  4. Plan the associations between tables, table spaces, and database partition groups. See Assign tables to table spaces and database partition groups.
  5. Choose the partitioning keys for your partitioned tables. See Choose your partitioning keys.
  6. Create database partition groups, table spaces, tables, and other database objects. This step includes changing your DDL to reflect your design decisions from the previous steps, then executing that DDL. Try to minimize the differences between your DDL for non-partitioned and partitioned environments. Since you can create your tables with partitioning keys even in a non-partitioned environment, in most cases you will be able to limit the DDL differences to database partition group and table space definitions.
  7. Change the application to deal with partitioning. Ideally your application designer(s) will do this step in parallel with the above, more DBA-oriented, steps. See Tailor the application to a partitioned environment.
  8. Configure the registry, instance, and database configuration parameters. See Perform registry, DBM and DB configuration.
  9. Populate your tables. See Populate tables.
  10. Perform functional testing. Obviously, functional testing is a key component of any migration project. In the case of a migration to a partitioned environment, your testing should include a major subset of the standard tests used for your non-partitioned application. Given the transparency of DPF from an application point of view, this type testing should find very few problems. You should heavily test the changes that have been made to specifically support partitioning, using both single-user and multi-user tests. Once the basics have been tested, you should attempt to produce error situations and see how your application deals with them. There are more things that can go wrong in a partitioned database environment, and you should make sure you know how to deal with them before going into production. For example, issue the following command: db2stop force dbpartitionnum 2 and see what happens when partition 2 is shut down.
  11. Do performance testing. See Performance test.
After your application has been fully tested, your focus should shift to operational matters. Continue with Operational and miscellaneous considerations.

Choose system and number of partitions

If you have unused capacity on a system that is already used as a test environment for your non-partitioned DB2 application, you may want to keep things simple and start your DPF work on that system. It's recommended that the system have at least four processors and 1GB of memory. For early prototyping work you should begin with two or four partitions. With two partitions you will have minimal setup cost and still be able to exercise all of the partitioning functions. Four partitions will require a bit more setup but will give you a better idea of scalability issues and be closer to your final configuration. From the standpoint of early identification of issues, a preferable starting configuration would be two physical machines with two partitions on each. That would force some of the communication between partitions to go through the network and make performance issues more obvious. If you have no capacity on your current machines, we suggest that you minimize your hardware expenses and use one or two small-scale systems until you know how your partitioned application performs.

Another consideration is whether you should have either or both of the two types of special dedicated partitions. These are only needed for production environments, and should only be considered when the number of partitions is more than ten. The first type is a dedicated catalog partition, containing the catalog tables and no user data. This is fairly common where BACKUP and RESTORE times are critical. As mentioned earlier, both of these operations must completely process the catalog partition before starting on the other partitions. By not having user data on the catalog partition, backup or restore can finish more quickly and only minimally delay the start of the (parallel) operations against the other partitions. This approach can also be appropriate when there is a high volume of queries against the catalog tables.

The second type of possible dedicated partition is a coordinator partition. There can be more than one of these, and the idea is that if a high volume of data is routinely transferred through the coordinator partition(s) for your user connections, it might consume most of the CPU on those partitions and slow down data access. Having partitions that do nothing but act as coordinators may make sense, but it would need to be studied carefully. If you do use these partitions, the nodes they're on will require less physical memory than nodes with the same number of partitions containing user data.

We should stress that the need for special or dedicated partitions is not true in general. For almost every situation, it's highly recommended that every system in the partitioning configuration be as identical as possible, in terms of hardware, software, and database layout. Adhering to that policy will avoid confusion and greatly simplify setup, which we'll discuss shortly.

After you migrate and test your application you should do performance and stress testing with production-level data volumes. Here are some rules of thumb for an ideal configuration for that type of testing (and non-HA production use):

  • One partition for each 100-300GB of raw data (not including indexes).
  • Less than 100GB in one table per partition.
  • One or two CPUs per partition. You should never have more partitions than CPUs.
  • 3-4GB of memory per CPU.
  • At least five disks per CPU, ideally in a good storage subsystem. The total storage capacity should be at least three times the raw data volume.
  • One 2GB fibre channel adapter for every 2 CPUs.
  • 1GB Ethernet Adapter per 4 CPUs. Having the fastest communication possible between partitions is critical for achieving the best performance.

Here's some guidance on the trade-offs in how many partitions to have in a production environment.
With fewer partitions (more CPUs per partitions, and more data per partition), you generally have:

  • the ability to handle more concurrent queries
  • longer single-query run times (since there's more data per partition)
  • better OLTP response times (due to less communication between partitions)

With more partitions you generally have:
  • faster single-query performance
  • faster utility performance (BACKUP, RUNSTATS, REORG, etc.)
  • higher OLTP throughput, but worse response times

Create a partitioned environment

There are two approaches that can be used to create a test partitioned environment: migrate an existing one or start from scratch. It is cleanest to start from scratch using the documented instructions, but let's first consider migrating an existing single-partition environment to DPF.

To migrate an existing environment, you'll need to consider the instance and the database(s). The first step is to either backup or drop the existing database(s), depending on whether you want to keep using them or not. Migrate the instance by installing the license key for DPF (thus allowing partitioning to be used) and running db2iupdt to update the existing instance to support partitioning. After the instance migration, any existing databases should be usable. If you want to use single-partition database(s) from a different system (subject to the normal restrictions on platform compatibility), you can restore them; no re-directed restore is required, unless you are changing the table space containers, but the initial single-partition DPF configuration must have 0 as its partition number.

Once any existing database(s) are in place, you'll need to follow the steps to prepare for and install DB2 ESE and DPF on the additional systems, creating partitions in the same manner as described in the "from scratch" installation (see below). Finally, you'll need to create or alter database partition groups and use the REDISTRIBUTE command to redistribute partitioned tables across the desired partitions.

If you are creating a partitioning environment from scratch, the full details are described in Quick Beginnings for DB2 Servers manual. The response file approach is recommended for simplifying the installation when there are more than a few partitions.

Some of the key aspects of setting up a partitioned environment are:

  • The db2nodes.cfg file is the most critical file to be set up. Information about it can be found in the Administration Guide: Implementation manual, under the topic "Creating a node configuration file". Hopefully you have a high-speed communications interface between the partitions. In this case, the fourth column of db2nodes.cfg (in UNIX) is used to provide the high-speed network name for each partition, for use in inter-partition communication. (The first column gives the name used for client-server communication with the partition. That name is also used for inter-partition communication when the fourth column is empty).
  • DB2 itself is installed on each system, but the sqllib directory (which contains db2nodes.cfg and other files) must be in a (shared) location that is accessible to each node. In the case of UNIX, sqllib is in the instance's home directory.
  • The services file (e.g., /etc/services in UNIX) is another critical file. Each node will have its own version of it and the entries for the partitioned environment must be identical in each version.
  • When you create a partitioned database, it must NOT be created using NFS or a network drive as the path. Similarly, all table space containers and log files must be on disks that are local to their partition. It's a good idea, however, to use the same directory names and structures on every partition, in keeping with the goal of having the partitions be as identical as possible.
  • If you care which partition will be the catalog partition, log in to that partition before running the CREATE DATABASE command.

Plan database partition groups and table partitioning

There are several considerations here:

  • You will almost always want to create at least one single-partition database partition group for your small tables.
  • You will almost always want to use at least one database partition group that comprises all partitions, for your largest table(s). This could be the default, IBMDEFAULTGROUP.
  • The more partitions you have, the higher the likelihood that some tables are too big for a single partition but too small to be spread over all partitions, and the more you would want to create database partition groups over several, but not all, partitions. For example, it would not make sense to spread a 100MB table over 16 partitions, but it's larger than you'd like to have in a single partition, so you may want to put it in a database partition group with a few partitions. You could have several database partition groups, each on a different subset of partitions. One risk with this approach is that you could easily have unbalanced load on different partitions, which goes against one of the objectives of using partitioning: exploit all of the available hardware.
  • In order for collocation to take place, tables must be in the same database partition group. Why not, then, use one database partition group for all your partitioned tables? There are a couple of reasons not to:
    • Small tables should not be partitioned at all, and medium tables should not be spread over too many partitions: each of these mistakes imposes extra overhead for little gain. A rule of thumb is to not partition tables that have fewer than about 100,000 rows and 20MB.
    • The REDISTRIBUTE command operates on an entire database partition group. If you have all your partitioned tables in one database partition group, running the REDISTRIBUTE command will process all tables in one execution, which may take longer than you'd like. This setup also limits your flexibility in changing the partitioning of some tables and not others. So, if you have sets of tables that you know will never be joined together, it's best to have each such set in a separate database partition group (even if they contain the same partitions). There can be any number of database partition groups with the same set of partitions, and any particular partition can be in any number of database partition groups.
    • If you choose to use dedicated catalog or coordinator partitions (as discussed in Choose system and number of partitions), your database partition groups will need to be defined accordingly, by omitting those partitions from the partition lists for the database partition groups for user data and system temporary data.
  • Another point to consider is which (non-partitioned) tables to replicate. The prime candidates are small and medium tables which are infrequently updated and frequently joined with large tables. But, until queries are running on the partitioned database, it often difficult to know which tables should be replicated. A sample replicated table definition is:
    "CREATE TABLE t1_rep AS (SELECT * FROM t1) DATA INITIALLY DEFERRED REFRESH DEFERRED IN all_partition_tbspc REPLICATED"
    Remember to create appropriate index(es) and execute RUNSTATS on the replicated table. Note that if the base table is replicated to a table space that includes the base table's partition, that partition will have two copies of the table (original and replica).

    Materialized query tables (MQTs) can also be replicated. MQTs are pre-computed query results, usually aggregations from large tables, and are usually quite small and therefore good candidates for replication, assuming the MQT can be used. The same considerations apply as for small base tables being replicated.

Assign tables to table spaces and database partition groups

Since tables are assigned to table spaces and table spaces to database partition groups, you will need to plan the assignments with the above considerations in mind. For example, tables that you'll want to have in different database partition groups must be placed in different table spaces, so when it comes to deciding on how many table spaces to have, you should err on the side of too many, allowing yourself maximum flexibility in assigning them to database partition groups (any number of table spaces can be assigned to a database partition group). Another thing to remember is the DMS table space size limit: you will need to avoid grouping tables in the same table space such that the limit will be hit.

If you are designing a non-partitioned application but want to make it easier to support partitioning in future, try to plan the table / table space / database partition group assignments as though you are using DPF. Without DPF, you will not yet be able to define multiple-partition database partition groups, but you can create all of the database partition groups you will eventually need, and when DPF is in use, add partitions to them.

When choosing container names for table spaces in multiple-partition database partition groups, use simple names that are identical on every partitition, except for the partition number; for example, "/db2data/large_tblspace_Pn", where the suffix "n" is the partition number. By doing this you can use the " $N" syntax in CREATE TABLESPACE to greatly simplify your statements, as in the following statement, which will create a container named as above on every partition that the statement is executed on:
"CREATE TABLESPACE ts1 MANAGED BY DATABASE USING (FILE '/db2data/large_tblspace_P $N' 10000)".
(The space before "$N" is required, and gets removed during the substitution.)

Choose your partitioning keys

Given a particular DPF workload and configuration, the biggest single performance factor is the choice of partitioning keys for your tables. However, there may not be one absolute and obvious best set of them, as one set may optimize some queries and another set others, so you have to look at your overall workload and know what's most important to optimize. You may also have to try various alternatives before determining the best choices. Although the task of choosing partitioning keys can be quite onerous, the DB2 design advisor can help you by providing recommendations that can save you weeks of time.

It's very important to be clear that DPF partitioning is based on hashing, with the intent to spread rows evenly over partitions and the available hardware resources to maximize performance. It is not wise to choose partitioning keys in an attempt to implement range partitioning. Let's adapt an earlier example to illustrate this. Suppose we have four partitions. The partitioning key for table T1 is column C1, there are many rows with each distinct value of C1, and the query of interest is:
SELECT COUNT(*) FROM t1 WHERE c1 = ?.
Since C1 is the partitioning key, all rows with the same value for C1 will be in the same partition, which you may think is a good idea, because the rows with the same value of C1 are "together". However, consider the performance of the above SELECT statement. All the work of scanning and counting the rows is being done using one partition's resources, while the other partitions' resources are sitting idle (for this query). The query could be executed almost four times faster if we used a different partitioning key, say, C2, which has values not correlated to those of C1. This would result in each partition holding about 1/4 of the rows with a given value of C1. Note that if there are only a few rows with each value of C1, C1 would be a good choice for the partitioning key.

Here are some of the considerations involved when choosing a partitioning key:

  • Once a table is created as a DPF partitioned table, you can't directly change its partitioning key. To change it, you could use one of these approaches:
    1. Unload the table to a file, drop and re-create the table with the new partitioning key, and reload the table.
    2. Create a new version of the table with a temporary name and the new partitioning key, load the new table from the old one (the fastest way is: "DECLARE mycursor FOR SELECT * FROM t1", followed by "LOAD FROM mycursor OF CURSOR REPLACE INTO t1_new"), drop the old table, rename the new table to the real name, and recreate indexes and re-do other steps as when the table was originally created.
  • With ALTER TABLE you can add or drop a partitioning key, but only for a non-partitioned table.
  • The columns in any unique or primary key constraint defined on the table must be a superset of the partitioning key. For more information on this, look up the ALTER/CREATE TABLE in the documentation.
  • Data types: XML, LOB, and LONG columns cannot be part of a partitioning key. In terms of efficiency, integer columns are preferred, then character, then decimal.
  • Avoid unbalanced distribution of rows across partitions by choosing partitioning key column(s) with high cardinality. If this can't be achieved, try to use column(s) with uniformly distributed values. After a table is populated you can check how many of its rows are in each partition, and how many of its rows map to each entry in the partitioning map (for possible redistribution purposes), by running queries like those in the DBPARTITIONNUM and HASHEDVALUE examples. Unless you have extremely large tables, differences of a few percent in cardinalities per partition can be ignored.
  • Choose partitioning keys with one or a few columns that are used frequently in joins.
  • Partitioning keys should not include columns that are updated frequently. Whenever a partitioning key value is updated, DB2 may need to drop the row and re-insert it into a different partition, as determined by hashing the new partitioning key value.
  • Unless a table is unimportant or you have no idea what a good partitioning key choice would be, you should not let the partitioning key be chosen by default. The default partitioning key is the first column of the primary key, and if there is none, the first column that has an eligible data type.
  • Make sure you understand collocation and the different table join methods; see Administration Guide: Performance, topics "Join strategies in partitioned databases" and "Join methods in partitioned databases". Later on, in the Performance test, section, we'll examine queries in terms of their access plans and collocation (or lack thereof).
  • Here are the requirements for collocation. Collocated tables must:
    • Be in the same database partition group, one that is not being redistributed. (During redistribution, tables in the database partition group may be using different partitioning maps; they are not collocated.)
    • Have partitioning keys with the same number of columns.
    • Have the corresponding columns of the partitioning key be partition compatible.
    • If not in the same database partition group, be in a single partition database partition group defined on the same partition.
  • Partition compatibility is defined between the base data types of corresponding columns of partitioning keys. Partition-compatible data types have the property that two variables, one of each type, with the same value, are mapped to the same partitioning map index by the same partitioning function. Partition compatibility has the following characteristics:
    • Internal formats are used for DATE, TIME, and TIMESTAMP. They are not compatible with each other, and none is compatible with CHAR or VARCHAR.
    • Partition compatibility is not affected by columns with NOT NULL or FOR BIT DATA definitions.
    • NULL values of compatible data types are treated identically. Different results might be produced for NULL values of non-compatible data types.
    • The base data type of a User-defined type (UDT) is used to analyze partition compatibility.
    • Decimals of the same value in the partitioning key are treated identically, even if their scale and precision differs.
    • Trailing blanks in character strings (CHAR, VARCHAR, GRAPHIC or VARGRAPHIC) are ignored by the system-provided hashing function.
    • CHAR or VARCHAR of different lengths are compatible data types.
    • REAL or DOUBLE values that are equal are treated identically even though their precision differs.

Tailor the application to a partitioned environment

The use of partitioning is almost completely transparent to a DB2 application. Once you've set up the tables, your existing application programs should run without any changes.

Let's be careful, however, to distinguish application programs, consisting primarily of SQL statements, from the set of configuration and maintenance scripts and possibly programs issuing DB2 API calls, that you probably use with your database(s). You will almost certainly need to adapt those scripts/programs to the partitioned environment. The most obvious requirement for those scripts is to change command invocations and some SQL statements (which obviously apply only to one partition in a nonpartitioned environment), to apply to all or specific partitions in your partitioned environment. In most cases where such a change is required, you'll need to simply take the same command and invoke it through db2_all, causing it to be run on all partitions, as discussed in Miscellaneous tips. In other cases the changes will involve making copies of the command or statement and adding the relevant clause, often "ON DBPARTITIONNUM", to have each copy of the command or statement invoked differently on each partition. In yet other cases you will want to enhance your scripts to refer to items specific to partitioning, such as some of the catalog view columns (e.g., partition key information) listed under heading "Catalog views" in Appendix A: DB2 features applicable to DPF. In the case of API calls you may need to specify a parameter indicating the partition number(s), such as the iNodeNumber parameter for the db2GetSnapshot API.

We suggest that you review the material in Appendix A: DB2 features applicable to DPF, analyze your scripts and API calls in terms of that material, make appropriate changes, and test them out.

Getting back to application programming, there are a few issues you'll definitely want to deal with, particularly in the areas of performance and troubleshooting. Here's a summary of those key issues:

  • FOR FETCH ONLY: It is important to use this clause in SELECT statements that will not be followed by an update or delete of the row(s). The clause causes the blocking of rows when they are sent to the coordinator partition, and may significantly improve performance.
  • Directed distributed subsection (DSS): As previously mentioned, in an OLTP environment you should strive to have the processing done on as few partitions as possible. With DSS, query subsections are sent to only the appropriate partitions, and the idea is that you can rewrite your queries to make DSS happen. An example is changing a SELECT with a two-element IN list into a UNION of two SELECTs, each with an "=" predicate. Here each of the SELECTs in the UNION will be directed to one specific partition, instead of the original statement being sent to all partitions. This type of change can be appropriate where it's critical to minimize response times, but we suggest doing performance testing before applying it globally to your application. You will need to run Explain against the query (and the rewritten version)
  • Local bypass: This is a special case of distributed DSS, where all work is done on the coordinator partition. See Local bypass.
  • Buffered inserts: This technique can be used to avoid per-row communication overhead when inserting rows. See How to optimize inserts.
  • Extracting large volumes of data: Queries that return large answer sets can be slowed down by the communication of all rows back to the coordinator partition. One approach to avoid this is to run a command like the one in Usage of db2_all, in which the DBPARTITIONNUM function is used to select rows for only the current partition. Those rows can be written either to separate files that are each local to their partition, or to one combined file (e.g., in NFS). These techniques have been implemented in the db2batch command, which you can use for executing large queries in a batch environment instead of using the command line processor (db2 command); some information is provided under Commands.
  • Creating a Simulated Partitioned Database Environment: The suggestion here is to prepare for partitioning by creating tables with partitioning keys, and testing against them, without setting up a partitioned database environment. This can be a useful technique but will only help identify a fairly small subset of the issues that would be faced in full testing.
  • Troubleshooting: Some of the key points are:
    • If a partition encounters a severe error, transactions can be left in indoubt state and would need to be manually resolved.
    • Individual partitions may need to be restarted to resolve inconsistencies after severe errors.
    • An application can receive only one SQLCA. The values in it are merged from the SQLCAs from across all of the relevant partitions, according to the rules listed in the manual.

Here are two issues with narrower applicability:

  • When a failure is encountered in a partitioned database environment, all declared temporary tables that exist on the failed database partition become unusable. Any subsequent access to those unusable declared temporary tables returns an error (SQL1477N). When your application encounters an unusable declared temporary table, the application can either drop the table or recreate the table by specifying the WITH REPLACE clause in the DECLARE GLOBAL TEMPORARY TABLE statement.
  • From the topic "Scratchpads for UDFs and Methods" in the book Developing SQL and External Routines: The scratchpad only applies to a single DB2 agent (an agent is a DB2 entity that performs processing of all aspects of a statement). There is no ″global scratchpad″ to coordinate the sharing of scratchpad information between the agents.In a multiple partition database, where a statement referencing a UDF is processing data on multiple partitions, and invoking the UDF on each partition, the scratchpad would only apply to a single partition. As a result, there is a scratchpad on each partition where the UDF is executed. If you run UDFs that use scratchpads, by default there is a separate scratchpad per partition (i.e., per agent), and no sharing of scratchpad data through a "global scratchpad". If the application depends on there being a single scratchpad, use DISALLOW PARALLEL. This will cause the UDF to run on a single partition, but may have a performance penalty.

One other goal for partitioned applications is to avoid having rows retrieved back to the coordinator partition and then processed, if they can instead be processed within each partition. For example, an SQL stored procedure might select a row, and if it finds it, update it; otherwise, insert the row. This would be quite slow for a partitioned database. The faster alternative in this case would to use the MERGE statement, which could usually be processed in parallel on each partition.

Perform registry, DBM, and DB configuration

There is one set of registry variables and DBM configuration parameters for the entire partitioned instance, but each partition has its own set of DB configuration parameters for each database.

Most of the registry variables and configuration parameters should be set using the exact same criteria as for a nonpartitioned environment. The others are either specific to, or have somewhat different roles in, a partitioned environment, and those are discussed in Registry and environment variables and Configuration parameters. If there are multiple partitions on a given node you will need to size the buffer pools and database manager and database configuration parameters as appropriate to share the node's memory. (Each partition essentially runs as a separate copy of the instance, and there could also be multiple databases within each partition's "copy" of the instance.)

It is usually desirable to configure each partition identically. An exception is if you choose to use dedicated catalog or coordinator partitions (as discussed in Choose system and number of partitions). In that case you will usually want to save memory by using small buffer pools, since neither type of partition will need to do much data page I/O.

The variables of main interest for a partitioned environment are DB2_VI_ENABLE, DB2_FORCE_FCM_BP, and FCM_NUM_BUFFERS for performance, and DB2_PARTITIONEDLOAD_DEFAULT for functionality. The latter is discussed in Populate tables.

Populate tables

Populating tables with data is a key task for most databases. In both nonpartitioned and partitioned environments the LOAD command (or the db2Load API) is by far the fastest approach, but it has some limitations that are not faced by INSERT statements or the IMPORT utility. All of the approaches have special considerations when used against a partitioned database. From an end user point of view, INSERTs and IMPORT are essentially oblivious to the fact that DPF is in use, but there are important performance considerations for them: see How to optimize inserts. Now, let's continue talking about LOAD.

LOAD in a partitioned environment is discussed extensively (for about 25 pages) in "Chapter 4. Loading Data in a Partitioned Database Environment" of the Data Movement Utilities Guide and Reference. Some of the key points are:

  • As for INSERTs and IMPORT, rows being loaded must be directed to the proper partition by applying the hashing algorithm. To maximize the speed and flexibility of LOAD, hashing (partitioning) of rows is done in a phase that's separate from the actual loading of rows.
  • The LOAD command has the PARTITIONED DB CONFIG parameter for specifying partitioning options. The most important of them, MODE, lets you tell LOAD which phases you want done. The most common and default option, PARTITION_AND_LOAD, means to both hash the rows and load them. Of the other options, PARTITION_ONLY and LOAD_ONLY can be used to separate those two phases, thus creating and using, respectively, a separate file with each partition's rows. (There are other MODE options as well, which we suggest you read about if LOAD is important to you.)
  • Other options, PARTITIONING_DBPARTNUMS and OUTPUT_DBPARTNUMS, respectively, let you choose which partitions to be used to perform the partitioning phase and load into. These can allow performance to be optimized by dispersing the CPU activity across nodes, as well as increasing concurrency by having different LOAD commands load to different partitions concurrently. One thing to keep in mind is that having partitioning done on a separate partition from the loading partition will require the rows to be sent between them, which will reduce the performance gains achieved by maximizing CPU usage. You are advised to experiment with various alternatives to determine the best approach for your environment.
  • The output of partitioning phases will show how many rows were assigned to each partition, indicating if there's data skew among partitions.
  • DB2 V7 differs from DB2 V9/V8 in how loading is done in a partitioned environment. In a nutshell: In V7 each partitioned file contains a header which identifies which partition the file belongs to. That file must later be loaded to the proper partition, or an error will occur. When loading to a nonpartitioned table in a partitioned database, MODIFIED BY NOHEADER was necessary to prevent an error due to the absence of the header. In DB2 Universal database V8 or later, the partitioned file headers are not required, and by default, LOAD will be attempted on all partitions, even if the PARTITIONED DB CONFIG parameter is not specified. The DB2_PARTITIONEDLOAD_DEFAULT registry variable lets you change the default behavior of LOAD from DB2 V8 or later to pre-V8 mode, which could let you run pre-V8 scripts against DB2 V8 or later with no modifications.
  • A sample LOAD command is given in Listing 7. Sample LOAD command.

How to optimize inserts

As you should know by now, in a DPF environment every row being inserted must first have its partitioning key mapped to a partition number. Then the row must be sent from the coordinator partition to that other partition, and this transmission adds significantly to the time for each insert. There are two main approaches to reducing this overhead:

  1. Buffered inserts: In this technique, rows are buffered before being sent to the target partition, which substantially reduces communication overhead. To activate buffered inserts, PREP or BIND the application with the INSERT BUF option; in the case of IMPORT, you would PREP or BIND the import package, db2uimpm.bnd (in sqllib/bnd). You should use buffered inserts primarily when inserting a large number of rows with no intervening activity. The reason for this is that the DB2 server can only report errors for an entire block of rows at once, and so there are a number of special considerations to be dealt with in abnormal situations. For information on buffered inserts with IMPORT, see the Data Movement Utilities Guide and Reference.
  2. Another possibility, which is not limited to inserts, is described in Local bypass.

Local bypass

Local bypass is the term used to describe DB2 being able to satisfy an entire request within the coordinator partition, which is ideal for optimal performance for OLTP and some "batch" workloads.

There are three requirements for local bypass to occur:

  1. The proper coordinator partition for the statement(s) must be identified. The main step here is to use the sqlugrpn API to convert a partitioning key value to a partition number. Prior to that, the sqlugtpi API can be used to obtain the partitioning key information for a table, which is input to sqlugrpn. Another way to do something similar is to use the LOAD command in "partition_only" mode to create files that are split by partition, according to hashing algorithm.
  2. The application must be connected to the partition where the work is to be done, as indicated by the partition number from sqlugrpn. This can be done either by explicitly establishing a new connection or, preferably, by having the client pre-establish and maintain a set of connections, one (or more) per partition, and picking the correct one for each new transaction. There are various mechanisms to connect to a particular partition:
    • the SET CLIENT CONNECT_DBPARTITIONNUM command; for an example, see Listing 3. Connecting or attaching to a specific partition
    • in CLI/ODBC, the SQL_ATTR_CONNECT_NODE connection attribute or the CONNECTNODE configuration keyword (set to the partition number, or SQL_CONN_CATALOG_NODE for the catalog partition)
    • for Java, the Connection Router referenced in Resources
  3. The SQL statement(s) must be compatible with all work being done in a single partition. This requires both the statements themselves to be appropriate (e.g., "select * from partitioned_table" can never be executed via local bypass), and the partitioning key values of the rows involved in all statements to map to the same partition; i.e., there must be collocation. For example, "select * from t1 where c1=?" and "select * from t2 where c2=?" would only be executed against the same partition if the query "select * from t1, t2 where t1.c1=? and t1.c1=t2.c2" were collocated.
    Another example is "insert into t1 select * from t2": if t1 and t2 have the same partitioning key the statement will be executed locally and much faster than if they have different partitioning keys.
    Watch out for implicit processing that may take place on different partitions. For example, if an INSERT statement is executed on partition 1, it may cause a trigger to be invoked, or foreign key constraint checking to be done, on one or more other partitions. In both of these cases you should try to design your partitioning keys such that related tables are collocated, but given the multiplicity of possible interactions between tables it may be difficult for 100% local bypass to occur without simplifying the application. Even if 100% of the activity can't be done within the coordinator partition, you should try to have as much as possible done there.

Performance test

After, or ideally in parallel with, functional testing, you'll need to evaluate the performance of your partitioned application and database. Much of your performance testing in a partitioned environment should be carried out in the same way as for a nonpartitioned environment: run the application workload, gather throughput and response time indicators of performance as perceived by end users, as well as DB2 and operating system monitoring data, then respond appropriately if the results are inadequate.

Most of the ways to respond to poor performance are the same as for nonpartitioned databases, such as larger buffer pools, adequate sort heaps, proper indexes, and ensuring optimal access plans. It's primarily on the differences for partitioning that you'll want to focus after your migration. Those differences really boil down to (a) communication between partitions: how much is occurring, is it because of bad access plans, and how to fix it; and (b) how does your application scale as data and/or partitions are added. We'll look more at these in a minute, but first let's review which tools are available and how you should use them:

  • DB2 snapshots: Snapshots capture information at various levels: DBM, database, tablespace, table, buffer pool, and application. The counters are primarily useful for tuning things like buffer pool sizes and configuration parameters. They'll also reveal if there are any concurrency issues. Snapshots also show statement-level performance data in the Dynamic SQL Snapshot, and from scanning through this you can find which statements have been executed the most, or took the longest total time. Those statements are obviously the ones to focus on for further study. By default, a snapshot is taken only on the partition you are attached to, so it's strongly recommended that you use the GLOBAL option to get an aggregate view of the entire instance or database.
  • DB2 event monitoring: This can provide detailed deadlock information, but hopefully you don't encounter those anyway, so the main value of event monitoring is usually is in capturing the elapsed time and some performance statistics for each SQL statement. In the partitioned world there is added complexity in that statement-level monitoring can only be done within one partition per monitor, so you should monitor on at least the coordinator partition and ideally others as well, to get a full picture.
  • DB2 Explain: Through snapshots and event monitoring you should be able to find the worst performing statements, and then you can use Explain to study them. Remember the role of collocation, and the two other join types: directed and broadcast. See Administration Guide: Performance, topic "Join methods in partitioned databases" for examples of the different join methods.
  • db2batch: This is an excellent tool for running individual queries and capturing elapsed times and snapshot information. Suggested usage is
    "db2batch -d sample -f my_stmts.sql -i complete -o p 5"
    You should also consider the "-p" option for different output locations with regard to partitions. Also, the performance of queries can depend on which partition they are run from, so you can try running them through db2batch with various partitions as coordinator (use the SET CLIENT command).
  • Operating system tools: Obviously these are platform-dependent, but you'll want to use tools such as AIX's vmstat, iostat, and netstat for providing CPU, memory, disk utilization. For a partitioned environment it's extremely useful to have a monitor with windows displaying all or many partitions concurrently. This lets you can see when each partition is active and you can tell right away what amount of parallel activity is being done.
  • RUNSTATS: Don't forget to run it against all tables before doing any significant performance testing.

Inter-partition communication: With the above tools you can find and analyze the worst SQL statements. The key information source for studying the communication that occurs for those statements is the FCM usage portion of a DBM snapshot. Ideally you'd use db2batch to run individual queries and capture that snapshot information; by doing it this way you avoid being confused by activity from other work. What you're looking for in the snapshot are high numbers of buffers being sent between partitions, which would primarily be a reflection of directed or broadcast joins. You could take actions such as creating replicated tables or using different partitioning keys, then try again.
In addition to studying individual queries it's also interesting to accumulate the FCM usage over a long period of multi-user activity and see which partitions are doing the most work. For an example of a DBM snapshot (without the GLOBAL option), see Listing 6. DBM snapshot: FCM usage and partition status.

Scalability testing: This is a very important task, since improving performance or scalability is usually the reason why you're using partitioning at all. A suggested starting point for studying scalability is to implement your application on one, two, and four partitions and see how the performance differs. One decision to be made is whether to keep the total amount of data fixed as you add partitions, or to have same amount of data per partition. Both are valid things to study, and it depends on which is more likely to match your application's usage. Another decision to make is what hardware to use and where the partitions will be. Obviously, if you run against one partition and your application runs at close to 100% CPU utilization, adding three more partitions on the same node will more than likely cause a major CPU bottleneck and you won't get a true picture of scalability. The ideal situation, therefore, is to have four identical systems with the fastest possible communication between them. This implies that a proper scalability study can require a fairly substantial investment.

As you go through your scalability exercise, remember not to expect good scalability for all tasks; for example, as mentioned earlier, LOAD may in fact run slower with two partitions than one, because of the extra work that must be done.



Back to top


Operational and miscellaneous considerations

Backup and recovery

When your testing is complete and before you put your application into production, you should make sure you have a solid backup and recovery strategy in place. Guidance on this is provided in the Data Recovery and High Availability Guide and Reference, especially topics "Recovering from Transaction Failures in a Partitioned Database Environment", and "Recovering from the Failure of a Database Partition Server".

BACKUP and RESTORE are obviously very important commands and, like most commands, they operate against a single partition at a time. As mentioned earlier, both commands must completely process the catalog partition before starting on the other partitions. The best way to make them run as quickly as possible is with two commands such as the following:

	db2_all '<<+0<   db2 BACKUP DATABASE wsdb TO /dev3/backup'
	db2_all '||<<-0< db2 BACKUP DATABASE wsdb TO /dev3/backup'   

The first command runs against partition 0 only, as requested by "<<+0<".
The second command runs against all partitions except partition 0, as requested by "<<-0<", and in parallel on those partitions, thanks to the "||".

The catalog partition is particularly critical in a partitioned database, and you need to be very careful to do frequent backups and ensure that there is no single point of failure for tablespace and log disks.

High Availability (HA): HA is a term used to describe systems that (more or less) never go down on an unplanned basis. It is beyond the scope of this article to discuss HA in detail, but here are a few things to consider:

  • HA is similar in nonpartitioned and partitioned database environments, but with the latter there are usually multiple physical systems already in place, and these are typically used to have DB2 run in mutual takeover mode. For example, if you have node A with partitions 0 and 1, and node B with partitions 2 and 3, you would usually set up HA to have node A take over from node B if the latter goes down, and vice-versa. Each node would have twice the load and performance would suffer, but the full database would still be available.
  • DPF can run if one or more partitions is down (even without HA), but the data on the down partition(s) will not be available. You never have to worry about DB2 giving incorrect results, as it will return an error (such as SQL1229) if it cannot access a partition that is needed to satisfy a query. A SELECT against a catalog view can succeed even if only the catalog partition is up.
  • See the Data Recovery and High Availability Guide and Reference for more information on HA.

Ongoing maintenance

After a partitioned database is in production, most of the ongoing maintenance tasks for it are the same as for a nonpartitioned database. RUNSTATS, REORGCHK, REORG, BACKUP, and performance monitoring should all be performed, and with similar frequency as for nonpartitioned databases.

Fixpaks will need to be applied from time to time, and the difference in a partitioned environment is that they'll need to be installed on every node, not just one.

As tables grow over time, some of them may become increasingly unbalanced in terms of how many rows are on each partition. To detect this situation, we suggest that you occasionally run queries as in DBPARTITIONNUM example. If large tables have more than about 10% more rows on some partitions than others, you may want to redistribute their database partition groups.

For information on redistribution, see "Chapter 14. Data Redistributing" in the Performance Guide. There you'll see details on what happens during redistribution. The key points are that while each table is being redistributed it is locked exclusively, rows are deleted and inserted (involving a communication of the row contents between partitions), logging of the deletes and inserts occurs, and a COMMIT is done when the table is completed. In other words, there is substantial overhead, and therefore significant elapsed time, to do a redistribution. You must be sure to have enough logging space, both in the partitions losing rows, and the partition(s) gaining rows; guidance is provided in the chapter just cited.

As tables continue to grow, the load may outgrow the resources and performance may become inadequate. One solution is to add additional systems to the partitioned instance (in the same manner as when initially creating an instance), create new partitions on those systems, alter the existing database partition groups as appropriate to reflect the new partitions, and redistribute. Details on this process are in the book, Administration Guide: Implementation. This can take substantial time, particularly for the redistribute, as just discussed. The overhead of redistribution varies with the number of rows redistributed (obviously), which in turn is partially dependent on the percentage of partitions added, so going from 8 to 12 partitions (+50%) will take much longer than going from 24 to 28 (+17%), all else being equal. We suggest you try a few redistributions with small amounts of data, to see how long they will take in your environment.

An alternative approach to adding partitions and doing redistribution is to add additional hardware, particularly CPUs, to the existing configuration, and do not add partitions. This approach avoids redistribution, but can't be done repeatedly. Another alternative to redistribution is to use EXPORT or the High Performance Unload tool to empty table contents to flat files, drop the table(s), change the database partition groups, recreate the table(s), and reload. In a data warehouse environment where data might be routinely reloaded from scratch, you may not have to unload the data first.

Two other situations you may be faced with are having more partitions than necessary, or wanting to consolidate partitions because of an upgrade to more powerful hardware. The chapter just mentioned ("Scaling your configuration") also covers removing partitions. Basically, you have to remove the data from the partition(s) first, which can be done with the same approaches (redistribution or unload/reload) as for adding partitions. An additional consideration for consolidating partitions is that if you're using a large RS/6000 system with many LPARs, it's better to put multiple partitions into one LPAR rather than spreading them across LPARs. The advantage of this is that you can use the DB2_FORCE_FCM_BP registry variable to have communication between partitions in the same LPAR go through shared memory, instead of the UNIX sockets which are necessary between LPARs, thus improving performance. Another thing to think about in a consolidation situation is that you may be able to simply re-cable disks and have your database move to the new hardware without either redistribution or unload/reload.

Miscellaneous tips and topics

  • Once you have a partitioned instance, running the db2sampl command to create the sample database will give you a nice little environment for experimentation. All user tables will be partitioned, and their partitioning keys will be the defaults.
  • A quick way to determine the catalog partition for a database is to run LIST DB DIRECTORY on the server; see example below.
  • To see which partitions are active, take a DBM snapshot. For an example, see Listing 6. DBM snapshot: FCM usage and partition status.
  • Two extremely important commands for use in a DPF environment are db2_all and rah. db2_all allows you to execute a command against all partitions in an instance, while rah does the same but for all systems in the instance. (If each partition is on its own system, the two commands have the same effect.) They also allow you to run commands on subsets of partitions/systems, and on multiple partitions/systems concurrently. These are handy not only for running DB2 commands, but also operating system commands to do such tasks as making directories (mkdir command) and looking at information for, say, log files (ls or dir). For more information on these commands, search the Information Center or see Appendix C of the Adminstration Guide: Implementation.
  • Be sure you understand the scope of DB2 commands and SQL statements. In general, command invocations apply to only one partition; check this by looking under the heading "Scope:" in the discussion of the command in the Command Reference. On the other hand, SQL statements generally affect all partitions, with a few exceptions such as CREATE BUFFERPOOL and CREATE TABLESPACE, which can optionally be run against a subset of partitions.
  • Two very important examples of needing to know the scope of commands are:
    • The UPDATE DB CFG command applies only to one partition at a time, and failure to understand (or remember) this has caused many DBAs a lot of grief. For example, to set the value of SORTHEAP on every partition you must use db2_all, as in:
      "db2_all db2 update db cfg for sample using SORTHEAP 10000".
    • The db2empfa command, used to improve performance by activating multi-page file allocation for SMS tablespaces, applies only to the partition on which it is executed, so it should be run as in:
      "db2_all db2empfa sample".
  • The LIST APPLICATIONS command gives single-partition output by default, which can be confusing. Use the "global" option.
  • The partition number appears in various places. Some examples are:
    • The output of "ps -ef" on AIX: if partitions 0 and 1 are on the node, the output will show "db2sysc 0" and "db2sysc 1" for the system controller processes for the partitions, and lines with contents like "db2dlock (SAMPLE) 0", which in this case identifies the deadlock detector process for the SAMPLE database on partition 0.
    • db2start and db2stop will have a line of output for each partition, with the column after the timestamp indicating the partition number.
    • In db2diag.log and notify file entries, the partition number is reported like "Node:002".
  • It is not desirable to have DB2 write a high volume of diagnostic output, such as to db2diag.log, but if this is happening, you should consider setting the DIAGPATH parameter (DBM) to point to a location that is local to each partition. This will avoid the overhead of network traffic to write the messages, but forces you to look at multiple files to get the complete picture. (On the other hand, it is easier to understand the messages when they pertain to only one partition per file, and different partitions will often have the same messages).
  • Use the SET CLIENT command or set the DB2NODE environment variable to use a specific partition; see Commands or Registry and environment variables, respectively.
  • The most common source of stability issues with DPF is in communication problems between partitions. Strive to have the most robust communication configuration you can.
  • RUNSTATS: In a partitioned database, statistics are gathered only for the partition the command is run on, and extrapolated (based on an assumption of uniform distribution of rows across partitions) to reflect the full database. This means that, for example, the CARD column in SYSCAT.TABLES will in general not contain the exact number of rows in the table. As always, it is good to have uniform distribution.
  • The DB2 Integrated Cluster Environment (DB2 ICE) for Linux "is a completely integrated, high-performance and pre-tested solution that incorporates best-of-breed software, hardware and services. It provides you with a high performance, reliable data management system that can scale from 2 to 1,000 nodes." It is based on DB2 partitioning through DPF. For more information, see the link in Resources.



Back to top


Conclusion

In this article we started by reviewing what partitioning is, then discussed the benefits of partitioned databases and how you can decide whether to migrate your application to a partitioned environment. We then covered the steps to go through in doing such a migration. A variety of other information is also presented to help you understand the unique aspects of partitioning. We hope that this article has been helpful, and wish you good luck in your adventures with DB2 partitioning.

Acknowledgements

The authors would like to thank Greg Holton for reviewing this article and making several excellent suggestions.



Back to top


Appendix A: DB2 features applicable to DPF

In this appendix we list the various SQL statements, functions, commands, and other features of DB2 that are either specific to DPF or have particular pertinence to it.

SQL statements

Why must partitioning keys be a subset of the unique index columns?
Are you wondering why this restriction exists? Suppose it didn't exist, and table T1 has partitioning key (C1,C2) and a unique index on C1 (which isn't actually allowed, because it's not a superset of partitioning key (C1, C2)). If a row with (C1,C2)=(1,100) is in partition 1 and a new row is inserted with (C1,C2)=(1,500), hashing its partitioning key might result in it being assigned to partition 2. If the INSERT were allowed to complete, there would be two rows with a value of 1 for C1, which would be a duplicate key violation. The problem is that DB2 could only prevent that violation by scanning every other partition (besides partition 2) for the same key, which would add high overhead to every INSERT. So, the restriction is in place as a low overhead means of ensuring uniqueness.
  • ALTER/CREATE BUFFERPOOL -- ALTER allows you to change the size of a buffer pool in one or all partitions, or add a buffer pool to a new database partition group.
  • ALTER/CREATE DATABASE PARTITION GROUP -- ALTER allows you to add or drop partitions to or from a database partition group.
  • ALTER/CREATE TABLE -- These considerations apply to a DPF environment:
    • With ALTER you can add or drop a partitioning key, but only for a nonpartitioned table (this limitation is a reflection of the fact that a partitioned table must always have a partitioning key). Defining a partitioning key for a nonpartitioned table is allowed, which can enable you to have one version of a table definition for both partitioned and nonpartitioned environments.
    • You should use the PARTITIONING KEY clause to specify the partitioning key for a partitioned table, instead of letting the default choice be used. You can also define a partitioning key for a nonpartitioned table for the reasons just mentioned.
    • The set of columns in any unique or primary key constraint defined on the table must be a superset of the partitioning key, if there is one. For example, if the partitioning key for table T1 contains columns C1 and C2, any unique or primary key constraint (or unique index) on T1 must include C1 and C2. Putting this another way, if you have an existing primary key and/or unique index(es) on a table, your choice of partitioning key is limited to columns that are in the primary key and all unique indexes. For more information, see the sidebar.
    • An IDENTITY column can be part of a partitioning key. Also, see CREATE SEQUENCE (below) for how IDENTITY or SEQUENCE values are handled in a DPF environment.
    • With the REPLICATED keyword you can create a table as replicated (as defined above under DPF concepts and terminology).
    • In CREATE, if you use the INDEX IN or LONG IN clause, the index or long tablespace must be in the same database partition group as the base table.
  • ALTER TABLESPACE -- You can use this statement to add a container to an SMS tablespace in a partition where the tablespace currently has no containers. You can also change a DMS tablespace in the same ways as for a nonpartitioned database, but on one or more specific partitions.
  • CONNECT TO <DATABASE> IN EXCLUSIVE MODE ON SINGLE DBPARTITIONNUM - You can use "IN EXCLUSIVE MODE", as for a nonpartitioned database, to restrict access to the database to users with the same authorization ID, but by adding the optional "ON SINGLE DBPARTITIONNUM" clause you cause only the coordinator partition for this connection to be restricted in this way.
  • CREATE EVENT MONITOR -- Using the "ON DBPARTITIONNUM <partition number>" clause causes the monitor to run on and have its output written to the specified partition. If LOCAL is specified or defaulted to, only the running partition's events are recorded. If GLOBAL is specified, events on all partitions are recorded, but this is only supported for deadlocks. If WRITE TO TABLE is specified, all partitions' work is recorded in the table.
  • CREATE FUNCTION -- Because each function typically needs to run on every partition, the body of every external function should be in a directory that is available on every partition of the database.
  • CREATE INDEX -- The relationship (described under ALTER TABLE) between unique or primary key constraints and partitioning keys applies.
  • CREATE PROCEDURE -- As for functions, the body of every external stored procedure should be in a directory that is mounted and available on every partition of the database.
  • CREATE SEQUENCE -- SEQUENCE and IDENTITY values are cached locally on each partition, and when the cache is exhausted (default size = 20), a new set of values is obtained from the catalog partition. Larger caches can reduce this overhead but tend to enlarge the number of unused values.
  • CREATE TABLESPACE -- You need to specify the database partition group to create the tablespace in, unless you want it to default to IBMDEFAULTGROUP (or IBMTEMPGROUP for system temporary tables). Containers can be defined for a single partition or a set of partitions at a time, and to simplify tedious statement writing you can use a database partition expression: " $N" gets replaced by the partition number on every partition that the CREATE TABLESPACE statement is executed on.
  • DECLARE GLOBAL TEMPORARY TABLE -- These temporary tables can be partitioned.
  • INSERT -- Substantial performance improvements can be obtained by using buffered inserts for high volumes; see Buffered Inserts.

SQL functions

  • DBPARTITIONNUM (renamed from "NODENUMBER" in V8, but the old name is still recognized) -- Use this function to obtain the partition number that a row is in. The name of a column in the table must be specified, but the output is the same regardless of which column is named. For sample usage, see DBPARTITIONNUM example.
  • HASHEDVALUE (renamed from "PARTITION" in V8) -- Use this function to obtain the partition map index (0-4095) that a row hashes to. The name of a column in the table must be specified, but the output is the same regardless of which column is named. For sample usage, see HASHEDVALUE example.
  • GENERATE_UNIQUE -- This generates a character string which is guaranteed to be unique, even across partitions and for all rows of a multi-row INSERT.
  • SNAPSHOT_XXXX functions -- Most of these functions (for example, SNAPSHOT_DATABASE) have an argument that allows you to specify the partition number that you want the function executed against. Most of them default to the current partition, but can optionally be run against all partitions.

Special register

  • CURRENT DBPARTITIONNUM -- This returns an INTEGER value that identifies the coordinator node number for the statement.

SQLCA

  • The SQLCA provides a merged view of the activity for a given statement on all partitions. For all errors and warnings, the sqlwarn field contains the warning flags received from all agents. Values in the sqlerrd fields indicating row counts are accumulations from all agents.
  • One particular field of interest is sqlerrd(6). For a partitioned database, it contains the partition number of the partition that encountered the error or warning.
  • For more information, see topic "SQLCA usage in partitioned database systems" in the SQL Reference, Volume 1.

Catalog views

  • SYSCAT.BUFFERPOOLDBPARTITIONS -- By default a buffer pool has the same number of pages on each partition, and that number is contained in SYSCAT.BUFFERPOOLS, in column NPAGES. For each partition with a non-default buffer pool size there's a row in SYSCAT.BUFFERPOOLDBPARTITIONS.
  • SYSCAT.COLUMNS -- Column PARTKEYSEQ contains the column's numerical position within the table's partitioning key, (null or 0 if the column is not part of the partitioning key).
  • SYSCAT.DBPARTITIONGROUPDEF -- Contains a row with the number and status of each partition that is contained in a database partition group.
  • SYSCAT.DBPARTITIONGROUPS -- Contains a row for each database partition group, including a pointer to its partitioning map.
  • SYSCAT.EVENTMONITORS -- Column DBPARTITIONNUM indicates the partition that the monitor runs against and logs events for.
  • SYSCAT.PARTITIONMAPS -- Contains a row for each partitioning map.
  • SYSCAT.TABLES -- Column PMAP_ID holds the ID of the partitioning map used for the table. Column PARTITION_MODE contains "H" if the table is partitioned, "R" if it's replicated.
  • SYSCAT.TABLESPACES -- Column DBPGNAME contains the name of the database partitioning group for the tablespace.

Commands

Not included in the following list of DB2 system and CLP commands are many commands that apply only to a single partition per invocation. Typically those commands fall into one of these categories: (a) They apply only to the partition that the application is attached or connected to. For example, GET DATABASE CONFIGURATION or LIST TABLESPACES. (b) They have a DBPARTITIONNUM option to specify the partition number to run the command against. For example, LIST APPLICATIONS AT DBPARTITIONNUM 4.

  • db2_all, rah - See the discussion on these above, under Miscellaneous tips.
  • db2batch - Benchmark tool. Various values can be used with the "-p" option to request different optimization techniques for SELECT processing for large answer sets, such as writing the output to a local file on each partition.
  • db2icrt - Create an instance. There are several special options for DPF on Windows.
  • db2iupdt - On Windows, this allows you to to convert an instance from single-partition to multi-partition.
  • db2nchg - Change partition server config (Windows only).
  • db2ncrt - Add DB partition server (Windows only).
  • db2ndrop - Drop partition server (Windows only).
  • db2start ... add dbpartitionnum - Start DB2 and add a new partition to an existing instance and database(s).
  • DROP DBPARTITIONNUM VERIFY - Verify that a partition is unused before dropping it.
  • ADD DBPARTITIONNUM - Creates a new empty partition in each database in the instance.
  • LIST DATABASE PARTITION GROUPS - Lists the database partition groups in the database the application is connected to.
  • LIST DBPARTITIONNUMS - Lists the partition numbers in the database the application is connected to.
  • LOAD - Load data into a table, either on one partition or any subset of the table's partitions. See Populate tables.
  • RESTRIBUTE DATABASE PARTITION GROUP - Move rows between the partitions of a database partition group. Most commonly used to evenly distribute the rows after adding partition(s). Can also be used to move rows out of a partition before dropping it.
  • SET CLIENT - Use this to set the partition to be used for a subsequent CONNECT or ATTACH; can supply a specific partition number, or CATALOG_DBPARTITIONNUM for the catalog partition.
  • QUERY CLIENT - Displays the current value of CONNECT_DBPARTITIONNUM and ATTACH_DBPARTITIONNUM, if these were set by SET CLIENT or by setting DB2NODE.

Administrative APIs

Essentially, the APIs provide the same functionality and options as the corresponding DB2 commands. Look for parameters that allow a partition number to be supplied. The APIs of interest specifically for DPF are:

  • sqleaddn - Add Node
  • sqledrpn - Drop Node Verify
  • sqludrdt - Redistribute Database Partition Group
  • sqlugrpn - Get Row Partitioning Number
  • sqlugtpi - Get Table Partitioning Information

Registry and environment variables

These variables are unique to a DPF environment. All except DB2NODE are registry variables.

  • DB2NODE (environment variable) -- Set this before CONNECT or ATTACH to connect or attach to a specific partition.
  • DB2_VI_ENABLE - On Windows (only), enables the Virtual Interface (VI) Architecture communication protocol for faster communication between partitions.
  • DB2_FORCE_FCM_BP - On AIX (only), causes faster communication between partitions on the same physical system by using shared memory rather than UNIX sockets.
  • DB2_PARTITIONEDLOAD_DEFAULT - Allows you to change the default behavior for LOAD (load on multiple partitions or not), when no DPF-specific parameters are used in the command.
  • DB2_SORT_AFTER_TQ - Influences the optimizer on whether to sort rows at the sending or receiving end when transmitting them between partitions.

Configuration parameters

These variables (DBM or DB, as shown) have special significance in a DPF environment. Remember that each partition has its own set of database configuration parameter settings and these should normally be set to the same value on every partition.

  • CONN_ELAPSE (DBM), MAX_CONNRETRIES (DBM) - When one partition tries to connect to another within the same database, DB2 will wait CONN_ELAPSE seconds before giving up and trying again. It will try this MAX_CONNRETRIES times before quitting and issuing an error.
  • FCM_NUM_BUFFERS (DBM) - This specifies the number of 4KB buffers that are us