Effectively use DB2 data movement utilities in a data warehouse environment

Easy ways to refresh your non-production data warehouse from production

Choosing the proper data movement utilities and methodologies is key to efficiently moving data between systems in a large data warehouse environment. To help with your data movement tasks, this article examines the pros and cons of each method in an IBM® InfoSphere® Warehouse environment and includes a comparative study of the methods, using actual DB2® code for the data movement.


Mohankumar Saraswatipura (mohankumarsp@gmail.com), DB2 Database Administrator, Reckitt Benckiser plc

Mohan works as a lead DB2 Database Administrator at Reckitt Benckiser plc, focusing on balanced warehouse business intelligence solutions, application performance tuning, and DB2 implementation and migration. While with the IBM Software Lab, India, he worked with the High Performance On Demand Solutions (HiPODS) team to provide performance solutions to clients. Earlier, he worked with the DB2 product development team in India. Mohan is an IBM certified DB2 Advanced Database Administrator, DB2 Application Developer, and DB2 Problem Determination Master. He completed his M Tech (Master of Technology) in computer science in 2004 and an Executive MBA from IIMC in 2007.



17 November 2011


Introduction

The data movement utilities in DB2 are simple to use in a single-partition database environment. However, there are extra considerations in a large data warehouse environment with terabytes of STAGING, STORE, and DATAMART tables running under the Database Partitioning Feature (DPF) of InfoSphere Warehouse. In this article, we look at the data movement options available in DB2 and the methods best suited to a data warehouse environment.

DB2 data movement choices

The table below lists the data movement options available in DB2 9.7, along with a description of how each is commonly used and an example.

Table 1. Data movement options available in DB2
Export
    Purpose: Export data from a database table to a file.
    Usage practice: Use this utility when you want to keep the table data in a file for future use or to refresh another environment from the current data.
    Example:
        EXPORT TO DATAMART.F_CUST_PROF.DEL OF DEL MESSAGES DATAMART.F_CUST_PROF.EXPORT.MSG
            SELECT * FROM DATAMART.F_CUST_PROF;

Load
    Purpose: Perform fast data inserts into an existing table in the database.
    Usage practice: This is the utility to use if your primary concern is insert performance. It inserts formatted pages into the database instead of inserting row by row, and you can choose not to log the activity in the transaction logs. Be aware, however, that this utility can fully exploit the system resources.
    Example:
        LOAD FROM DATAMART.F_CUST_PROF.DEL OF DEL SAVECOUNT 10000 MESSAGES DATAMART.F_CUST_PROF.LOAD.MSG
            INSERT INTO DATAMART.F_CUST_PROF;

Import
    Purpose: Insert data from a file into a table or view.
    Usage practice: This utility is useful when inserting data into a view, or into a table with constraints when you do not want the table placed in a SET INTEGRITY pending state. It is also useful if you have triggers that you want fired while the data is inserted. If you have an IXF-format data file and the table does not exist in the target environment, you can create the table and insert the data in a single command.
    Example:
        IMPORT FROM DATAMART.F_CUST_PROF.DEL OF DEL COMMITCOUNT 1000 MESSAGES DATAMART.F_CUST_PROF.IMPORT.MSG
            INSERT INTO DATAMART.F_CUST_PROF;

db2move
    Purpose: Copy tables from one environment to another at the schema level (multiple tables in general).
    Usage practice: When many tables need to be copied between environments based on their schema, this is easily achieved using db2move with the COPY option.
    Examples:
        Export all the tables in the STAGING schema:
            db2move SourceDB EXPORT -sn STAGING
        Export all the tables in the database:
            db2move SourceDB EXPORT
        Import all the table data into a target database:
            db2move TargetDB IMPORT
        Copy data from source to target:
            db2move SourceDB COPY -co TARGET_DB TargetDB USER <username> USING <password>
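
As noted for the import utility, an IXF-format file carries the table definition along with the data, so a single IMPORT command can create the table and populate it on the target. The following is a minimal sketch, assuming a hypothetical export file DATAMART.F_CUST_PROF.IXF; verify the options against your DB2 level before use.

    EXPORT TO DATAMART.F_CUST_PROF.IXF OF IXF MESSAGES DATAMART.F_CUST_PROF.EXPORT.MSG
        SELECT * FROM DATAMART.F_CUST_PROF;

    -- On the target, CREATE INTO builds the table from the definition stored in the IXF file
    IMPORT FROM DATAMART.F_CUST_PROF.IXF OF IXF COMMITCOUNT 1000
        MESSAGES DATAMART.F_CUST_PROF.IMPORT.MSG CREATE INTO DATAMART.F_CUST_PROF;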

Data movement challenges

Database administrators often find it difficult to copy a large volume of data from one database server to another over the network. Some of the major challenges in accomplishing this task are:

  1. The volume of data
    1. Terabytes of data
    2. Hundreds of tables
    3. Tables with hundreds of millions of records along with thousands of range partitions
  2. The need for faster data transfer and data reloads
  3. The requirement to evenly distribute the data across the database partition nodes

Now that you have an overview of the data movement options in DB2, let's take a detailed look at the techniques, along with the actual code used for the data movement.


DB2 warehouse data movement techniques

In this section, we will examine the techniques available to move a DATAMART from source to target database. We will evaluate the pros and cons of each method, with the goal of easing the work for database administrators during data transfer.

The examples were tested on an IBM Balanced Warehouse D5100 with the following configuration:

  • 11 physical nodes and 41 logical nodes
  • Each server has 4 CPUs, 2800 MHz
  • Each server has 32 GB RAM
  • SUSE Linux 10 2.6.16.60-0.66.1-smp
  • DB2 9.5 FP7
  • Database size: 6.8 TB
  • Major fact tables are range partitioned and the number of partitions in each table is about 21237

The data was copied from the production Balanced Warehouse to a user acceptance test (UAT) Balanced Warehouse with a similar infrastructure and the same DB2 version.

As a business requirement, the two UAT database schemas, DATASTORE and DATAMARTS, were refreshed from production.

As discussed earlier, there are many techniques available in DB2 to refresh large data sets from one environment to another. Those techniques are:

  1. Export data on local database server, transfer data file and load data locally on target database server (small tables with hash partitioning, or non-hash partitioned)
  2. Export data from local database server and remotely load data to target database server
  3. Export data from remote database server and locally load data to target database server
  4. Export data to an operating system pipe and import or load data from pipe to target remote database server
  5. Export data on local database server in parallel (each part residing in its respective partition file system), use a data file transfer, and then load the parts locally in parallel

The following sections examine each of these techniques, with examples.

Technique #1

Export data on local database server, transfer data file and load data locally on target database server (small tables with hash partitioning or non-hash partitioned)

Figure 1. Technique #1
diagram: export data from source database, copy file over network, then import/load at target database

Here are the steps to implement this technique.

  1. Connect to the SourceDB locally on the source database server.
     CONNECT TO SourceDB;
  2. Perform a DB2 export on table in source database server.
     EXPORT TO DATAMARTS.SCENARIO_CALENDAR.DEL OF DEL MESSAGES 
    DATAMARTS.SCENARIO_CALENDAR.MSG
     SELECT * FROM DATAMARTS.SCENARIO_CALENDAR;
  3. Compress the exported file to shorten the time required for file transfer between the servers.
     gzip DATAMARTS.SCENARIO_CALENDAR.DEL
  4. Transfer the compressed file from SourceDB server to TargetDB server using sftp or scp.
     cd <export file path> 
    sftp username@<targetDB Server hostname>
    put DATAMARTS.SCENARIO_CALENDAR.DEL.gz

    OR

    scp DATAMARTS.SCENARIO_CALENDAR.DEL.gz username@<targetDB Server hostname>:/<PATH>
  5. Uncompress the transferred file at the target database server.
     gunzip DATAMARTS.SCENARIO_CALENDAR.DEL.gz
  6. Connect to the TargetDB locally on the target database server.
     CONNECT TO TargetDB;
  7. Perform a load or import.
     LOAD FROM DATAMARTS.SCENARIO_CALENDAR.DEL OF DEL SAVECOUNT 10000 MESSAGES
             DATAMARTS.SCENARIO_CALENDAR.LOAD.MSG INSERT INTO 
    DATAMARTS.SCENARIO_CALENDAR;
  8. If you used the LOAD command, perform a SET INTEGRITY at the end of the operation.
      SET INTEGRITY FOR DATAMARTS.SCENARIO_CALENDAR IMMEDIATE CHECKED;
  9. Perform a RUNSTATS to keep the statistics up to date.
    RUNSTATS ON TABLE DATAMARTS.SCENARIO_CALENDAR WITH DISTRIBUTION AND DETAILED INDEXES
    ALL;
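
The individual steps above can be combined into a single script for repeated refreshes. The following is a minimal shell sketch of the same flow, assuming passwordless scp between the servers; host names, paths, and credentials are placeholders to adjust for your environment.

    # --- On the source database server ---
    db2 CONNECT TO SourceDB
    db2 "EXPORT TO DATAMARTS.SCENARIO_CALENDAR.DEL OF DEL MESSAGES DATAMARTS.SCENARIO_CALENDAR.MSG \
         SELECT * FROM DATAMARTS.SCENARIO_CALENDAR"
    db2 CONNECT RESET
    gzip DATAMARTS.SCENARIO_CALENDAR.DEL
    scp DATAMARTS.SCENARIO_CALENDAR.DEL.gz username@TargetDBServer:/path/to/exports/

    # --- On the target database server ---
    cd /path/to/exports
    gunzip DATAMARTS.SCENARIO_CALENDAR.DEL.gz
    db2 CONNECT TO TargetDB
    db2 "LOAD FROM DATAMARTS.SCENARIO_CALENDAR.DEL OF DEL SAVECOUNT 10000 \
         MESSAGES DATAMARTS.SCENARIO_CALENDAR.LOAD.MSG INSERT INTO DATAMARTS.SCENARIO_CALENDAR"
    db2 "SET INTEGRITY FOR DATAMARTS.SCENARIO_CALENDAR IMMEDIATE CHECKED"
    db2 "RUNSTATS ON TABLE DATAMARTS.SCENARIO_CALENDAR WITH DISTRIBUTION AND DETAILED INDEXES ALL"
    db2 CONNECT RESET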

Technique #2

Export data from local database server and remotely load data to target database server.

Figure 2. Technique #2
diagram: export data from source database, and remotely load to target database

Follow these steps to implement this technique.

  1. Catalog the target database in the source database server.
    CATALOG TCPIP NODE TargetND REMOTE TargetDBServer.ibm.com SERVER 50001;
    CATALOG DATABASE TargetDB AT NODE TargetND;
  2. Connect to the SourceDB locally on the source database server.
     CONNECT TO SourceDB;
  3. Perform a DB2 export from table on the source database server.
     EXPORT TO DATAMARTS.SCENARIO_CALENDAR.DEL OF DEL MESSAGES 
    DATAMARTS.SCENARIO_CALENDAR.msg
    SELECT * FROM DATAMARTS.SCENARIO_CALENDAR;
  4. Connect remotely to TargetDB from the source database server.
    CONNECT TO TargetDB user <username> using <Password>;
  5. Perform a load or import remotely from source to target.
    LOAD CLIENT FROM DATAMARTS.SCENARIO_CALENDAR.DEL OF DEL SAVECOUNT 10000 MESSAGES 
           DATAMARTS.SCENARIO_CALENDAR.LOAD.msg INSERT INTO DATAMARTS.SCENARIO_CALENDAR;
  6. If you used the LOAD command, perform a SET INTEGRITY at the end of the operation.
    SET INTEGRITY FOR DATAMARTS.SCENARIO_CALENDAR IMMEDIATE CHECKED;
  7. Perform a RUNSTATS to keep the statistics up to date.
     RUNSTATS ON TABLE DATAMARTS.SCENARIO_CALENDAR WITH DISTRIBUTION 
    AND DETAILED INDEXES ALL;
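
Because every command in this technique runs on the source server, the steps can also be scripted there. The following is a minimal sketch, assuming the node and port shown above; the LOAD CLIENT option tells the target server to read the data file from this (client) machine, and credentials are placeholders.

    # --- On the source database server ---
    db2 CATALOG TCPIP NODE TargetND REMOTE TargetDBServer.ibm.com SERVER 50001
    db2 CATALOG DATABASE TargetDB AT NODE TargetND
    db2 CONNECT TO SourceDB
    db2 "EXPORT TO DATAMARTS.SCENARIO_CALENDAR.DEL OF DEL MESSAGES DATAMARTS.SCENARIO_CALENDAR.msg \
         SELECT * FROM DATAMARTS.SCENARIO_CALENDAR"
    db2 CONNECT RESET

    # Remote connection to the target; the data file stays on the source server
    db2 CONNECT TO TargetDB USER <username> USING <password>
    db2 "LOAD CLIENT FROM DATAMARTS.SCENARIO_CALENDAR.DEL OF DEL SAVECOUNT 10000 \
         MESSAGES DATAMARTS.SCENARIO_CALENDAR.LOAD.msg INSERT INTO DATAMARTS.SCENARIO_CALENDAR"
    db2 "SET INTEGRITY FOR DATAMARTS.SCENARIO_CALENDAR IMMEDIATE CHECKED"
    db2 "RUNSTATS ON TABLE DATAMARTS.SCENARIO_CALENDAR WITH DISTRIBUTION AND DETAILED INDEXES ALL"
    db2 CONNECT RESET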

Technique #3

Export data from remote database server and locally load data to target database server.

Figure 3. Technique #3
diagram: export data from source database over the network, then locally import or load at target database

Follow these steps to implement this technique.

  1. Catalog the source database in the target database server.
     CATALOG TCPIP NODE SourceND REMOTE SourceDBServer.ibm.com SERVER 50001;
    CATALOG DATABASE SourceDB AT NODE SourceND;
  2. Connect to the source database remotely from the target server.
    CONNECT TO SourceDB user <username> using <password>;
  3. Perform a DB2 export from table remotely.
    EXPORT TO DATAMARTS.SCENARIO_CALENDAR.DEL OF DEL MESSAGES 
    DATAMARTS.SCENARIO_CALENDAR.msg
            SELECT * FROM DATAMARTS.SCENARIO_CALENDAR;
  4. Connect to the TargetDB locally on the target database server.
     CONNECT TO TargetDB user <username> using <Password>;
  5. Perform a local load or import.
     LOAD FROM DATAMARTS.SCENARIO_CALENDAR.DEL OF DEL SAVECOUNT 10000 MESSAGES 
         DATAMARTS.SCENARIO_CALENDAR.LOAD.msg INSERT INTO DATAMARTS.SCENARIO_CALENDAR;
  6. If you used the LOAD command, perform a SET INTEGRITY at the end of the operation.
     SET INTEGRITY FOR DATAMARTS.SCENARIO_CALENDAR IMMEDIATE CHECKED;
  7. Perform a RUNSTATS to keep the statistics up to date.
      RUNSTATS ON TABLE DATAMARTS.SCENARIO_CALENDAR WITH DISTRIBUTION 
    AND DETAILED INDEXES ALL;
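
In this technique both the remote export and the local load run on the target server, so the exported file lands there without a separate transfer. A minimal sketch follows, assuming the node and port shown above; the WITH UR clause keeps the export from holding read locks on the production source, as discussed in the pros and cons later, and credentials are placeholders.

    # --- On the target database server ---
    db2 CATALOG TCPIP NODE SourceND REMOTE SourceDBServer.ibm.com SERVER 50001
    db2 CATALOG DATABASE SourceDB AT NODE SourceND
    db2 CONNECT TO SourceDB USER <username> USING <password>
    db2 "EXPORT TO DATAMARTS.SCENARIO_CALENDAR.DEL OF DEL MESSAGES DATAMARTS.SCENARIO_CALENDAR.msg \
         SELECT * FROM DATAMARTS.SCENARIO_CALENDAR WITH UR"
    db2 CONNECT RESET
    db2 CONNECT TO TargetDB
    db2 "LOAD FROM DATAMARTS.SCENARIO_CALENDAR.DEL OF DEL SAVECOUNT 10000 \
         MESSAGES DATAMARTS.SCENARIO_CALENDAR.LOAD.msg INSERT INTO DATAMARTS.SCENARIO_CALENDAR"
    db2 "SET INTEGRITY FOR DATAMARTS.SCENARIO_CALENDAR IMMEDIATE CHECKED"
    db2 CONNECT RESET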

Technique #4

Export data to an operating system pipe and load data from pipe to target remote database server.

Figure 4. Technique #4
diagram: export data over network to an OS pipe, then import/load from OS pipe to target database

Follow these steps to implement this technique.

  1. Catalog the source database in the target database server:
    CATALOG TCPIP NODE SourceND REMOTE SourceDBServer.ibm.com SERVER 50001;
    CATALOG DATABASE SourceDB AT NODE SourceND;
  2. Create an operating system pipe in the target database server.
     mkfifo datapipe
     ls –ltr datapipe
     prw-r--r--  1 bculinux bcuigrp     0 2011-09-18 16:32 datapipe
  3. Connect to the source database remotely from the target server.
     CONNECT TO SourceDB user <username> using <password>;
  4. Export data from the source database and write it to one end of the OS pipe (datapipe). In this business scenario, the database administration team refreshed UAT from production for the 2011 period only, which amounted to 12,904,084 records.
     EXPORT TO datapipe OF DEL MODIFIED BY COLDEL, MESSAGES
         FACT_CUST_FPI_VALIDATION.EXP.msg SELECT * FROM DATAMARTS.F_CUST_FPI_VALIDATION 
         WHERE REC_LOAD_DT > '2011-01-01-00.00.00.000000' WITH UR;
  5. Connect to the target database locally on the target server.
    CONNECT TO TargetDB user <username> using <password>;
  6. Import or load the data from the other end of the OS pipe into the target hash-distributed, range-partitioned table.
    IMPORT FROM datapipe OF DEL MODIFIED BY COLDEL, MESSAGES
          FACT_CUST_FPI_VALIDATION.IMP.msg INSERT INTO
          DATAMARTS.FACT_CUST_FPI_VALIDATION;
                
     LOAD FROM datapipe OF DEL MODIFIED BY COLDEL, MESSAGES
          FACT_CUST_FPI_VALIDATION.LD.msg INSERT INTO
          DATAMARTS.FACT_CUST_FPI_VALIDATION;

    Note: Be cautious when using the load command in a partitioned database having many range partitions. Load may fail with SQLCODE SQL0973N, and you may lose the data in the target table.

If you take an application snapshot of the target database, you will see the insert occurring against the target table. Until the data transfer is complete, both the export to the pipe and the import from the pipe will be running on the server.

The following listings summarize the status at source and target after completion:

Listing 1. From the source database connection
db2 "EXPORT TO datapipe OF DEL MODIFIED BY COLDEL, MESSAGES FPI_VALIDATION.EXP.msg 
		SELECT * FROM DATAMARTS.F_CUST_FPI_VALIDATION WHERE REC_LOAD_DT >
		'2011-01-01-00.00.00.000000' WITH UR"
Number of rows exported: 	12904084
Listing 2. From the target database connection
db2 "IMPORT FROM datapipe OF DEL MODIFIED BY COLDEL, MESSAGES
        FPI_VALIDATION.IMP.msg INSERT INTO
		DATAMARTS.FACT_CUST_FPI_VALIDATION"
Number of rows read         	= 12904084
Number of rows skipped      	= 0
Number of rows inserted     	= 12904084
Number of rows updated    		= 0
Number of rows rejected     	= 0
Number of rows committed		= 12904084

To remove the operating system pipe after completion, use the following command:

rm datapipe
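
Because a pipe only buffers a small amount of data, the export (writer) and the import or load (reader) must run at the same time, each with its own database connection. The following is a minimal sketch of how the two sides can be driven from the target server in two separate shell sessions; the pipe path, message file names, and credentials are placeholders.

    # Session 1 on the target server: remote connection to the source, writing to the pipe
    db2 CONNECT TO SourceDB USER <username> USING <password>
    db2 "EXPORT TO /home/bculinux/datapipe OF DEL MODIFIED BY COLDEL, MESSAGES export.msg \
         SELECT * FROM DATAMARTS.F_CUST_FPI_VALIDATION \
         WHERE REC_LOAD_DT > '2011-01-01-00.00.00.000000' WITH UR"
    db2 CONNECT RESET

    # Session 2 on the target server: local connection to the target, reading from the pipe
    db2 CONNECT TO TargetDB
    db2 "IMPORT FROM /home/bculinux/datapipe OF DEL MODIFIED BY COLDEL, MESSAGES import.msg \
         INSERT INTO DATAMARTS.FACT_CUST_FPI_VALIDATION"
    db2 CONNECT RESET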

Limitations and workarounds for technique #4

When loading data into a heavily range-partitioned table in a database partitioned environment, you may encounter the following error.

SQL0973N Not enough storage is available in the "UTIL_HEAP_SZ" heap to process the statement. SQLSTATE=57011

Here are some ways to work around the problem:

  1. Increase the UTIL_HEAP_SZ database configuration parameter value, up to a maximum of 524288 4K pages. Then force all applications and reactivate the database.
  2. Decrease the DATA BUFFER value on the LOAD command (commands for these first two workarounds are shown in the sketch following Listing 3).
  3. If the previous two steps don't enable you to complete the load, perform a load terminate.
  4. Create a temporary table DATAMARTS.FACT_CUST_FPI_VALIDATION_TMP without range partitions and load it from the pipe with the NONRECOVERABLE clause to make the load faster.
  5. Upon completion, perform an INSERT INTO ... SELECT * FROM.... You can perform a NOT LOGGED INITIALLY insert to improve the performance of the data loads. The listing below shows such an insert.
Listing 3. NOT LOGGED INITIALLY inserts
db2 +c "ALTER TABLE DATAMARTS.FACT_CUST_FPI_VALIDATION
     ACTIVATE NOT LOGGED INITIALLY"
db2 "INSERT INTO DATAMARTS.FACT_CUST_FPI_VALIDATION SELECT * 
     FROM DATAMARTS.FACT_CUST_FPI_VALIDATION_TMP"

Note: If a NOT LOGGED INITIALLY insert fails for some reason, you must recreate the table.
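
For the first two workarounds, the corresponding commands are shown below as a minimal sketch, assuming the target database is named TargetDB; the DATA BUFFER value is an illustrative figure in 4K pages, not a recommendation.

    # Workaround 1: raise the utility heap, then recycle the database so the new value is used
    db2 UPDATE DB CFG FOR TargetDB USING UTIL_HEAP_SZ 524288
    db2 FORCE APPLICATION ALL
    db2 DEACTIVATE DB TargetDB
    db2 ACTIVATE DB TargetDB

    # Workaround 2: limit the memory the load itself requests with the DATA BUFFER clause
    db2 "LOAD FROM datapipe OF DEL MODIFIED BY COLDEL, MESSAGES FACT_CUST_FPI_VALIDATION.LD.msg \
         INSERT INTO DATAMARTS.FACT_CUST_FPI_VALIDATION DATA BUFFER 2048"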

Technique #5

Export data on the local database server in parallel (each part residing in its respective partition file system), transfer the data files, and then load the parts locally in parallel on the target database server.

Figure 5. Technique #5
diagram: export data from multiple partitions, copy files over network, then import/load into partitions at target database

At the source database server, perform the following steps to export data.

  1. Create soft links from the admin node export directory to all the data nodes. In this example, the export directory is $HOME/db2backup/exports.
    ln -s /db2fs/bculinux/NODE0001 NODE1
    ln -s /db2fs/bculinux/NODE0002 NODE2
    .........
    ln -s /db2fs/bculinux/NODE0040 NODE40
  2. The next listing shows the resulting files after creating the soft links.
    ls –ltr
    lrwxrwxrwx 1 bculinux bcuigrp  24 2011-04-13 19:25 NODE1 -> /db2fs/bculinux/NODE0001
    lrwxrwxrwx 1 bculinux bcuigrp  24 2011-04-13 19:25 NODE2 -> /db2fs/bculinux/NODE0002
    .........
    lrwxrwxrwx 1 bculinux bcuigrp  24 2011-04-13 19:28 NODE40 -> /db2fs/bculinux/NODE0040
  3. On each physical data node server, create a directory structure similar to the following (a scripted version of steps 1 through 3 is shown after this list).
    mkdir –p /db2fs/bculinux/NODE0001/exports/datamarts
    mkdir –p /db2fs/bculinux/NODE0002/exports/datamarts
    .........
    mkdir –p /db2fs/bculinux/NODE0040/exports/datamarts
  4. Find the hash partition columns from SYSCAT.COLUMNS for the table that needs to be exported.
     db2 "SELECT SUBSTR(COLNAME,1,20) COLNAME, PARTKEYSEQ FROM
           SYSCAT.COLUMNS WHERE TABNAME=''F_CUST_PROFTBLTY_TMP' AND
           TABSCHEMA='DATAMARTS'"
           
     COLNAME              PARTKEYSEQ
    -------------------- ----------
    ACCT_KEY                      0
    BUSS_UNIT_KEY                 1
    CALENDAR_MONTH_KEY            2
    CRNCY_CODE                    0

    This table has two distribution key (hash partitioning) columns, indicated by nonzero PARTKEYSEQ values. We use one of them in the DBPARTITIONNUM predicate so that each partition exports only its own data.

  5. Export data across all the partitions in parallel using the DB2 EXPORT command as shown in this listing.
     db2_all "\"|| db2 \"EXPORT TO 
    $HOME/db2backup/exports/NODE##/exports/datamarts/
         DATAMARTS.F_CUST_PROFTBLTY_TMP.del OF DEL SELECT * FROM 
         DATAMARTS.F_CUST_PROFTBLTY_TMP WHERE DBPARTITIONNUM 
    (BUSS_UNIT_KEY)=##\""

    This command exports each partition's data to its respective partition node.

  6. Perform file copy from each source database server node to the target database server using scp.
     scp -p <File Name>  <User Name>@<Host Name>:<File Name>
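
The per-node setup in steps 1 through 3 can be scripted rather than typed node by node. The following is a minimal shell sketch, assuming 40 data partitions and the paths shown above; adjust the range, paths, and partition numbers to your environment.

    # On the admin node: create the NODE<n> soft links under the export directory
    cd $HOME/db2backup/exports
    for n in $(seq 1 40); do
        ln -s /db2fs/bculinux/$(printf 'NODE%04d' $n) NODE$n
    done

    # On each physical data node: create the export directories for the partitions
    # that the node hosts (the partition numbers per host are listed in db2nodes.cfg)
    for n in <partition numbers on this host>; do     # for example: for n in 1 2 3 4
        mkdir -p /db2fs/bculinux/$(printf 'NODE%04d' $n)/exports/datamarts
    done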

At the target database server, perform the following steps to load the data.

  1. Create soft links and directories as illustrated on the source database server.
  2. Load each partition's data in parallel using the DB2 LOAD command, as shown in Listing 4.
    Listing 4. Parallel LOAD command
     db2_all "<<-0<<\" db2 -v \"LOAD FROM db2backup/exports/NODE##/exports/datamarts
        DATAMARTS.F_CUST_PROFTBLTY_TMP.del OF DEL INSERT INTO
        DATAMARTS.F_CUST_PROFTBLTY_TMP NONRECOVERABLE \""

    This command loads each partition's data on its respective partition node.


Pros and cons of each technique

Table 2 highlights the benefits and drawbacks of these techniques.

Table 2. Pros and Cons
1. Export locally and load locally
    Pros:
      • Easy to use
      • Efficient for dimension table data that resides on the admin node
    Cons:
      • Requires file system space to hold the export data file on both the source and target servers
      • If the data is distributed across all the data nodes in the warehouse, the data node system resources cannot be used efficiently, so this is not a preferred method for refreshing big fact tables

2. Export locally and load remotely
    Pros:
      • Easy to use
    Cons:
      • Requires file system space on the source server
      • A load driven remotely from the source server consumes the source server's system resources, which is usually undesirable because the source server is typically the production server

3. Export remotely and load locally
    Pros:
      • Easy to use
      • Mostly uses the target server's system resources, so there is little impact on production performance, assuming only reports run on the production system and the export uses the uncommitted read isolation level
    Cons:
      • Requires file system space on the target server

4. Export and load using a pipe
    Pros:
      • Does not require file system space to store the data file on either the source or the target server
    Cons:
      • If the pipe breaks partway through, the only way to make progress is to restart the export and the load from the beginning

5. Parallel export and parallel load
    Pros:
      • Considerably faster than any other technique for big fact tables with data distributed across all the partitions
      • Does not need a large chunk of disk space in any one file system
      • Makes use of space across each partition node's file system
      • Makes use of system resources evenly across all the data nodes
    Cons:
      • Requires file system space to store the exported data on the source as well as the target data partition node file systems
      • Requires creating soft links and a directory structure for each partition on both the source and target servers

Conclusion

This article has described how to use the DB2 data movement utilities to meet requirements in a balanced warehouse environment, with special considerations for multi-partitioned fact tables. You should now have a better understanding of the issues that can occur when working with huge fact tables that are heavily range partitioned. If you work with LBAC-protected data, refer to the IBM Information Center for additional considerations.


Acknowledgement

Special thanks to Nelson Coish, President, Coish Consulting Inc for helping the author to complete and fine-tune the article.

