Introduction to the DB2 Continuous Data Ingest feature

Getting started with new DB2 INGEST command

IBM® DB2® 10.1 for Linux®, Unix® and Windows® has introduced the Continuous Data Ingest feature. The INGEST command provides an alternative to the known methods of previous DB2 versions: IMPORT and LOAD. In this article you will understand the main Ingest concepts and find examples and situations where this tool will help you to populate your databases.

Share:

Wiliam Righi (righipw@br.ibm.com), DB2 Database Administrator, IBM

Author Photo of Wiliam RighiWiliam Righi is a database specialist working with DB2 for more than 10 years. He's always been very enthusiastic about new DB2 features. Since he has joined IBM in 2005, Bill has been supporting internal and external accounts in the IBM GTS organization. Currently, Bill is focused in root cause analysis of major incidents and shell scripting development for IBM customers. Whenever he gets a chance, Bill likes to play soccer, read about spiritualism matters, and hang out with his family.



18 April 2013

Also available in Russian

Introduction

IBM DB2 10.1 for Linux, UNIX, and Windows has introduced the concept of Continuous Data Ingest (CDI). This new feature provides the ability to move data into DB2 tables with no impact to running applications. The main target of this technology is to feed near-real-time data marts and data warehouses. In such environments, the up-to-date data is so critical that customers can not wait for the next load windows to get access to newer data.

The CDI feature is implemented by the Ingest tool, a client-side utility that uses parallel sessions to insert data into the tables. You have the flexibility to deploy your solution on the server or client side. The tool also provides parameters to control the execution, such as defining the number of parallel threads that will insert or format data. This new method is supported by single and multi-partitioned databases, Purescale environments, and HADR-enabled databases.

In this article you will start with the basics—your first INGEST command, then explore the formatting features, setting up the Ingest for performance and concurrency and then explain the architecture of the process. Finally you'll see some real scenarios where CDI may help customers to achieve their business goals.


Your first INGEST command

Before you start using the INGEST command, you need to create the control table (SYSTOOLS.INGESTRESTART) used by the tool to recover and restart failed operations. This table is created using the procedure shown in Listing 1.

Listing 1. Creating Ingest control tables
db2 'call sysproc.sysinstallobjects ( 'INGEST' ,'C' , null, null )';

Once the control table is created, the simplest method to invoke the Ingest utility is using a command similar to what is shown in Listing 2.

Listing 2. First INGEST command
db2 'ingest from file tab1.del format delimited insert into tab1';

This command will insert into tab1 all the rows read from tab1.del file. In this example, tab1.del is an ASCII file delimited by comma (,) that can be produced by any tool like DB2 EXPORT.

Tool restrictions

  • The INGEST command does not support the DB2 IXF format like the other tools. Currently, the supported input formats are ASCII delimited or positional.
  • XML and LOB data types are also not supported. If you need to load these data types, use LOAD or IMPORT instead.
  • Ingest cannot be used if you need to load GENERATED ALWAYS or SYSTEM_TIME columns. In such cases, you may use the LOAD tool with the identityoverride clause.

Parsing input file

In many cases, the input file is not in a format to be processed directly by the database as seen in the first example. In such cases, Ingest provides ways for you to parse the input file. Basically, the two types of input files are delimited and positional.

Working with delimited files

In delimited input files, the fields and rows are separated by a special character. By default, the fields are delimited by comma (,) and every new line in the input file represents a new record to be inserted. If your input file has any other delimiter character, you can use the command shown in Listing 3 to specify it.

Listing 3. Using the delimited by clause
db2 "ingest from file tab1.del format delimited by ';' insert into tab1";

Formatting the input values

If the input values are not in the format expected by DB2, you can use additional clauses of INGEST command to describe the input file format. This is very useful when working with files from other systems that may have different ways to represent data such as timestamps, numbers, and so on, as shown in Listing 4.

Listing 4. Formatting input values
db2 'ingest from file tab1.del 
     format delimited (
         $col1 INTEGER EXTERNAL,
         $col2 CHAR(10),
         $col3 DATE 'dd-mm-yyyy'
     ) insert into tab1 values ($col1, $col2, $col3)';

Tip: The EXTERNAL clause must be used in numeric data types when the values are specified in the ASCII (not binary) representation.

Additionally, the format clause allows you to make simple transformations in the input values (such as sums or concatenations) or even omit some input fields, as shown in Listing 5.

Listing 5. Formatting input values
db2 'ingest from file tab1.del 
     format delimited (
         $fname CHAR(30),
         $lname CHAR(30),
         $birth DATE 'dd-mm-yyyy'
    ) insert into tab1 (name, birthdate) 
         values ($lname CONCAT ',' CONCAT $fname, $birth)';

Working with positional files

Positional input files are not delimited by any special character. Instead, the column limits are defined by the length and position of each field in the input file, as shown in Listing 6.

Listing 6. Example of positional input file
A00    SPIFFY COMPUTER SERVICE DIV.         000010 A00      -
B01    PLANNING                             000020 A00      -
C01    INFORMATION CENTER                   000030 A00      -
D01    DEVELOPMENT CENTER                   -      A00      -
D11    MANUFACTURING SYSTEMS                000060 D01      -
D21    ADMINISTRATION SYSTEMS               000070 D01      -
E01    SUPPORT SERVICES                     000050 A00      -
E11    OPERATIONS                           000090 E01      -
E21    SOFTWARE SUPPORT                     000100 E01      -
F22    BRANCH OFFICE F2                     -      E01      -
G22    BRANCH OFFICE G2                     -      E01      -
H22    BRANCH OFFICE H2                     -      E01      -
I22    BRANCH OFFICE I2                     -      E01      -
J22    BRANCH OFFICE J2                     -      E01      -

In the example previously shown in Listing 6, you must know that the first column starts at position 1 and ends at position 7. The second column starts at position 8 and goes to position 44, and so on. If the value does not fill the whole space, blank spaces are inserted until the next value in order to make the record size fixed.

To describe a positional input file, you need to use the POSITIONAL clause and specify the limits of each column, as shown in Listing 7.

Listing 7. INGEST command with POSITIONAL clause
db2 'ingest from file tab1.del 
     format positional (
         $col1 POSITION(1:7) CHAR(7),
         $col2 POSITION(8:44) CHAR(37),
         $col3 POSITION(45:51) CHAR(7),
	 $col4 POSITION(52:59) CHAR(9),
	 $col5 POSITION(60:69) CHAR(9)
     ) insert into tab1 values ($col1, $col2, $col3, $col4, $col5)';

Handling NULLs in a positional file

In positional files, you may have problems to work with null values. In the positional input file example shown previously in Listing 7, there was a special character to represent NULL values: '-' (dash). By default, the Ingest utility will not interpret this as a null value, but as a string containing a single character ('-'). In such cases, you must have a way to inform the Ingest utility to consider this special character as the NULL indicator. You can do this by using the DEFAULTIF clause, as shown in Listing 8.

Listing 8. Handling NULLs in positional ingest operations
db2 'ingest from file tab1.del 
     format positional (
         $col1 POSITION(1:7) CHAR(7),
         $col2 POSITION(8:44) CHAR(37),
         $col3 POSITION(45:51) CHAR(7) DEFAULTIF ='-',
	 $col4 POSITION(52:59) CHAR(9),
	 $col5 POSITION(60:69) CHAR(9) DEFAULTIF ='-'
     ) insert into tab1 values ($col1, $col2, $col3, $col4, $col5)';

Using the command shown previously in Listing 8, whenever the value read from input file starts with '-', the DEFAULT value for the column will be used. The default value for nullable columns is NULL. For non-nullable columns, the default value may be specified in the CREATE or ALTER table statements.


Exploring other possibilities

Loading from multiple files at once

You can specify multiple input files for a single ingest call. This may be particularly useful in ETL jobs that have several files to process, as shown in Listing 9.

Listing 9. Specifying multiple input files
db2 'ingest from file1.del, file2.del format delimited insert into tab1';

Tip: You can use very simple shell scripts to automate the insert of multiple files, as shown in Listing 10.

Listing 10. Sample script to ingest multiple files
#Assuming file name standard INGxx
FILES=$(ls -m ING*)
db2 connect to target
db2 “ingest from file $FILES format delimited insert into tab1”

Using pipes as input

Ingesting from pipes gives you the option to avoid creating large input files if you are moving data located in the same server. For example, to move data from a database under db2inst1 into a database located under db2inst2, as shown in Listing 11.

Listing 11. Ingesting data from pipes
# From db2inst1 session 
mkfifo /tmp/fifo   
db2 'export to /tmp/fifo of del select * from source_table'
	
# From db2inst2 session
db2 'ingest from /tmp/fifo format delimited insert into target_table'

Tip: This approach can be used in remote database connections as well. You just need to have the remote database (source) cataloged in your local directory.


Updating, deleting or merging data

In the examples given so far, you have used the ingest tool to insert (append) data into DB2 tables. Unlike the other methods available, the ingest utility provides a variety of SQL operations like update, merge and delete. The great benefit of this method is to provide ways to run these operations in parallel using ingest parallel architecture. Listing 12 shows some examples of operations with ingest.

Listing 12. Updating, deleting or merging data
# Updating rows
db2 'ingest from file tab1.del 
     format delimited (
         $col1 INTEGER EXTERNAL,
         $col2 CHAR(10),
         $col3 DATE 'dd-mm-yyyy'
     ) update tab1 set (name, date) = ($col2, $col3)
       where id=$col1'

# Using the MERGE command to insert or update
db2 'ingest from file tab1.del 
     format delimited (
         $col1 INTEGER EXTERNAL,
         $col2 CHAR(10),
         $col3 DATE 'dd-mm-yyyy'
     ) merge into tab1 on (id=$col1)
         when matched then
              update set (name, date) = ($col2, $col3)
         when not matched then
              insert values ($col1, $col2, $col3)'

# Deleting the rows read from input file
db2 'ingest from file tab1.del 
     format delimited (
         $col1 INTEGER EXTERNAL,
         $col2 CHAR(10),
         $col3 DATE 'dd-mm-yyyy'
     ) delete from tab1
           where (id=$col1)'

Restarting ingest operations

The ingest utility provides a method to restart failed operations. In order to use these options, you must create control table described in item 1. To initiate a restartable ingest operation, you need to use the RESTART NEW clause as shown in Listing 13.

Listing 13. Initiating a restartable ingest operation
db2 'ingest from file tab1.del 
     format positional (
         $col1 POSITION(1:10) INTEGER EXTERNAL,
         $col2 POSITION(11:20) CHAR(10),
         $col3 POSITION(21:30) DATE 'dd-mm-yyyy'
     ) restart new 'ingest-20130223' 
       insert into tab1 values ($col1, $col2, $col3)'

The key 'ingest-20130223' will be used to uniquely identify the ingest operation in the control table. If you don't specify any value, DB2 will generate a unique key for you, but in case of a restart, you will need to query the control table (SYSTOOLS.INGESTRESTART) to identify it, as shown in Listing 14.

Listing 14. Restarting ingest operation
db2 'ingest from file tab1.del 
     format positional (
         $col1 POSITION(1:10) INTEGER EXTERNAL,
         $col2 POSITION(11:20) CHAR(10),
         $col3 POSITION(21:30) DATE 'dd-mm-yyyy'
     ) restart continue 'ingest-20130223' 
       insert into tab1 values ($col1, $col2, $col3)'

Using the information stored in the control table, the ingest utility will check how many rows were successfully processed before the failure and restart from that point.


Modifying ingest parameters

Use the command INGEST SET to set up parameters that affect ingest performance and operation. The INGEST SET command works at the session level, so every time you connect to the database you need to set the ingest parameters again. Among the options provided, you can specify the number of formatter and flusher threads, commit counts or periods and retry counts, as shown in Listing 15.

Listing 15. Using INGEST SET command
db2 ingest set commit_count 5000;
db2 ingest set commit_period 0;
db2 'ingest from file tab1.del 
     format positional (
         $col1 POSITION(1:10) INTEGER EXTERNAL,
         $col2 POSITION(11:20) CHAR(10),
         $col3 POSITION(21:30) DATE 'dd-mm-yyyy'
     ) insert into tab1 values ($col1, $col2, $col3)'

Other options available.

  • commit_period: Default 1 second. Specify the number of seconds between COMMITs. If using a high value, make sure the locking parameters are correctly set up to support the concurrency.
  • num_flushers_per_partition: Number of sessions used by ingest to flush data into a database.
  • num_formatters: Number of formatters threads configuration parameter.
  • pipe_timeout: The pipe timeout configuration parameter.
  • retry_count: In some cases, the failed transaction may be recovered by resubmitting the command. In such cases, this parameter will indicate how many attempts will be made. Examples of recoverable failures are lock timeouts, deadlocks, connection failure (during HADR takeover for example), among others.
  • retry_period: Specify the wait time between retry attempts.
  • shm_max_size: Maximum size of shared memory configuration parameter used by the utility.

Ingest architecture

The INGEST command is a client tool that can be invoked from both client or server side. This gives you flexibility to start the INGEST command where your input files are located, as shown in Figure 1.

Figure 1. INGEST command executed from client workstation
Ingest command executed from client workstation

In this first example, the INGEST command is started by a DB2 Client running on the user's workstation. The DB2 client will be responsible for reading the files (or pipes), format the contents and perform remote INSERT operations into DB2 database. This kind of solution may be impacted by the network speed.

In order to avoid this network bottleneck, you may choose to run the INGEST command locally on the DB2 server. In this case, the input files must be copied into DB2 server before the ingest execution, as shown in Figure 2.

Figure 2. INGEST command executed locally from DB2 Server
This figure shows the Ingest command executed locally from DB2 Server

Internally, ingest architecture is made of a multi-threaded process. There are three types of threads (Transporters, Formatters, and Flushers) that are responsible for each phase of the ingest operation, as shown in Figure 3.

Figure 3. Ingest multi-threaded architecture
This figures shows the Ingest multi-threaded architecture, including transporters, formatters, and flushers.
  • Transporter is responsible for reading the input files or pipes.
  • Formatter performs basic formatting of input files as specified in the INGEST command. The number of formatters may be tuned using the num_formatters parameter of the INGEST SET command.
  • Flusher writes data into DB2 tables using array insert operations. They are also responsible for transaction control. You can adjust the number of these threads via num_flushers_per_partition parameter.

Tuning ingest performance

Due to its parallel architecture, tuning the ingest operation must be done carefully in order to achieve maximum throughput. It's also very important to understand the underlining server resources like numbers of CPUs and amount of memory available.

Locking parameters

Adjusting locking parameters is critical to ingest throughput. If LOCKLIST and MAXLOCKS parameters are not correctly set up, the ingest threads may lock each other (due to lock escalations), reducing the overall performance. You also need to consider the lock contention caused by other running transactions on the database.

As a general recommendation, you should enable DB2 Self Tuning Memory Management (STMM) for the database (SELF_TUNING_MEM=ON) and set up both LOCKLIST and MAXLOCKS to AUTOMATIC. Using this method, DB2 will adjust these parameters at run time based on the current workload.

Transaction logging

Always keep in mind the large ingest operations will produce many logs. Review all your logging parameters that affect the active log space, such as LOGFIZSZ, LOGPRIMARY and LOGSECOND to make sure it's not going to impact the ingest operation.

Also review your log archiving mechanism. As there will be more log files being produced, your log archiving process should be able to handle this new workload. If you're performing log archiving to disk, make sure the log archiving destination (disk or file system) will be able to store the higher number of logs generated.

Another concern is for HADR environments. All the data loaded by ingest will be automatically replicated to the standby database via the HADR feature. Make sure you have enough network bandwidth between primary and standby databases to support this ingest operation, otherwise the standby database may not be able to keep data in synch with primary. Usually in such cases, the standby database will be able to catch up with the primary once the ingest is finished. But while the operation is going on, you may compromise your high availability environment as the standby database won't have all the transactions applied in case of a failover.

Adjusting ingest parameters

Some of the parameters that be can set up via the INGEST SET command directly affect the performance of your data ingestion.

Avoid unwanted commits

If you plan to control commits only by commit_count parameter, make sure you set commit_period to 0. Otherwise, the default commit_period of 1 second will also apply, and your transactions will COMMIT based on both criteria.

commit_count and commit_period: If set up to very low values, the performance of the ingest will be affected due to the high number of transaction commits performed. Usually you should balance the performance and the concurrency to set up these parameters. On the other hand, setting up these parameters to a very high value may negatively affect the DB2 locking mechanism. If you omit both parameters, ingest will commit every 1 second.

num_flushers_per_partition: By default, the INGEST command will compute the optimal value for this parameter based on logical CPUs available. If you see idle CPU resources during the ingest operation and want to force the use of more processing threads, you can try to increase this parameter. For systems with Database Partition Feature using logical partitions (more than one database partition in a single physical server), be aware that this parameter takes effect at partition-level, so the number of threads will be multiplied by the number of logical partitions in the server.

num_formatters and shm_max_size: By default, ingest computes optimal values for these parameters at startup time.


Monitoring the ingest progress

Like any other utility running on DB2, you need to monitor the execution of ingest operations. INGEST LIST and INGEST GET STATS are the two commands used to monitor an ingest operation.

The INGEST LIST command

This is the basic command to check if there is any ingest operation executing on the server. Among the information provided, you can identify the table being loaded, start time, and number of rows processed so far. Another piece of important information is the job ID, which is used to get detailed information about the ingest job, as shown in Listing 16.

Listing 16. The job ID
db2inst2@asdp:~$ db2 ingest list
 
Ingest job ID               = DB21001:20130312.143314.008948:00002:00005
Ingest temp job ID          = 1
Database Name               = SAMPLE
Target table                = DB2INST2.DEPARTMENT
Input type                  = FILE
Start Time                  = 03/12/2013 14:33:14.406157
Running Time                = 00:00:06
Number of records processed = 22000

DB20000I  The INGEST LIST command completed successfully.

The INGEST GET STATS command

In order to get more details about the ingest operation, the command INGEST GET STAT is provided. This command is particularly useful to check ingest performance indicators. In the example shown in Listing 17, the command was invoked to collect performance data from job ID=1 every 1 second during the execution of the ingest.

Listing 17. INGEST GET STATS example
db2inst2@asdp:~$ db2 ingest get stats for 1 every 1 seconds
 
Ingest job ID               = DB21001:20130312.143701.363666:00002:00005
Database Name               = SAMPLE
Target table                = DB2INST2.DEPARTMENT
 
Overall           Overall           Current           Current
ingest rate       write rate        ingest rate       write rate
(records/second)  (writes/second)   (records/second)  (writes/second)   Total records
----------------- ----------------- ----------------- ----------------- -----------------
            20000              1600            100000              8000              8000
            16666              3666                 0             14000             22000
            14285              3142                 0                 0             22000
            12500              4375                 0             13000             35000
            11111              5666                 0             16000             51000
            10000              5900                 0              8000             59000
             9090              5363                 0                 0             59000
             8333              7083                 0             26000             85000
             7692              6538                 0                 0             85000
             7142              6714                 0              9000             94000
             6666              6666                 0              6000            100000

DB20000I  The INGEST GET STATS command completed successfully.

Comparing ingest to other methods

Prior to IBM DB2 10.1, you had to choose between IMPORT or LOAD utilities to move data into tables. The ingest utility provides an alternative to the other methods as it allies the high throughput of data insertion and high concurrency with ongoing database applications. Table 1 shows a very short comparison between the methods.

Table 1. Comparing ingest to other data movement methods
AttributeIMPORTINGESTLOAD
SpeedSlow. Use sequential INSERT operationsFast. Use parallel INSERT operationsVery fast. Data inserted directly into DB2 containers
ConcurrencyHigh, although some table level lockings may occurVery high, use row level lockingLow. Only read operations are allowed
LoggingYesYesNo, data inserted is not logged
Behavior on failureRollback. Table is accessibleRollback. Table is accessibleTable is put in Loading Pending state. Manual intervention is required
LOBs and XML DataYesNoYes

Use cases of Continuous Data Ingest

To make the CDI concept clear, the following are some common business situations where the ingest utility may help customers meet their objectives.

Near real-time data warehouses

In databases, specifically data marts and data warehouses, where data must be updated in a near real-time basis, the users cannot wait for the next load window to get fresh new data. Additionally, they cannot afford to have offline LOAD jobs running during business hours.

In these cases, ingest can be used to insert data into the tables without an impact to the running queries or applications. The high concurrency levels provided by ingest virtually eliminate the need for an offline window to populate the database. Additionally, the performance provided by the tool guarantees large amounts of data will be loaded with high throughput. Tests show that ingest performance is comparable to the time spent in standard approach using LOAD into staging table followed by INSERT/SELECTs into target tables.

Populate HADR databases

Using the LOAD utility in HADR databases is particularly complicated. The only way to use LOAD in an HADR database is to specify the COPY YES clause. Nevertheless, it requires a shared file system between primary and standby servers, or manual DBA intervention to keep the HADR in-synch after the LOAD.

Ingest provides a better option in this case as it's fully logged, so HADR will automatically replicate all the inserted rows into the standby server without any user intervention.


Conclusion

Decision support systems are critical to companies, and the number of users accessing BI data is increasing every day. The new data warehouse applications require fresh new data to be available for queries as soon as possible. To support such demand of online analytical data, IBM DB2 10.1 has introduced a new feature called Continuous Data Ingest (CDI). The great benefit of this feature is to provide the ability to insert data into databases and data warehouses with high throughput and minimal impact to ongoing analytical applications.

This article introduced the new ingest tool and explained how you can code your first INGEST commands, how to monitor the execution, and also presented some scenarios where this feature can replace the old methods of loading data into the data warehouses.

Resources

Learn

Get products and technologies

  • Build your next development project with IBM trial software, available for download directly from developerWorks.
  • Evaluate IBM products in the way that suits you best: Download a product trial, try a product online, use a product in a cloud environment, or spend a few hours in the SOA Sandbox learning how to implement Service Oriented Architecture efficiently.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=863747
ArticleTitle=Introduction to the DB2 Continuous Data Ingest feature
publish-date=04182013