- Your first INGEST command
- Parsing input file
- Exploring other possibilities
- Updating, deleting or merging data
- Restarting ingest operations
- Modifying ingest parameters
- Ingest architecture
- Tuning ingest performance
- Monitoring the ingest progress
- Comparing ingest to other methods
- Use cases of Continuous Data Ingest
- Downloadable resources
- Related topics
Introduction to the DB2 Continuous Data Ingest feature
Getting started with the new DB2 INGEST command
IBM DB2 10.1 for Linux, UNIX, and Windows introduces the concept of Continuous Data Ingest (CDI). This new feature provides the ability to move data into DB2 tables with no impact on running applications. The main target of this technology is feeding near-real-time data marts and data warehouses. In such environments, up-to-date data is so critical that customers cannot wait for the next load window to get access to newer data.
The CDI feature is implemented by the Ingest tool, a client-side utility that uses parallel sessions to insert data into tables. You have the flexibility to deploy your solution on either the server or the client side. The tool also provides parameters to control the execution, such as the number of parallel threads that format or insert data. This new method is supported on single-partition and multi-partition databases, DB2 pureScale environments, and HADR-enabled databases.
This article starts with the basics: your first INGEST command. It then explores the formatting features, shows how to set up ingest for performance and concurrency, and explains the architecture of the process. Finally, you'll see some real scenarios where CDI may help customers achieve their business goals.
Your first INGEST command
Before you start using the INGEST command, you need to create the control table (SYSTOOLS.INGESTRESTART) used by the tool to recover and restart failed operations. This table is created using the procedure shown in Listing 1.
Listing 1. Creating Ingest control tables
db2 "call sysproc.sysinstallobjects('INGEST', 'C', NULL, NULL)"
Once the control table is created, the simplest method to invoke the Ingest utility is using a command similar to what is shown in Listing 2.
Listing 2. First INGEST command
db2 'ingest from file tab1.del format delimited insert into tab1';
This command inserts into tab1 all the rows read from the tab1.del file. In this example, tab1.del is a comma-delimited ASCII file, such as one produced by the DB2 EXPORT utility.
- Unlike LOAD and IMPORT, the INGEST command does not support the DB2 IXF format. Currently, the supported input formats are delimited or positional ASCII.
- XML and LOB data types are also not supported. If you need to load these data types, use LOAD or IMPORT instead.
- Ingest cannot be used if you need to load GENERATED ALWAYS or SYSTEM_TIME columns. In such cases, you may use the LOAD tool with the identityoverride clause.
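If you want to try the first example end to end, you can generate a suitable input file yourself. The sketch below (assuming a hypothetical tab1 with an integer id and a character name column) builds a small comma-delimited file and sanity-checks it before it would be handed to INGEST.

```shell
#!/bin/sh
# Build a small comma-delimited input file for a hypothetical
# two-column table (id INTEGER, name CHAR): one record per line.
cat > tab1.del <<'EOF'
1,"Alpha"
2,"Beta"
3,"Gamma"
EOF

# Sanity check: every record must have exactly two comma-separated fields.
if awk -F',' 'NF != 2 { bad = 1 } END { exit bad }' tab1.del; then
  echo "tab1.del OK: $(wc -l < tab1.del) records"
fi

# With the file validated, the ingest itself would be:
#   db2 'ingest from file tab1.del format delimited insert into tab1'
```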
Parsing input file
In many cases, the input file is not in a format the database can process directly, as it was in the first example. For those cases, Ingest provides ways to parse the input file. The two basic types of input files are delimited and positional.
Working with delimited files
In delimited input files, the fields and rows are separated by a special character. By default, the fields are delimited by comma (,) and every new line in the input file represents a new record to be inserted. If your input file has any other delimiter character, you can use the command shown in Listing 3 to specify it.
Listing 3. Using the delimited by clause
db2 "ingest from file tab1.del format delimited by ';' insert into tab1";
Formatting the input values
If the input values are not in the format expected by DB2, you can use additional clauses of INGEST command to describe the input file format. This is very useful when working with files from other systems that may have different ways to represent data such as timestamps, numbers, and so on, as shown in Listing 4.
Listing 4. Formatting input values
db2 "ingest from file tab1.del
     format delimited
     ( \$col1 INTEGER EXTERNAL,
       \$col2 CHAR(10),
       \$col3 DATE 'dd-mm-yyyy' )
     insert into tab1 values (\$col1, \$col2, \$col3)"
Tip: The EXTERNAL clause must be used in numeric data types when the values are specified in the ASCII (not binary) representation.
Additionally, the format clause allows you to make simple transformations in the input values (such as sums or concatenations) or even omit some input fields, as shown in Listing 5.
Listing 5. Formatting input values
db2 "ingest from file tab1.del
     format delimited
     ( \$fname CHAR(30),
       \$lname CHAR(30),
       \$birth DATE 'dd-mm-yyyy' )
     insert into tab1 (name, birthdate)
     values (\$lname CONCAT ',' CONCAT \$fname, \$birth)"
Working with positional files
Positional input files are not delimited by any special character. Instead, the column limits are defined by the length and position of each field in the input file, as shown in Listing 6.
Listing 6. Example of positional input file
A00    SPIFFY COMPUTER SERVICE DIV.         000010 A00     -
B01    PLANNING                             000020 A00     -
C01    INFORMATION CENTER                   000030 A00     -
D01    DEVELOPMENT CENTER                   -      A00     -
D11    MANUFACTURING SYSTEMS                000060 D01     -
D21    ADMINISTRATION SYSTEMS               000070 D01     -
E01    SUPPORT SERVICES                     000050 A00     -
E11    OPERATIONS                           000090 E01     -
E21    SOFTWARE SUPPORT                     000100 E01     -
F22    BRANCH OFFICE F2                     -      E01     -
G22    BRANCH OFFICE G2                     -      E01     -
H22    BRANCH OFFICE H2                     -      E01     -
I22    BRANCH OFFICE I2                     -      E01     -
J22    BRANCH OFFICE J2                     -      E01     -
In the example previously shown in Listing 6, the first column starts at position 1 and ends at position 7. The second column starts at position 8 and runs to position 44, and so on. If a value does not fill its whole field, it is padded with blanks so that every record has the same fixed length.
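Before writing POSITION clauses, it pays to verify the column boundaries directly against a sample record. The sketch below uses printf to build one fixed-width record (the widths assumed here match the 1:7, 8:44, and 45:51 boundaries used later in Listing 7) and cut to extract each field by byte position.

```shell
#!/bin/sh
# Build one fixed-width record: 7-char DEPTNO, 37-char DEPTNAME,
# 7-char MGRNO. printf's %-Ns pads each field to its exact width.
printf '%-7s%-37s%-7s\n' 'A00' 'SPIFFY COMPUTER SERVICE DIV.' '000010' \
  > /tmp/pos_sample.txt

# Extract each column by byte position to confirm the boundaries.
echo "col1=[$(cut -c1-7  /tmp/pos_sample.txt)]"
echo "col2=[$(cut -c8-44 /tmp/pos_sample.txt)]"
echo "col3=[$(cut -c45-51 /tmp/pos_sample.txt)]"
```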
To describe a positional input file, you need to use the POSITIONAL clause and specify the limits of each column, as shown in Listing 7.
Listing 7. INGEST command with POSITIONAL clause
db2 'ingest from file tab1.del
     format positional
     ( $col1 POSITION(1:7) CHAR(7),
       $col2 POSITION(8:44) CHAR(37),
       $col3 POSITION(45:51) CHAR(7),
       $col4 POSITION(52:59) CHAR(8),
       $col5 POSITION(60:69) CHAR(10) )
     insert into tab1 values ($col1, $col2, $col3, $col4, $col5)'
Handling NULLs in a positional file
In positional files, null values can be a problem. In the positional input file example shown previously in Listing 6, a special character represents NULL values: '-' (dash). By default, the Ingest utility will not interpret this as a null value, but as a string containing a single character ('-'). In such cases, you need a way to tell the Ingest utility to treat this special character as the NULL indicator. You can do this with the DEFAULTIF clause, as shown in Listing 8.
Listing 8. Handling NULLs in positional ingest operations
db2 "ingest from file tab1.del
     format positional
     ( \$col1 POSITION(1:7) CHAR(7),
       \$col2 POSITION(8:44) CHAR(37),
       \$col3 POSITION(45:51) CHAR(7) DEFAULTIF = '-',
       \$col4 POSITION(52:59) CHAR(8),
       \$col5 POSITION(60:69) CHAR(10) DEFAULTIF = '-' )
     insert into tab1 values (\$col1, \$col2, \$col3, \$col4, \$col5)"
Using the command shown previously in Listing 8, whenever the value read from the input file starts with '-', the DEFAULT value for the column is used instead. The default value for nullable columns is NULL. For non-nullable columns, the default value can be specified in the CREATE TABLE or ALTER TABLE statement.
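Before running an ingest with DEFAULTIF, it can be useful to know how many substitutions to expect. This sketch (again building fixed-width records with printf, and assuming the '-' marker sits in the MGRNO field at positions 45:51) counts the marked records with cut and grep.

```shell
#!/bin/sh
# Build two fixed-width records; the first uses '-' as the NULL
# marker in the MGRNO field (positions 45-51).
printf '%-7s%-37s%-7s\n' 'D01' 'DEVELOPMENT CENTER' '-'      >  dept.pos
printf '%-7s%-37s%-7s\n' 'E01' 'SUPPORT SERVICES'   '000050' >> dept.pos

# Count records whose MGRNO field starts with the '-' marker: these
# are the rows where DEFAULTIF would substitute the column default.
nulls=$(cut -c45-51 dept.pos | grep -c '^-')
echo "records with NULL MGRNO: $nulls"
```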
Exploring other possibilities
Loading from multiple files at once
You can specify multiple input files for a single ingest call. This may be particularly useful in ETL jobs that have several files to process, as shown in Listing 9.
Listing 9. Specifying multiple input files
db2 'ingest from file file1.del, file2.del format delimited insert into tab1'
Tip: You can use very simple shell scripts to automate the insert of multiple files, as shown in Listing 10.
Listing 10. Sample script to ingest multiple files
# Assuming file name standard INGxx
FILES=$(ls -m ING*)
db2 connect to target
db2 "ingest from file $FILES format delimited insert into tab1"
Using pipes as input
Ingesting from pipes lets you avoid creating large intermediate files when you are moving data located on the same server. For example, Listing 11 moves data from a database under db2inst1 into a database located under db2inst2.
Listing 11. Ingesting data from pipes
# From the db2inst1 session
mkfifo /tmp/fifo
db2 'export to /tmp/fifo of del select * from source_table'

# From the db2inst2 session
db2 'ingest from pipe /tmp/fifo format delimited insert into target_table'
Tip: This approach can be used in remote database connections as well. You just need to have the remote database (source) cataloged in your local directory.
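The mechanics of that handoff can be tried without DB2 at all. The sketch below (hypothetical paths, with printf and cat standing in for the EXPORT and INGEST ends) exercises the same producer/consumer pattern over a named pipe.

```shell
#!/bin/sh
# Producer/consumer over a named pipe: the writer plays the role of
# 'db2 export ... to the pipe', the reader plays 'db2 ingest from pipe'.
PIPE=/tmp/ingest_demo_fifo
rm -f "$PIPE" /tmp/ingest_demo_out
mkfifo "$PIPE"

# Writer (background): blocks until a reader opens the pipe.
printf '1,"Alpha"\n2,"Beta"\n' > "$PIPE" &

# Reader: consumes everything the writer sends.
cat "$PIPE" > /tmp/ingest_demo_out
wait
rm -f "$PIPE"

cat /tmp/ingest_demo_out
```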
Updating, deleting or merging data
In the examples so far, you have used the ingest tool to insert (append) data into DB2 tables. Unlike the other methods available, the ingest utility supports a variety of SQL operations, such as UPDATE, MERGE, and DELETE. The great benefit is that these operations run in parallel, using the ingest parallel architecture. Listing 12 shows some examples.
Listing 12. Updating, deleting or merging data
# Updating rows
db2 "ingest from file tab1.del
     format delimited
     ( \$col1 INTEGER EXTERNAL,
       \$col2 CHAR(10),
       \$col3 DATE 'dd-mm-yyyy' )
     update tab1 set (name, date) = (\$col2, \$col3) where id = \$col1"

# Using the MERGE command to insert or update
db2 "ingest from file tab1.del
     format delimited
     ( \$col1 INTEGER EXTERNAL,
       \$col2 CHAR(10),
       \$col3 DATE 'dd-mm-yyyy' )
     merge into tab1 on (id = \$col1)
     when matched then update set (name, date) = (\$col2, \$col3)
     when not matched then insert values (\$col1, \$col2, \$col3)"

# Deleting the rows read from the input file
db2 "ingest from file tab1.del
     format delimited
     ( \$col1 INTEGER EXTERNAL,
       \$col2 CHAR(10),
       \$col3 DATE 'dd-mm-yyyy' )
     delete from tab1 where (id = \$col1)"
Restarting ingest operations
The ingest utility provides a method to restart failed operations. To use this option, you must first create the control table described at the beginning of this article (Listing 1). To initiate a restartable ingest operation, use the RESTART NEW clause as shown in Listing 13.
Listing 13. Initiating a restartable ingest operation
db2 "ingest from file tab1.del
     format positional
     ( \$col1 POSITION(1:10) INTEGER EXTERNAL,
       \$col2 POSITION(11:20) CHAR(10),
       \$col3 POSITION(21:30) DATE 'dd-mm-yyyy' )
     restart new 'ingest-20130223'
     insert into tab1 values (\$col1, \$col2, \$col3)"
The key 'ingest-20130223' will be used to uniquely identify the ingest operation in the control table. If you don't specify any value, DB2 will generate a unique key for you, but in case of a restart, you will need to query the control table (SYSTOOLS.INGESTRESTART) to identify it, as shown in Listing 14.
Listing 14. Restarting ingest operation
db2 "ingest from file tab1.del
     format positional
     ( \$col1 POSITION(1:10) INTEGER EXTERNAL,
       \$col2 POSITION(11:20) CHAR(10),
       \$col3 POSITION(21:30) DATE 'dd-mm-yyyy' )
     restart continue 'ingest-20130223'
     insert into tab1 values (\$col1, \$col2, \$col3)"
Using the information stored in the control table, the ingest utility will check how many rows were successfully processed before the failure and restart from that point.
Modifying ingest parameters
Use the INGEST SET command to set parameters that affect ingest performance and behavior. INGEST SET works at the session level, so every time you connect to the database you need to set the ingest parameters again. Among the options provided, you can specify the number of formatter and flusher threads, commit counts or periods, and retry counts, as shown in Listing 15.
Listing 15. Using INGEST SET command
db2 ingest set commit_count 5000
db2 ingest set commit_period 0
db2 "ingest from file tab1.del
     format positional
     ( \$col1 POSITION(1:10) INTEGER EXTERNAL,
       \$col2 POSITION(11:20) CHAR(10),
       \$col3 POSITION(21:30) DATE 'dd-mm-yyyy' )
     insert into tab1 values (\$col1, \$col2, \$col3)"
Other available options:
- commit_period: Default 1 second. Specifies the number of seconds between COMMITs. If you use a high value, make sure the locking parameters are set up to support the concurrency.
- num_flushers_per_partition: Number of sessions used by ingest to flush data into the database.
- num_formatters: Number of formatter threads.
- pipe_timeout: How long ingest waits for data when reading from a pipe.
- retry_count: In some cases, a failed transaction can be recovered by resubmitting the command; this parameter indicates how many attempts are made. Examples of recoverable failures are lock timeouts, deadlocks, and connection failures (during an HADR takeover, for example).
- retry_period: Specifies the number of seconds between retry attempts.
- shm_max_size: Maximum size of the shared memory used by the utility.
Ingest architecture
The INGEST command is a client tool that can be invoked from either the client or the server side. This gives you the flexibility to start the INGEST command wherever your input files are located, as shown in Figure 1.
Figure 1. INGEST command executed from client workstation
In this first example, the INGEST command is started by a DB2 client running on the user's workstation. The DB2 client is responsible for reading the files (or pipes), formatting the contents, and performing remote INSERT operations into the DB2 database. This kind of solution may be limited by network speed.
To avoid this network bottleneck, you can run the INGEST command locally on the DB2 server. In this case, the input files must be copied to the DB2 server before the ingest runs, as shown in Figure 2.
Figure 2. INGEST command executed locally from DB2 Server
Internally, ingest is a multi-threaded process. Three types of threads (transporters, formatters, and flushers) are responsible for the phases of the ingest operation, as shown in Figure 3.
Figure 3. Ingest multi-threaded architecture
- Transporter is responsible for reading the input files or pipes.
- Formatter performs basic formatting of input files as specified in the INGEST command. The number of formatters may be tuned using the num_formatters parameter of the INGEST SET command.
- Flusher writes data into DB2 tables using array insert operations. Flushers are also responsible for transaction control. You can adjust the number of these threads via the num_flushers_per_partition parameter.
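The division of labor among the three thread types can be mimicked with an ordinary shell pipeline. In this toy sketch (file names and the uppercase transformation are invented for illustration), cat plays the transporter, awk the formatter, and the final loop the flusher:

```shell
#!/bin/sh
# Toy three-stage pipeline mirroring the ingest thread roles.
rm -f in.txt out.txt
printf '1 alpha\n2 beta\n3 gamma\n' > in.txt

cat in.txt |                                    # transporter: reads input
  awk '{ printf "%s,%s\n", $1, toupper($2) }' | # formatter: reshapes records
  while read -r rec; do                         # flusher: writes rows out
    echo "$rec" >> out.txt
  done

cat out.txt
```

Real ingest differs in one important way: each stage runs as multiple concurrent threads rather than a single process, which is what the num_formatters and num_flushers_per_partition parameters control.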
Tuning ingest performance
Due to its parallel architecture, an ingest operation must be tuned carefully to achieve maximum throughput. It's also very important to understand the underlying server resources, such as the number of CPUs and the amount of memory available.
Adjusting locking parameters is critical to ingest throughput. If the LOCKLIST and MAXLOCKS parameters are not correctly set, the ingest threads may block each other (due to lock escalations), reducing overall performance. You also need to consider lock contention caused by other transactions running on the database.
As a general recommendation, you should enable DB2 Self Tuning Memory Management (STMM) for the database (SELF_TUNING_MEM=ON) and set up both LOCKLIST and MAXLOCKS to AUTOMATIC. Using this method, DB2 will adjust these parameters at run time based on the current workload.
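As a concrete starting point, the commands below sketch that configuration against the SAMPLE database (the database name is a placeholder; substitute your own):

```shell
# Enable self-tuning memory and let DB2 size the lock memory automatically.
# 'sample' is a placeholder database name.
db2 connect to sample
db2 update db cfg using SELF_TUNING_MEM ON
db2 update db cfg using LOCKLIST AUTOMATIC MAXLOCKS AUTOMATIC
db2 connect reset
```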
Always keep in mind that large ingest operations produce a large volume of transaction logs. Review the logging parameters that affect the active log space, such as LOGFILSIZ, LOGPRIMARY, and LOGSECOND, to make sure logging is not going to throttle the ingest operation.
Also review your log archiving mechanism. As there will be more log files being produced, your log archiving process should be able to handle this new workload. If you're performing log archiving to disk, make sure the log archiving destination (disk or file system) will be able to store the higher number of logs generated.
Another concern applies to HADR environments. All the data loaded by ingest is automatically replicated to the standby database via the HADR feature. Make sure you have enough network bandwidth between the primary and standby databases to support the ingest operation; otherwise the standby database may not be able to keep its data in sync with the primary. Usually the standby database will catch up with the primary once the ingest has finished, but while the operation is running you may compromise your high availability environment, because the standby will not have all the transactions applied in case of a failover.
Adjusting ingest parameters
Some of the parameters that can be set via the INGEST SET command directly affect the performance of your data ingestion.
commit_count and commit_period: If set to very low values, ingest performance suffers because of the high number of transaction commits performed. You should usually balance performance against concurrency when setting these parameters. On the other hand, setting them to very high values may negatively affect the DB2 locking mechanism. If you omit both parameters, ingest commits every 1 second.
num_flushers_per_partition: By default, the INGEST command will compute the optimal value for this parameter based on logical CPUs available. If you see idle CPU resources during the ingest operation and want to force the use of more processing threads, you can try to increase this parameter. For systems with Database Partition Feature using logical partitions (more than one database partition in a single physical server), be aware that this parameter takes effect at partition-level, so the number of threads will be multiplied by the number of logical partitions in the server.
num_formatters and shm_max_size: By default, ingest computes optimal values for these parameters at startup time.
Monitoring the ingest progress
Like any other utility running on DB2, you need to monitor the execution of ingest operations. INGEST LIST and INGEST GET STATS are the two commands used to monitor an ingest operation.
The INGEST LIST command
This is the basic command to check if there is any ingest operation executing on the server. Among the information provided, you can identify the table being loaded, start time, and number of rows processed so far. Another piece of important information is the job ID, which is used to get detailed information about the ingest job, as shown in Listing 16.
Listing 16. The job ID
db2inst2@asdp:~$ db2 ingest list

Ingest job ID               = DB21001:20130312.143314.008948:00002:00005
Ingest temp job ID          = 1
Database Name               = SAMPLE
Target table                = DB2INST2.DEPARTMENT
Input type                  = FILE
Start Time                  = 03/12/2013 14:33:14.406157
Running Time                = 00:00:06
Number of records processed = 22000

DB20000I  The INGEST LIST command completed successfully.
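When scripting around ingest, you often need only the job ID out of that output. The sketch below parses a canned copy of the listing with awk (in practice you would pipe the output of db2 ingest list straight into the same awk command):

```shell
#!/bin/sh
# Extract the ingest job ID from captured INGEST LIST output.
# Here the output is a canned copy so the sketch runs without DB2.
cat > /tmp/ingest_list_out <<'EOF'
Ingest job ID               = DB21001:20130312.143314.008948:00002:00005
Ingest temp job ID          = 1
Database Name               = SAMPLE
Target table                = DB2INST2.DEPARTMENT
EOF

# Everything after '= ' on the 'Ingest job ID' line is the ID itself.
JOB_ID=$(awk -F'= ' '/^Ingest job ID /{ print $2 }' /tmp/ingest_list_out)
echo "$JOB_ID"
```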
The INGEST GET STATS command
To get more details about an ingest operation, the INGEST GET STATS command is provided. This command is particularly useful for checking ingest performance indicators. In the example shown in Listing 17, the command was invoked to collect performance data for job ID 1 every second during the execution of the ingest.
Listing 17. INGEST GET STATS example
db2inst2@asdp:~$ db2 ingest get stats for 1 every 1 seconds

Ingest job ID = DB21001:20130312.143701.363666:00002:00005
Database Name = SAMPLE
Target table  = DB2INST2.DEPARTMENT

Overall           Overall           Current           Current
ingest rate       write rate        ingest rate       write rate
(records/second)  (writes/second)   (records/second)  (writes/second)   Total records
----------------- ----------------- ----------------- ----------------- -----------------
20000             1600              100000            8000              8000
16666             3666              0                 14000             22000
14285             3142              0                 0                 22000
12500             4375              0                 13000             35000
11111             5666              0                 16000             51000
10000             5900              0                 8000              59000
9090              5363              0                 0                 59000
8333              7083              0                 26000             85000
7692              6538              0                 0                 85000
7142              6714              0                 9000              94000
6666              6666              0                 6000              100000

DB20000I  The INGEST GET STATS command completed successfully.
Comparing ingest to other methods
Prior to IBM DB2 10.1, you had to choose between the IMPORT and LOAD utilities to move data into tables. The ingest utility provides an alternative that combines high insert throughput with high concurrency for ongoing database applications. Table 1 shows a brief comparison of the methods.
Table 1. Comparing ingest to other data movement methods
|                    |IMPORT                                            |INGEST                                 |LOAD                                                               |
|Speed               |Slow. Uses sequential INSERT operations           |Fast. Uses parallel INSERT operations  |Very fast. Data is inserted directly into DB2 containers           |
|Concurrency         |High, although some table-level locking may occur |Very high; uses row-level locking      |Low. Only read operations are allowed                              |
|Logging             |Yes                                               |Yes                                    |No, inserted data is not logged                                    |
|Behavior on failure |Rollback. Table remains accessible                |Rollback. Table remains accessible     |Table is put in Load Pending state; manual intervention is required|
|LOBs and XML data   |Yes                                               |No                                     |Yes                                                                |
Use cases of Continuous Data Ingest
To make the CDI concept clear, the following are some common business situations where the ingest utility may help customers meet their objectives.
Near real-time data warehouses
In data marts and data warehouses where data must be updated on a near real-time basis, users cannot wait for the next load window to get fresh data. Additionally, they cannot afford to have offline LOAD jobs running during business hours.
In these cases, ingest can insert data into the tables without impacting running queries or applications. The high concurrency provided by ingest virtually eliminates the need for an offline window to populate the database. Additionally, the tool's performance ensures that large amounts of data are loaded with high throughput. Tests show that ingest performance is comparable to the standard approach of a LOAD into a staging table followed by INSERT from SELECT into the target tables.
Populate HADR databases
Using the LOAD utility in HADR databases is particularly complicated. The only way to use LOAD in an HADR database is to specify the COPY YES clause. However, that requires a shared file system between the primary and standby servers, or manual DBA intervention to keep HADR in sync after the LOAD.
Ingest provides a better option in this case as it's fully logged, so HADR will automatically replicate all the inserted rows into the standby server without any user intervention.
Decision support systems are critical to companies, and the number of users accessing BI data increases every day. New data warehouse applications require fresh data to be available for queries as soon as possible. To support this demand for online analytical data, IBM DB2 10.1 introduced a new feature called Continuous Data Ingest (CDI). The great benefit of this feature is the ability to insert data into databases and data warehouses with high throughput and minimal impact on ongoing analytical applications.
This article introduced the new ingest tool and explained how you can code your first INGEST commands, how to monitor the execution, and also presented some scenarios where this feature can replace the old methods of loading data into the data warehouses.
- Refer to the IBM DB2 10.1 Information Center, which provides extended references for all the tools and commands presented in this article.
- Read the redbook Unleashing DB2 10 for Linux, UNIX, and Windows, where you'll find information about new features introduced in IBM DB2 10.1.