No replies
17917 Posts

Pinned topic A quick overview of DB2′s INGEST

‏2013-03-06T03:26:44Z |
INGEST from DB2 10 is fast and it enables us to continuously load data while giving high availability and recoverability. INGEST is a cross between LOAD and IMPORT. INGEST is optimized to be multithread to process data in parallel. Unlike LOAD and IMPORT, INGEST can run continuously reading a data stream through a pipe. This is great, if you are loading live feeds of data from a social media stream or from your production stream. Another feature of INGEST is, it can read multiple files as its source to load the data. Apart from the traditional INSERT statement, it can do UPDATE, MERGE and DELETE. This comes in handy if the data you are loading contain updates to existing rows .
How does it work ?

INGEST has three threads/parts :

1. Transporter : This thread reads from the input files or data stream pipe s in parallel and sets the thread to formatter. DB2 spawns one thread for each input files.

2. Formatter: Formatter threads formats and prepares the data to be send to Flusher. Formatter threads make sure the data is set in such a way that INGEST can insert into DB2 in parallel. The number of formatter threads is specified by the num_formatters configuration parameter.

3. Flusher: Flusher threads insert the data into the DB2 table. If you are using INGEST to load into a partitioned database, then a flusher thread is associated with each partition. There could be multiple flushers running to increase the throughput of the data ingest. The number of flushers for each partition is specified by the num_flushers_per_partition configuration parameter.

Like I said its completely designed to be run efficiently and give high availability.

Can it recover from a failure ?

Absolutely ! INGEST is capable of restarting from where it failed. It literally uses INSERT statements to load the data. With INGEST, you can specify dumpfile to place any rejected rows due to formatting error or invalid character, exception table to place any rows with constraint violations, warning counts to stop if the threshold is met. You can also specify commit count in rows or by time. This comes in handy when loading data continuously from a stream. If the INGEST utility gets aborted for any reasons (power failure or somebody accidentally killing the thread, you can restart the INGEST with RESTART option and DB2 will resume the INGEST from the last commit point. If do not want to restart INGEST you can issue the RESTART TERMINATE and DB2 will clean up the failed INGEST. Unlike with the TERMINATE parameter of the LOAD command, data already committed by the failed INGEST command is not removed from the table.

What all format can it handle ?

INGEST utlitiy supports the following input data formats:

Delimited text
Positional text and binary
Columns in various orders and formats

How about performance ?

Obviously this is one of the most sacred part for the DBA when choosing the option to load lot of data. The best performance is always LOAD, but it doesn’t give high availability. INGEST even though under the covers issues INSERTS, I have noticed that when loading hundred thousands of rows INGEST was able to perform 40% faster than traditional INSERT. The reason why I compared to INSERT, is because its the only option that gives me high availability.

How is it run ?
...continue reading on