Bulk Import Tool overview

Bulk Import Tool imports documents in bulk into an object store. The documents are usually in the form of TIFF images and are generated by an external process.

The documents are imported as a batch process. The location of the documents to be imported is specified in the transact.dat file. They can be in the same directory as the transact.dat file, in a subdirectory, or in a different location.

The Bulk Import Tool is multi-threaded, so it can process multiple batches simultaneously, with each thread processing one batch at a time. The number of threads is three by default and can be set to another value by using the -CE switch. If you run a single batch of 100,000 items, one thread processes all of the items in that batch while the other two threads sit idle.
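The effect of batch size on thread use can be illustrated with a rough model. The following Python sketch is not part of the Bulk Import Tool, and the function name estimated_makespan is hypothetical. It assumes that each batch is handled by exactly one thread, that every item costs the same, and that a waiting batch goes to the least-loaded thread. The tool's actual scheduling is not described here and might differ.

```python
import heapq


def estimated_makespan(batch_sizes, threads=3):
    """Estimate the work, in items, done by the busiest thread.

    Simplified model (an assumption, not the tool's documented behavior):
    a batch is never split across threads, and each batch is assigned
    to whichever thread currently has the least work.
    """
    # Min-heap of (items assigned so far, thread index).
    loads = [(0, i) for i in range(threads)]
    heapq.heapify(loads)

    # Assign the largest batches first to the least-loaded thread.
    for size in sorted(batch_sizes, reverse=True):
        load, idx = heapq.heappop(loads)
        heapq.heappush(loads, (load + size, idx))

    # Report per-thread load in thread-index order.
    per_thread = [load for load, _ in sorted(loads, key=lambda p: p[1])]
    return max(per_thread), per_thread


if __name__ == "__main__":
    # One batch of 100,000 items: a single thread does all of the work.
    print(estimated_makespan([100_000]))      # (100000, [100000, 0, 0])
    # The same items as ten batches of 10,000: work spreads across threads.
    print(estimated_makespan([10_000] * 10))  # (40000, [40000, 30000, 30000])
```

Under these assumptions, splitting the 100,000 items into ten batches reduces the busiest thread's share from 100,000 items to 40,000. Real gains depend on the system.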

The location of the transact.dat file is specified in the batchname.eob file, which is in the working directory. The transact.dat file tells the Bulk Import Tool how to process the batches of documents and describes the makeup of each document. For each document, it provides a class code that associates the document with a document class, the property values to assign to the document, and content description paths, such as the location of the document content. The Bulk Import Tool uses all of this information in its operations, and the same batch can contain definitions for different document classes.

Important: You can use the same batchname.eob file for multiple runs, but running batches with unique batch names is recommended. If you run Bulk Import Tool with a batchname.eob file that was used in a previously completed run and the batchname.rpt file from that run remains in the working directory, Bulk Import Tool does not remove the previously imported documents from the Content Platform Engine system. If you remove the .rpt file before you run the batch with the same batchname.eob, the .pass, .confirm, and .err files are still overwritten in the external locations.

If a batchname.lck file exists and a new batchname.eob file is added with the same batch name, the .lck file causes the previous batch to be reprocessed, followed by processing of the new batch that is specified in the batchname.eob file. Ensure that previously incomplete batches are properly recovered before you introduce new batches that use the same name.
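A check for leftover files before a batch name is reused can be sketched as follows. This Python helper is not part of the Bulk Import Tool. The function name batch_name_conflicts is hypothetical, and the sketch assumes that the .lck file, like the .rpt file, is in the working directory.

```python
from pathlib import Path


def batch_name_conflicts(working_dir, batch_name):
    """Return the names of leftover files that share batch_name.

    Assumption: both the .lck and .rpt files for a batch are in the
    working directory. An empty list means neither file was found.
    """
    work = Path(working_dir)
    leftovers = []
    for ext in (".lck", ".rpt"):
        candidate = work / f"{batch_name}{ext}"
        if candidate.exists():
            leftovers.append(candidate.name)
    return leftovers


if __name__ == "__main__":
    # Hypothetical working directory and batch name, for illustration only.
    found = batch_name_conflicts("/path/to/working_dir", "batch001")
    if found:
        print("Resolve these files before reusing the name:", found)
    else:
        print("No leftover .lck or .rpt file found for this name.")
```

A non-empty result suggests either recovering the earlier batch first or choosing a unique batch name.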

To increase throughput, you can adjust some of the system settings. Each system has unique attributes, so a particular group of settings might not work for every system. The following approaches can increase throughput:
  • Using smaller batch sizes and increasing the number of threads that each instance of the Bulk Import Tool processes.
  • Running multiple instances of the Bulk Import Tool on separate machines, JVMs, or CPUs to reduce the risk of a system bottleneck.
  • Using a shared working directory, such as in a cluster, to have one location in which to place batches. All instances of the Bulk Import Tool would poll that location for work.