Bulk Import Tool overview
The Bulk Import Tool imports documents in bulk into an object store. The documents are usually TIFF images that are generated by an external process.
The documents are imported as a batch process. The location of the documents to be imported is specified in the transact.dat file. They can be in the same directory as the transact.dat file, in a subdirectory, or in a different location.
The Bulk Import Tool is multi-threaded, so it can process several batches simultaneously, with each thread working on its own batch. The tool uses three threads by default; you can set a different number by using the -CE switch. A batch is never split across threads: if you run a single batch of 100,000 items, one thread processes all of the items in that batch, and the other two threads sit idle while the batch is processed.
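The effect of batch size on thread utilization can be illustrated with a small model. The following is a minimal sketch, not the tool's actual scheduler: it assumes every item costs the same to import and that each new batch goes to whichever thread has the least work so far. The function name `simulate_makespan` is hypothetical.

```python
import heapq


def simulate_makespan(batch_sizes, threads=3):
    """Return the largest per-thread item count when each batch
    is processed whole by a single thread.

    Assumption: batches are handed out greedily to the least-loaded
    thread. This is an illustrative model only.
    """
    # Each heap entry is the number of items already assigned to one thread.
    loads = [0] * threads
    heapq.heapify(loads)
    for size in batch_sizes:
        # Give the next batch to the thread with the least work so far.
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + size)
    return max(loads)


# One batch of 100,000 items: a single thread does all of the work.
print(simulate_makespan([100_000]))        # 100000
# Ten batches of 10,000 items: the work is spread over all three threads.
print(simulate_makespan([10_000] * 10))    # 40000
```

Under these assumptions, the busiest thread handles 100,000 items in the single-batch case but only 40,000 when the same items are split into ten batches.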
The location of the transact.dat file is specified in the batchname.eob file, which is in the working directory. The transact.dat file tells the Bulk Import Tool how to process the batches of documents. It describes the makeup of each document and provides a class code, which associates the document with a document class; the property values to assign to the document; and content description paths, such as the location of the document content. All of this information is used in Bulk Import Tool operations and can include definitions for different document classes in the same batchname file.
If a batchname.lck file exists and a new batchname.eob file with the same batch name is added, the .lck file causes the previous batch to be reprocessed before the batch that is specified by the new batchname.eob file is processed. Ensure that previously incomplete batches are properly recovered before you introduce new batches that use the same name.
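A script that places batches in the working directory could guard against this situation by checking for a leftover lock file first. The following is a minimal sketch under that assumption; the function name `safe_to_submit` and its parameters are hypothetical and are not part of the Bulk Import Tool.

```python
from pathlib import Path


def safe_to_submit(working_dir, batch_name):
    """Return True when no <batch_name>.lck file exists in working_dir.

    A leftover .lck file means an earlier batch with the same name
    would be reprocessed before the new one.
    """
    lock_file = Path(working_dir) / f"{batch_name}.lck"
    return not lock_file.exists()
```

A caller would place the new batchname.eob file only when the function returns True, and would otherwise recover the earlier batch or pick a different batch name.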
You can improve throughput by:
- Using smaller batch sizes and increasing the number of threads that each instance of the Bulk Import Tool processes.
- Running multiple instances of the Bulk Import Tool on separate machines, JVMs, or CPUs to reduce the risk of a system bottleneck.
- Using a shared working directory, such as in a cluster, to have one location in which to place batches. All instances of the Bulk Import Tool would poll that location for work.
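The first suggestion, using smaller batches, can be applied before any batch files are written. The following is a minimal sketch that splits a list of items into fixed-size groups; the function name `split_into_batches` is hypothetical, and writing each group to its own transact.dat and batchname.eob files is not shown because their format is not described here.

```python
def split_into_batches(items, batch_size):
    """Split items into consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# Hypothetical document file names, for illustration only.
documents = [f"doc{n:04d}.tif" for n in range(10)]
for number, batch in enumerate(split_into_batches(documents, 4), start=1):
    print(number, len(batch))
```

With ten documents and a batch size of four, this produces three batches of four, four, and two documents.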