IBM DB2® Information Integrator for Content provides an Information Mining service that converts information that is implicit in unstructured documents into valuable metadata. This article gives you an overview of how to optimize the Information Mining service for performance. It is organized according to the basic tasks that can be performed with Information Mining, which are:
- Automatically deriving metadata from text documents (text analysis)
- Storing this metadata in a repository (persistence)
- Retrieving data from the repository (advanced search)
The performance of document filtering is outside the scope of this document. In general, documents with a complex binary format such as Microsoft® Word or PDF are more difficult to process, and therefore preprocessing them takes more time than preprocessing simple text-based formats.
Reading and understanding this article requires at least a basic knowledge of information mining technology and concepts. An earlier article on the DB2 Developer Domain, "EIP Information Mining in a Nutshell," is a good place to start.
In general, the time it takes to perform the following functions grows linearly with the size of the document to be processed:
- Language identification
- Summarization
- Information extraction
- Categorization
- Adding a document to the set of documents to be clustered
The reason for this linear increase in processing time is that each of these functions involves traversing the document and analyzing its linguistic elements with different levels of detail. Language identification is a special case since its processing is restricted to the first 1024 bytes of the document (see footnote 1). The other document analysis functions always process the whole document regardless of its actual size.
Though analysis time will in the majority of cases not be a problem as it is very fast for small to medium size documents, it can be a significant factor when processing huge book-type documents in an interactive environment. One such example would be real-time categorization or summarization of manuals or reports with hundreds of pages.
The obvious solution in most cases would be to let a non-interactive application perform all required analysis functions in a separate step either during document import or as a regular batch task and store the results in a database, the so-called metadata store. Accessing the metadata associated with a document then amounts to a simple lookup of information from the metadata store.
If the batch analysis technique is not feasible for some reason (for example, summarization may need to be done on-line), then a valuable alternative is to restrict the processing to a portion of the document. The key question is how to identify a portion that still conveys enough of the important content. In-depth processing of the content to find the proper subset is not an option as this would lead us back to the performance problem we are trying to solve.
If the document to be processed already contains some sort of abstract or summary, restricting the analysis function to this part of the document may be a good choice. If this is not possible or obtaining it would be difficult or costly, you can try a simple and straightforward approach by selecting a fixed-length prefix from the document body. You can use the service API to do this by creating a new DKIKFTextDocument which contains a copy of the first X bytes of the original document's content and processing it instead of the original document.
When using this approach, we strongly recommend that you experiment with different prefix sizes based on a representative subset of the documents to be processed. Carefully compare the results obtained from the new 'pseudo document' with those of the full document to ensure they still meet your quality objectives.
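The prefix approach and the recommended comparison can be sketched as follows. This is a minimal illustration in Python, not the Java service API: `make_prefix_document` and `category_overlap` are hypothetical helper names, the content is assumed to be UTF-8 text, and the step that wraps the prefix in a new DKIKFTextDocument is left out because it depends on the API.

```python
from typing import Iterable


def make_prefix_document(content: bytes, max_bytes: int) -> str:
    """Return the text of the first max_bytes bytes of a UTF-8 document."""
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    prefix = content[:max_bytes]
    # errors="ignore" drops a multi-byte character that was cut in half at
    # the boundary (it would also silently drop any other invalid bytes).
    return prefix.decode("utf-8", errors="ignore")


def category_overlap(full: Iterable[str], prefix: Iterable[str]) -> float:
    """Jaccard overlap between the categories assigned to the full document
    and those assigned to the prefix (1.0 means identical results)."""
    full_set, prefix_set = set(full), set(prefix)
    if not full_set and not prefix_set:
        return 1.0
    return len(full_set & prefix_set) / len(full_set | prefix_set)
```

Running the analysis on several prefix sizes and tracking a score such as `category_overlap` over a representative sample is one way to pick the smallest prefix that still meets your quality objectives.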
If you're applying summarization to a large document along with other text analysis functions, the overall processing time can be reduced by running summarization first and applying any further text processing functions to the summary instead of applying it to the whole document. The exception would be language identification which may have already been run as a prerequisite of summarization. As with the prefix-based approach, a sample evaluation may be useful to check the acceptability of the result and find the proper value for the size of the summary.
It is, however, important to understand that the summarization function is not optimized for large volume multi-chapter documents covering multiple themes such as, for example, conference proceedings. In addition to the size of the document, the thematic consistency of its content is also an important factor when considering the speed versus quality tradeoff. Summarizing a well-selected subset of a multi-chapter document may be the preferred option for large documents that cover a variety of themes even if processing time would be acceptable. For categorization the entire document might still be the right choice to ensure all relevant categories are covered.
Special hints for specific functions
Categorization
Categorization is based on a two-step process:
- Step 1 builds a categorization model from sets of training documents which are assigned to categories in a taxonomy.
- Step 2 applies the categorization model to a document to assign it to a category or set of categories from this taxonomy.
Step 1 is performed with the help of the Information Structuring Tool (IST), which is a Web application. As a consequence, it is important to understand when and what amount of information is sent back and forth between the Information Mining server and the Web-based client to be able to use the IST in a way that minimizes traffic on the network. In general, the IST client tries to keep as much information as possible locally to minimize communication traffic. The following list provides hints that may be useful to optimize the way in which the IST works in your environment:
- When creating a big taxonomy (more than 10 categories, multi-level), do not create the categories manually using the IST GUI, but instead create a directory structure on the client where the directory names are the names of the categories and the directories contain the corresponding training documents. Then use the upload function of the IST to create the whole taxonomy in a single step.
- It is important to understand that the categorization service is optimized for fast execution of step 2 with the tradeoff that building the categorization model is a more expensive task. This is achieved by representing the logic that decides whether a given document belongs to a category in a format that allows efficient lookup and application. As a consequence, the evaluation function of the Information Structuring Tool may be somewhat expensive for large size taxonomies since it carries out the training task in an interactive environment (the IST).
When working with big taxonomies, the preferred approach to evaluation is to start with a taxonomy skeleton which only consists of the major categories, evaluate and refine it and add lower level categories when the results obtained with the skeleton are acceptable. This avoids the evaluation of large amounts of data (which may become a non-interactive process) and getting lost with details too early instead of focusing on the appropriateness of the overall structure first.
- If the time needed for interactive evaluation is an issue, it may be an option to reduce the number of iterations performed. Each iteration cycle takes approximately 80% of the time used for training (assuming all training documents are of similar size). Reducing the number of iterations may lower the accuracy of the evaluation result since the number of documents observed is smaller, but it takes only a fraction of the time. Whether the results are still representative mostly depends on the homogeneity of the training documents. If the training documents of a category are of a very heterogeneous nature, that is, they rely on different terminology, we recommend that you use no fewer than 3 to 5 iterations.
Even if a smaller number can be used, we recommend that you run a complete evaluation (5 iterations) from time to time (for example as an over-night job) to ensure evaluation results do not misrepresent reality.
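The effect of the iteration count on evaluation time can be estimated with simple arithmetic. The sketch below is only a rough model built on the 80% figure quoted above; the function name is hypothetical, and any overhead other than the iterations themselves is ignored.

```python
def estimated_evaluation_time(training_time: float, iterations: int,
                              cost_per_iteration: float = 0.8) -> float:
    """Estimate evaluation time as iterations * 80% of the training time.

    training_time can be in any unit; the result uses the same unit.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return training_time * cost_per_iteration * iterations
```

For example, if training takes 10 minutes, a complete evaluation with 5 iterations would take roughly 40 minutes under this model, while 3 iterations would take roughly 24 minutes.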
Clustering
To be able to interpret the performance behavior of the clustering function it may be useful to understand that clustering is based on an algorithm that iteratively applies a function to the feature space with the goal of detecting a (local) minimum. The number of iterations that take place until the minimum is found depends on a good selection of the start value. Since this start value is determined randomly, it is difficult to predict the exact duration of a clustering job. However, here are some facts that may be helpful:
- Clustering time scales linearly with the size of the document collection and quadratically with the number of clusters. In general, the number of clusters is an outcome of the clustering step, but there is a way to restrict the maximum number of result clusters. We recommend using this to ensure clustering does not take considerably longer than expected. If no limits are set, the only strict upper limit of the number of clusters is the size of the document collection to be clustered.
- Even small changes to the document collection (such as adding or removing a single document) may have a significant impact on performance since a modification of the feature space usually moves the local minimum which may imply a larger number of iterations. Though tests have shown that small changes tend to have a minor impact, you should be aware of the fact that this may actually happen.
- Clustering performance on the same data can vary between different platforms as the randomly selected start values are platform-dependent.
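The scaling rule from the first point above can be turned into a quick back-of-the-envelope estimate. This is a sketch with a hypothetical function name; it only models the stated linear and quadratic growth relative to a measured baseline run and cannot account for the variation caused by the random start value.

```python
def relative_clustering_cost(n_docs: int, max_clusters: int,
                             base_docs: int, base_clusters: int) -> float:
    """Expected cost relative to a baseline run.

    Linear in the number of documents, quadratic in the number of clusters.
    """
    if min(n_docs, max_clusters, base_docs, base_clusters) < 1:
        raise ValueError("all arguments must be positive")
    return (n_docs / base_docs) * (max_clusters / base_clusters) ** 2
```

Under this model, doubling both the collection size and the cluster limit (for example from 1000 documents and 10 clusters to 2000 documents and 20 clusters) gives an expected factor of 8, which shows why limiting the maximum number of clusters matters more than the collection size.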
Storing and retrieving metadata
Typical applications may want to use mining results in different ways without re-running the mining operations multiple times on the same documents. Therefore the results must be made persistent. If the content repository supports text search (such as in the case of DB2 Content Manager) the preferred option is to store all required metadata in this repository along with the original document and other metadata.
This may be especially useful in a distributed environment where documents are located close to the application while the administration database resides on a central server at a remote location. Storing metadata on a system that has a high-speed connection to the content repository (such as CM Object Server) ensures that metadata-based content retrieval works as fast as an ordinary query on the content server. However this approach requires that you build your own data model.
If storing the metadata in the content repository is not an option you may want to use the built-in metadata store provided by Information Mining. This metadata store is based on the IBM Content Manager programming model which allows the storage and retrieval of all persistent data, including:
- Document metadata
- Categories
- Training documents
- Catalogs
All data access uses the IBM Content Manager V8 Java™ Connector provided by IBM DB2 Information Integrator for Content. The model and user data are stored in the Information Integrator for Content administration database, which is a DB2 Universal Database™ database. That's why a lot of the performance considerations that will be discussed are directly related to database optimizations.
A benefit of using the built-in metadata store is that you can take advantage of the advanced search capabilities of Information Mining.
The advanced search looks for text in documents stored in the catalog restricted to particular categories.
Search configuration
The search configuration can be used to specify additional search properties which may have a great impact on search performance. The search configuration is accessible on the Service API level via the DKIKFSearchConfiguration class and on the Java Beans level via the CMBAdvancedSearchService bean. To optimize an advanced search query by tailoring the search configuration, follow these guidelines:
- Reduce the amount of data retrieved. You can specify the schema key values to be retrieved by setting the keys in the search configuration. This way only the metadata that are actually required by your application need to be fetched from the data store. You may know this projection feature from a standard SQL SELECT statement where you can restrict the output columns [1].
- Reduce the number of records retrieved by setting the maxResults parameter in the search configuration. This limits the maximum number of search results to the specified value.
- Adjust the size of the internal result buffer. Search results are retrieved from the database in chunks of a configurable size. When iterating over the search results and accessing the first record that has not yet been retrieved, another trip to the database is performed. Therefore if you provide for example a graphical user interface to display the search results, adjust the chunk size of the result buffer so that it fits into the display area of the result pane. This can be done by setting the resultBufferSize parameter in the search configuration.
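The interplay between the result limit and the buffer size described above can be illustrated with a small simulation. This is not the DKIKFSearchConfiguration API: `ChunkedResults` is a hypothetical stand-in that only counts how many simulated trips to the database are needed when results are fetched chunk by chunk.

```python
from typing import Iterator, Sequence


class ChunkedResults:
    """Hypothetical buffered result list that fetches rows in chunks."""

    def __init__(self, rows: Sequence[str], buffer_size: int,
                 max_results: int) -> None:
        if buffer_size < 1 or max_results < 0:
            raise ValueError("buffer_size must be >= 1, max_results >= 0")
        self._rows = rows[:max_results]   # effect of the result limit
        self._buffer_size = buffer_size   # effect of the buffer size
        self.round_trips = 0

    def __iter__(self) -> Iterator[str]:
        for start in range(0, len(self._rows), self._buffer_size):
            # Accessing the first row of a chunk that has not been
            # retrieved yet costs one simulated trip to the database.
            self.round_trips += 1
            yield from self._rows[start:start + self._buffer_size]
```

With 100 matching documents, a limit of 25 results, and a buffer of 10, iterating over all results costs three trips in this model; a buffer of 25, matching a result pane that shows 25 entries, needs only one.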
Advanced search query properties
This section describes guidelines for how to build advanced search queries:
- Avoid retrieving LOB attributes. When retrieving a record containing one or more LOB columns, such as SCHEMA_KEY_CONTENT, an additional trip to the database server is necessary for each LOB column. This will decrease performance significantly. Thus you should avoid retrieving these schema keys. (Also refer to 'reduce the amount of data retrieved' above.)
- Avoid using the '>=' operator for category searches. If you search for documents by category, avoid using the '>=' operator with high-level categories, that is, categories that have many child categories in the taxonomy tree. That's because the query internally has to be expanded to the complete sub-tree. Therefore try to use the equals ('=') operator exclusively if applicable.
- Use text-search queries. Generally you should have a database index for each column that appears in a SQL WHERE clause to avoid table scans when submitting a query against a database. According to the IBM Content Manager programming model each of the Information Mining schema keys maps to a database column. Because of the type and the size of the schema keys (please refer to [2]) DB2 indexes cannot be created (except for the integer and timestamp type). Therefore Information Mining advanced search was designed to perform best if a query contains at least one full-text search argument for the SCHEMA_KEY_CONTENT schema key. The underlying text search engine is tightly integrated into IBM DB2 UDB; thus the optimizer will optimize the complete generated SQL query.
- Avoid search patterns using the LIKE operator in your query. Such a pattern is mapped to the corresponding SQL LIKE predicate, which cannot take advantage of the text index and therefore slows down the query.
- Cache the taxonomy you're working with. Retrieving a complete taxonomy tree from the persistent store is a time-consuming task. Therefore keep the taxonomy handle in memory and re-use it as long as the taxonomy has not been changed. To ensure your taxonomy cache is up-to-date you need to check the timestamp of the taxonomy. For details see the Application Programming Guide (refer to DKIKFTaxonomy). Note that the timestamp is always read from the database so it always reflects the proper state of the taxonomy.
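The taxonomy caching hint can be sketched as follows. This is a minimal illustration, not the DKIKFTaxonomy API: the class name and the two callables are assumptions standing in for "read the taxonomy timestamp from the database" (cheap) and "retrieve the complete taxonomy tree" (expensive).

```python
from typing import Any, Callable


class TaxonomyCache:
    """Keep a taxonomy in memory and reload it only when its timestamp changes."""

    def __init__(self, read_timestamp: Callable[[], Any],
                 load_taxonomy: Callable[[], Any]) -> None:
        self._read_timestamp = read_timestamp  # cheap, always hits the store
        self._load_taxonomy = load_taxonomy    # expensive full retrieval
        self._loaded = False
        self._timestamp: Any = None
        self._taxonomy: Any = None
        self.loads = 0  # number of expensive retrievals performed

    def get(self) -> Any:
        current = self._read_timestamp()
        if not self._loaded or current != self._timestamp:
            self._taxonomy = self._load_taxonomy()
            self._timestamp = current
            self._loaded = True
            self.loads += 1
        return self._taxonomy
```

Each call to `get()` still costs one timestamp lookup, but the full tree is retrieved again only after the taxonomy has actually been modified.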
The Infomining.properties file, located in the <CMBROOT>\ikf\lib directory on Windows® platforms and in the corresponding directory on other platforms, contains some configurable values that can be modified in order to optimize for performance:
- Adjust CommitSize. In the 'General' section the CommitSize parameter value (default=50) determines the number of rows that are deleted in a single transaction. This value is internally used whenever a category sub-tree in a taxonomy is deleted or when cleaning up the system (refer also to EIP Information Mining in a Nutshell). Increasing the value might improve performance because the administrative overhead for starting and committing a transaction will be reduced.
Note that if you choose a large value for the CommitSize parameter you may have to increase the log file size (logfilsiz) parameter value and/or related parameter values in your database configuration in order to prevent the database log file from overflowing. Also note that cleaning up the system should be done at off-peak times.
- Adjust text index update frequency. You can change the default update frequency by modifying the entries in the Search_Index section. The names should be self-explanatory. Decreasing the update frequency will improve performance in two ways:
- First, the daemon process that triggers the incremental updates will spawn the update process less frequently (remember, there is one update process per text index, that is, per catalog). This reduces the system load in terms of processes started, index access, and locking on the database level.
- Second, inserting only a few new documents into an index is less efficient than a 'bulk' update where a lot of documents are added to the index at once.
Please also note that you have to modify the Search_Index values before you create a new catalog, otherwise the update frequency will not be affected [1]. For a detailed description please refer to [5] or [6].
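The trade-off behind CommitSize can be illustrated with a generic batch delete. The sketch below uses Python's built-in sqlite3 module and a hypothetical table named docs purely for illustration; it is not how Information Mining is implemented, and it only shows how the batch size determines the number of transactions.

```python
import sqlite3


def delete_in_batches(conn: sqlite3.Connection, commit_size: int) -> int:
    """Delete all rows from the docs table, committing after every
    commit_size rows. Returns the number of committed transactions."""
    if commit_size < 1:
        raise ValueError("commit_size must be at least 1")
    commits = 0
    while True:
        cursor = conn.execute(
            "DELETE FROM docs WHERE rowid IN "
            "(SELECT rowid FROM docs LIMIT ?)",
            (commit_size,),
        )
        if cursor.rowcount == 0:
            break  # nothing left to delete
        conn.commit()  # one transaction per batch
        commits += 1
    return commits
```

Deleting 120 rows with a batch size of 50 takes three transactions here, while a batch size of 200 takes one. Fewer, larger transactions reduce commit overhead but need more log space, which is the reason for the logfilsiz note above.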
Whenever Information Mining catalogs are created - and therefore new tables are generated - or when documents are imported, you should consider updating the DB2 system catalog tables containing the statistics about the number of rows in a table, the use of space by a table or index, and other similar information. This information is not dynamically kept current; therefore to have efficient access paths to your data you should regularly use the RUNSTATS utility and rebind the packages [4].
Assuming that you are logged on as user icmadmin with the password password and you are connected to the database icmnlsdb, the following commands generate a Windows command file that can be executed in order to update statistics for all tables in the icmnlsdb database (see footnote 2):
db2 -x "select 'db2 runstats on table ' concat rtrim(tabschema) concat '.' concat tabname concat ' with distribution and detailed indexes all ' from syscat.tables where tabschema='ICMADMIN' and type='T'" > runStatsAndRebindScript.cmd    [1]
echo db2 connect reset >> runStatsAndRebindScript.cmd    [2]
echo db2rbind icmnlsdb -l logfile all -u icmadmin -p password >> runStatsAndRebindScript.cmd    [3]
In the first line ([1]) we select from a system catalog table all table names that belong to the specified schema and generate the RUNSTATS command for each of them. The output is redirected to the command file. In the next line ([2]) we generate the command to explicitly disconnect from the database. In the last line ([3]) the command to rebind all packages in the database is created. A rebind is recommended after RUNSTATS. The resulting command file will look something like this:
db2 runstats on table ICMADMIN.ICMUT01000001 with distribution and detailed indexes all
db2 runstats on table ICMADMIN.ICMUT01001001 with distribution and detailed indexes all
...
db2 connect reset
db2rbind icmnlsdb -l logfile all -u icmadmin -p password
Running the generated command file will update the statistics for all tables with the given schema in the database. All tables of the schema are processed because the DB2 table names are unknown to Information Mining: they are created by Content Manager, and the table names do not map to the item type names.
Note that the command file does not reorganize the database tables. You may want to extend the command file to perform the REORG command before calling RUNSTATS.
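One way to extend the command file is to generate it programmatically and emit a REORG command before each RUNSTATS command. The following is a sketch under stated assumptions: `build_maintenance_script` is a hypothetical helper, the table names are assumed to have been read from syscat.tables beforehand, and the basic `db2 reorg table` form is used without any further options.

```python
from typing import Iterable, List


def build_maintenance_script(tables: Iterable[str], database: str,
                             user: str, password: str,
                             reorg: bool = False) -> List[str]:
    """Return the lines of a command file that optionally reorganizes each
    table, updates its statistics, and finally rebinds all packages."""
    lines: List[str] = []
    for table in tables:
        if reorg:
            lines.append(f"db2 reorg table {table}")
        lines.append(
            f"db2 runstats on table {table} "
            "with distribution and detailed indexes all"
        )
    lines.append("db2 connect reset")
    lines.append(f"db2rbind {database} -l logfile all -u {user} -p {password}")
    return lines
```

Writing the returned lines to a .cmd file, for example with `"\n".join(lines)`, yields a script of the same shape as the generated command file described above, with the additional REORG steps.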
Additionally, to optimize storage and increase performance, you should reorganize the text indexes periodically, and especially after large updates have been made to the indexes, for example when a lot of documents have been imported into a catalog. Information Mining provides a script in the directory <CMBROOT>\ikf for all platforms to run the index reorganization. This task is best started at off-peak times. To see whether an index reorganization is required you can query the text search engine catalog tables (see references [5] or [6] for details).
Performance considerations for distributed environments
If you want to repeatedly run a sequence of mining functions in a distributed environment, we strongly recommend using the ServerTask concept which helps to reduce network traffic significantly. The method runServerTask() of a class that implements the DKIKFServerTask interface can be used to bundle these steps. When executing the server task, the set of functions is executed on the server without the need for passing document content and analysis results back and forth between the client and server. For details see the section "Running a server task" in the chapter "Working with information mining" of the Information Integrator for Content Application Programming Guide.
This paper shows how the Information Mining service of IBM DB2 Information Integrator for Content can be used to get maximum performance in different application scenarios. The first part described how to optimize the performance of text mining functions when dealing with large documents or document collections, while the second part focused on optimizing the built-in metadata store. While there is no general tuning strategy, the hints we've talked about should serve as a guideline towards optimal performance for a specific usage scenario.
1. Please note that this emphasizes the importance of proper pre-processing of documents with proprietary formats, as in a worst-case situation this entire part may consist of data which is not considered part of the document content.
2. The authors and their employer disclaim all warranties with regard to this software, including all implied warranties of merchantability and fitness. In no event shall the authors or their employer be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software.
- [1] EIP Information Mining in a Nutshell, White Paper, Rolf Baurle, Matthias Tschaffler, IBM Corp. Dec 2002
- [2] IBM Content Manager / IBM Information Integrator for Content, Workstation Application Programming Guide, Version 8, Release 2 (SC27-1347-01)
- [3] Content Manager Version 8.2 Performance Tuning Guide (http://www.ibm.com/support)
- [4] IBM DB2 Universal Database, Administration Guide: Performance, Version 8 (SC09-4821-00)
- [5] DB2 Universal Database, Net Search Extender Administration and User's Guide, Version 8.1 (SH12-6740-00)
- [6] DB2 Universal Database, Text Information Extender Administration and User's Guide, Version 7.2 (SH12-6732-00)

Peter Gerstl joined IBM's development lab in Boeblingen, Germany in 1997, working in various project management, development and consulting roles for text mining, search and content management. His current work focuses on the systematic performance and quality analysis of text mining functions in IBM Information Integrator for Content.