IBM Content Analytics with Enterprise Search, Version 3.0.0

Configuring and deploying document clusters

To improve the relevancy of search results, you can configure IBM® Content Analytics with Enterprise Search to sample clusters of documents in the index, and then configure a collection to categorize documents based on analysis of words in the clusters.

Document categorization that is based on cluster analysis involves:

- Configuring the system to create clusters by sampling a subset of documents and extracting words. The result of this document clustering task is a cluster proposal, which consists of clusters that contain candidate words for classifying documents.
- Categorizing documents by deploying the cluster proposal. This process adds metadata to documents based on the cluster analysis and creates an internal knowledge base. The knowledge base is used to classify all documents in the index into rule-based categories.

When users query the collection, they can narrow the results to documents that were categorized when the cluster analysis was deployed. In addition, if conceptual search is enabled for the collection, users can search documents that conceptually match their query terms.
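
To make the two phases concrete, the following Python sketch models a cluster proposal as named clusters of candidate words, tags documents with the labels of matching clusters, and then narrows a result list by one of those labels. All names (`Cluster`, `Document`, `deploy_proposal`, `narrow_results`) and the simple word-overlap matching are illustrative assumptions. They are not product APIs, and the internal knowledge base and classification logic of the product are not described in this topic.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """One cluster in a hypothetical cluster proposal."""
    name: str
    words: set  # candidate words used to classify documents


@dataclass
class Document:
    doc_id: str
    text: str
    # Metadata that the (hypothetical) deployment step adds to the document.
    categories: list = field(default_factory=list)


def deploy_proposal(docs, proposal):
    """Tag each document with every cluster whose candidate words it contains.

    Plain word overlap is an assumed stand-in for the real classification step.
    """
    for doc in docs:
        tokens = set(doc.text.lower().split())
        doc.categories = [c.name for c in proposal if tokens & c.words]


def narrow_results(results, category):
    """Keep only the results that were categorized into the given cluster."""
    return [d for d in results if category in d.categories]


proposal = [
    Cluster("Billing", {"invoice", "payment", "refund"}),
    Cluster("Hardware", {"disk", "memory", "fan"}),
]
docs = [
    Document("d1", "Customer asked for a refund of the last invoice"),
    Document("d2", "Replaced the failed disk and a noisy fan"),
    Document("d3", "General greeting with no matching words"),
]

deploy_proposal(docs, proposal)
print([d.doc_id for d in narrow_results(docs, "Billing")])  # ['d1']
```

In this sketch, document `d3` matches no cluster and therefore carries no category metadata, so it can never appear in a narrowed result list.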

If you enable document clustering after the index is built, you must fully rebuild the index before document clustering takes effect. If the collection is configured to use the document cache, you can rebuild the index without recrawling or re-importing documents.

To categorize documents based on cluster analysis:

Expand the collection that you want to configure. If support for document clusters was not enabled when the collection was created, click Actions to edit the collection settings and enable document clustering.
In the Parse and Index pane, click Configure > Global processing > Document clusters.
Create a cluster proposal:
1. On the Document Clustering Tasks page, enter a descriptive name for the document clustering task.
2. Enter the number of clusters that you want the system to create by sampling documents. The default value is 100.
3. Enter the number of documents that you want the clustering engine to sample when extracting words and creating clusters. Documents are extracted from the index through random sampling. The default value is 5000.
4. Enter the number of documents to include from an extended set of samples. If this value is not set, all documents in the text index for the collection are included. This parameter is not available if you use the Latent Dirichlet Allocation (LDA) or K-means algorithms.
5. Select the cluster analysis algorithm that you want to apply. A detailed discussion of these algorithms and the differences between them is beyond the scope of this document. In summary:
  - If you select Latent Dirichlet Allocation - detect clusters by samples, learn by all, the LDA algorithm is applied to a set of sampled documents (the number of samples specified in the Number of samples field) to detect clusters. Then, a larger set of documents (the number of documents specified in the Number of clustered documents field) is classified by using the detected clusters, and a knowledge base is trained. This knowledge base is used to classify the entire collection of documents and to support conceptual search. With this algorithm, the knowledge base learns more words than when the LDA algorithm is applied on its own.
  - If you select Latent Dirichlet Allocation - detect clusters in partitions, learn by all, the LDA algorithm is applied to a set of sampled documents (the number of samples specified in the Number of samples field). Then, this base set of detected clusters is refined by using more documents (the number of documents specified in the Number of clustered documents field). With this algorithm, the clusters are refined by a larger number of documents, and the knowledge base learns more words than when the LDA algorithm is applied on its own.
    Restriction: If you configure document clustering for a collection that runs on an IBM InfoSphere BigInsights server, this is the only algorithm that you can apply.
6. Click Start to start the document clustering task and create the cluster proposal.
Optional: Add clusters and edit the content of clusters:
1. On the Document Clustering Tasks page, click Start for the task that you want to refine and run again.
2. On the Edit a Cluster Proposal page:
  - Click Add a Cluster to add a row to the list of clusters in the proposal. You can then specify additional words that you want to use for categorizing documents.
  - Add and remove candidate words. To remove a word from the cluster, select the word from the list and click Remove. To add a word to the cluster, type the word in the provided field and click Add a Word.
3. Click OK to apply your changes to the cluster proposal.
Optional: Rename clusters and remove clusters:
1. On the Document Clustering Tasks page, click Edit for the task that you want to modify.
2. On the Edit a Cluster Proposal page:
  - Change the names of any clusters that you want to rename.
  - Click Delete for any clusters that you want to remove from the cluster proposal.
3. Click OK to apply your changes to the cluster proposal.
Configure the cluster deployment:
1. In the Parse and Index pane, click Configure > Global processing > Cluster deployment.
2. Specify how you want to apply the cluster proposal to categorize documents in the index:
  - Enter a label for the category that is to be displayed in the facet navigation pane of the content analytics miner or enterprise search application.
  - Select the cluster proposal that you want to deploy.
  - Select the policy that you want to use for categorizing documents into clusters. You can apply the words in the top relevant cluster as metadata to documents in the index, apply words from all clusters with relevance scores that are above a threshold that you specify, or apply words from the top relevant cluster only if its relevance score is above a threshold that you specify. If a document does not meet the specified criteria, it is not categorized.
Optional: Add the cluster scores to the index so that they can be used for category-based ranking or conceptual search. In the Search pane or Analytics pane for your collection, click Configure > Search server options and ensure that the options for category scores to influence the search results are enabled.
Start the cluster deployment task to categorize documents in the index. Expand Global Processes in the Parse and Index pane, and then click Start to start the Cluster deployment task. The progress of the task is displayed.
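
The three categorization policies that are described in the cluster deployment step can be sketched in Python as follows. This is a minimal illustration that assumes each document already has a relevance score for each cluster. The function name `select_clusters`, the `Policy` identifiers, and the score values are hypothetical and do not correspond to product settings or APIs.

```python
from enum import Enum


class Policy(Enum):
    TOP = "top relevant cluster"
    ABOVE_THRESHOLD = "all clusters above the threshold"
    TOP_ABOVE_THRESHOLD = "top relevant cluster, if above the threshold"


def select_clusters(scores, policy, threshold=0.0):
    """Return the names of the clusters whose words would be applied to a document.

    scores maps a cluster name to an assumed relevance score (higher is better).
    An empty list means that the document is not categorized.
    """
    if not scores:
        return []
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_name, top_score = ranked[0]
    if policy is Policy.TOP:
        return [top_name]
    if policy is Policy.ABOVE_THRESHOLD:
        return [name for name, score in ranked if score > threshold]
    if policy is Policy.TOP_ABOVE_THRESHOLD:
        return [top_name] if top_score > threshold else []
    raise ValueError(f"unknown policy: {policy}")


scores = {"Billing": 0.72, "Hardware": 0.41, "Shipping": 0.08}
print(select_clusters(scores, Policy.TOP))                                # ['Billing']
print(select_clusters(scores, Policy.ABOVE_THRESHOLD, threshold=0.4))     # ['Billing', 'Hardware']
print(select_clusters(scores, Policy.TOP_ABOVE_THRESHOLD, threshold=0.8)) # []
```

The last call returns an empty list because the best score does not exceed the threshold, which mirrors the case where a document does not meet the criteria and is left uncategorized.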
