To improve the relevancy of search results, you can configure IBM® Content
Analytics with Enterprise Search to sample clusters of documents
in the index, and then configure a collection to categorize documents
based on analysis of words in the clusters.
Document categorization that is based on cluster analysis
involves two stages:
- Configuring the system to create clusters by sampling a subset
of documents and extracting words. The result of this document clustering
task is a cluster proposal, which consists of clusters
that contain candidate words for classifying documents.
- Categorizing documents by deploying the cluster proposal. This
process adds metadata to documents based on the cluster analysis and
creates an internal knowledge base. The knowledge base is used to
classify all documents in the index into rule-based categories.
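As a rough illustration of the first stage, the following Python sketch samples a subset of documents at random and collects the most frequent words as candidate words. It is not the product's clustering engine: the function name `sample_candidate_words`, the whitespace tokenizer, and the frequency-based word selection are assumptions made for this example, and no clustering algorithm is applied.

```python
import random
from collections import Counter


def sample_candidate_words(documents, sample_size, top_n, seed=None):
    """Randomly sample documents and return their most frequent words.

    Hypothetical helper for illustration only; the product's clustering
    engine is not exposed through this function.
    """
    rng = random.Random(seed)
    # Never ask for more documents than exist.
    k = min(sample_size, len(documents))
    sample = rng.sample(documents, k)

    counts = Counter()
    for text in sample:
        # Naive tokenizer (assumption): lowercase and split on whitespace.
        counts.update(text.lower().split())

    return [word for word, _ in counts.most_common(top_n)]


docs = [
    "engine noise after cold start",
    "engine stalls at idle",
    "brake pedal feels soft",
    "brake noise when stopping",
]
candidates = sample_candidate_words(docs, sample_size=4, top_n=2, seed=0)
print(candidates)
```

In this sketch the sample size plays the role of the number of documents to sample, and the returned list stands in for the candidate words of a single cluster. A real cluster proposal groups the words into multiple clusters.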
When users query the collection, they can narrow the results
to documents that were categorized when the cluster analysis was deployed.
In addition, if conceptual search is enabled for the collection, users
can search documents that conceptually match their query terms.
If you enable document clustering after an index is built, you must
fully rebuild the index before document clustering takes effect. If the
collection is configured to use the document cache, you can rebuild
the index without recrawling or reimporting documents.
To categorize documents based on cluster analysis:
- Expand the collection that you want to configure. If support
for document clusters was not enabled when the collection was created,
click Actions to edit the collection settings
and enable document clustering.
- In the Parse and Index pane, click .
- Create a cluster proposal:
- On the Document Clustering Tasks page,
enter a descriptive name for the document clustering task.
- Enter the number of clusters that you want the system
to create by sampling documents. The default value is 100.
- Enter the number of documents that you want the clustering
engine to sample when extracting words and creating clusters.
Documents are extracted from the index through random sampling.
The default value is 5000.
- Enter the number of documents to include from an extended
set of samples. If this value is not set, all documents in the text
index for the collection are included. This parameter
is not available if you use the Latent Dirichlet Allocation (LDA)
or K-means algorithms.
- Select the cluster analysis algorithm that you want
to apply. A detailed discussion of these algorithms and
the differences between them is beyond the scope of this document.
- Click Start to start the document
clustering task and create the cluster proposal.
- Optional: Add clusters and edit the content
of clusters:
- On the Document Clustering Tasks page,
click Start for the task that you want to refine
and run again.
- On the Edit a Cluster Proposal page:
- Click Add a Cluster to add a row to the
list of clusters in the proposal. You can then specify additional
words that you want to use for categorizing documents.
- Add and remove candidate words. To remove a word from the cluster,
select the word from the list and click Remove.
To add a word to the cluster, type the word in the provided field
and click Add a Word.
- Click OK to apply your changes
to the cluster proposal.
- Optional: Rename clusters and remove clusters:
- On the Document Clustering Tasks page,
click Edit for the task that you want to modify.
- On the Edit a Cluster Proposal page:
- Change the names of any clusters that you want to rename.
- Click Delete for any clusters that you
want to remove from the cluster proposal.
- Click OK to apply your changes
to the cluster proposal.
- Configure the cluster deployment:
- In the Parse and Index pane, click .
- Specify how you want to apply the cluster proposal to
categorize documents in the index:
- Enter a label for the category that is to be displayed in the
facet navigation pane of the content analytics miner or
enterprise search application.
- Select the cluster proposal that you want to deploy.
- Select the policy that you want to use for categorizing documents
into clusters. You can apply the words in the top relevant cluster
as metadata to documents in the index, apply words from clusters with
relevance scores that are above a threshold that you specify, or apply
words from the top relevant cluster that is above a threshold that
you specify. If a document does not meet the specified criteria, it
is not categorized.
- Optional: Add the cluster scores to the index
so that they can be used for category-based ranking or conceptual
search. In the Search pane or Analytics pane
for your collection, click and ensure
that the options for category scores to influence the search results
are enabled.
- Start the cluster deployment task to categorize documents
in the index. Expand Global Processes in the Parse
and Index pane, and then click Start to
start the Cluster deployment task. The
progress of the task is displayed.
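The three categorization policies described in the cluster deployment step can be sketched in Python as follows. This is a minimal illustration, not the product's implementation: the function name `categorize`, the policy labels, and the assumption that relevance scores fall between 0.0 and 1.0 are all hypothetical.

```python
def categorize(scores, policy, threshold=0.0):
    """Return the cluster names assigned to one document.

    scores: mapping of cluster name to relevance score (assumed 0.0 to 1.0).
    policy: "top", "above_threshold", or "top_above_threshold"
            (hypothetical labels, not the product's option names).
    """
    if not scores:
        return []
    top_name = max(scores, key=scores.get)

    if policy == "top":
        # Always apply the single most relevant cluster.
        return [top_name]
    if policy == "above_threshold":
        # Apply every cluster whose score exceeds the threshold.
        return [name for name, s in scores.items() if s > threshold]
    if policy == "top_above_threshold":
        # Apply the most relevant cluster only if it exceeds the threshold.
        return [top_name] if scores[top_name] > threshold else []
    raise ValueError(f"unknown policy: {policy}")


doc_scores = {"engine": 0.72, "brakes": 0.55, "interior": 0.10}
print(categorize(doc_scores, "top"))
print(categorize(doc_scores, "above_threshold", threshold=0.5))
print(categorize(doc_scores, "top_above_threshold", threshold=0.8))
```

An empty result corresponds to a document that does not meet the specified criteria and is therefore not categorized.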