Analyzing clusters

You can build and explore concept clusters in the Clusters view (View > Clusters). A cluster is a grouping of related concepts generated by clustering algorithms based on how often these concepts occur in the document/record set and how often they appear together in the same document, which is known as cooccurrence. Each concept in a cluster cooccurs with at least one other concept in the cluster. The goal of clusters is to group concepts that cooccur, whereas the goal of categories is to group documents or records based on how the text they contain matches the descriptors (concepts, rules, patterns) for each category.

A good cluster is one whose concepts are strongly linked and cooccur frequently, and that has few links to concepts in other clusters. When working with larger datasets, clustering may result in significantly longer processing times.

Clustering begins by analyzing a set of concepts and looking for concepts that cooccur often in documents. Two concepts that cooccur in a document are considered to be a concept pair. Next, the clustering process assesses the similarity value of each concept pair by comparing the number of documents in which the pair occurs together to the number of documents in which each concept of the pair occurs. See the topic Calculating Similarity Link Values for more information.
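
The counting step can be illustrated with a minimal Python sketch. It is not the product's implementation. The function name pair_similarities and the sample documents are hypothetical, and the ratio used here (squared pair count divided by the product of the individual concept counts) is an assumption chosen for illustration; the formula that the product uses is described in the topic Calculating Similarity Link Values.

```python
from collections import Counter
from itertools import combinations


def pair_similarities(documents):
    """Return {(concept_a, concept_b): similarity} for cooccurring concepts.

    documents: iterable of sets, each holding the concepts found in one document.
    """
    concept_docs = Counter()  # number of documents that contain each concept
    pair_docs = Counter()     # number of documents that contain both concepts
    for concepts in documents:
        unique = sorted(set(concepts))  # sorted so each pair has one key
        concept_docs.update(unique)
        pair_docs.update(combinations(unique, 2))
    # Assumed similarity: (documents with both)^2 / (docs with a * docs with b)
    return {
        (a, b): n_ab ** 2 / (concept_docs[a] * concept_docs[b])
        for (a, b), n_ab in pair_docs.items()
    }


# Hypothetical sample data: four documents and their extracted concepts
docs = [
    {"price", "quality"},
    {"price", "quality", "delivery"},
    {"price"},
    {"delivery"},
]
sims = pair_similarities(docs)
```

In this sample, price and quality occur together in two documents, while price occurs in three and quality in two, so the pair receives a higher value than delivery and price, which occur together only once.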

Lastly, the clustering process groups similar concepts into clusters by aggregation and takes into account their link values and the settings defined in the Build Clusters dialog box. By aggregation, we mean that concepts are added or smaller clusters are merged into a larger cluster until the cluster is saturated. A cluster is saturated when additional merging of concepts or smaller clusters would cause the cluster to exceed the settings in the Build Clusters dialog box (number of concepts, internal links, or external links). A cluster takes the name of the concept within the cluster that has the highest overall number of links to other concepts within the cluster.
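
As a rough illustration of aggregation and saturation, the following Python sketch merges concepts along their strongest links first and refuses any merge that would exceed a size limit. It is a simplified assumption, not the product's algorithm: the function build_clusters, the greedy ordering, and the sample link values are hypothetical, and only a maximum number of concepts is modeled (the internal and external link settings are ignored). Ties in naming are broken alphabetically, which is also an assumption.

```python
from collections import Counter


def build_clusters(links, max_concepts):
    """Greedy sketch. links: {(a, b): link_value}. Returns {name: set_of_concepts}."""
    cluster_of = {}  # concept -> the set object (cluster) it currently belongs to
    # Visit the strongest links first
    for (a, b), _ in sorted(links.items(), key=lambda kv: kv[1], reverse=True):
        ca = cluster_of.get(a, {a})
        cb = cluster_of.get(b, {b})
        if ca is cb:
            continue  # both concepts are already in the same cluster
        if len(ca) + len(cb) > max_concepts:
            continue  # merging would exceed the limit: treat as saturated
        merged = ca | cb
        for concept in merged:
            cluster_of[concept] = merged

    # Collect each distinct cluster once (several concepts share one set object)
    distinct = {id(members): members for members in cluster_of.values()}.values()

    named = {}
    for members in distinct:
        internal = Counter()  # links to other concepts inside the same cluster
        for a, b in links:
            if a in members and b in members:
                internal[a] += 1
                internal[b] += 1
        # Name the cluster after the concept with the most internal links
        name = max(sorted(members), key=lambda c: internal[c])
        named[name] = members
    return named


# Hypothetical link values between concept pairs
links = {
    ("price", "quality"): 0.67,
    ("delivery", "shipping"): 0.50,
    ("price", "value"): 0.40,
    ("delivery", "quality"): 0.25,
    ("delivery", "price"): 0.17,
}
clusters = build_clusters(links, max_concepts=3)
```

With a limit of three concepts, the link between delivery and quality does not merge the two clusters, because the merged cluster would contain five concepts.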

In the end, not all concept pairs end up together in the same cluster since there may be a stronger link in another cluster or saturation may prevent the merging of the clusters in which they occur. For this reason, there are both internal and external links.

  • Internal links are links between concept pairs within a cluster. Not all concepts are linked to each other in a cluster. However, each concept is linked to at least one other concept inside the cluster.
  • External links are links between concept pairs in separate clusters (a concept within one cluster and a concept outside in another cluster).
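
The distinction between the two kinds of links can be sketched in Python as follows. The function classify_links, the cluster assignment, and the list of linked pairs are hypothetical sample data used only for illustration.

```python
def classify_links(links, cluster_of):
    """Split linked concept pairs into internal and external links.

    links: iterable of (concept_a, concept_b) pairs.
    cluster_of: dict that maps each clustered concept to its cluster name.
    """
    internal, external = [], []
    for a, b in links:
        ca, cb = cluster_of.get(a), cluster_of.get(b)
        if ca is None or cb is None:
            continue  # at least one concept does not belong to any cluster
        if ca == cb:
            internal.append((a, b))  # both concepts are in the same cluster
        else:
            external.append((a, b))  # the concepts are in separate clusters
    return internal, external


# Hypothetical assignment of concepts to two clusters
cluster_of = {
    "price": "price", "quality": "price", "value": "price",
    "delivery": "delivery", "shipping": "delivery",
}
linked_pairs = [
    ("price", "quality"),
    ("delivery", "shipping"),
    ("price", "value"),
    ("delivery", "quality"),
    ("delivery", "price"),
]
internal_links, external_links = classify_links(linked_pairs, cluster_of)
```

Note that quality and value are in the same cluster without being linked to each other; each is linked to price, which is enough to satisfy the rule that every concept is linked to at least one other concept in its cluster.
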
Figure 1. Clusters view

The Clusters view is organized into three panes, each of which can be hidden or shown by selecting its name from the View menu:

  • Clusters pane. You can build and manage your clusters in this pane. See the topic Exploring Clusters for more information.
  • Visualization pane. You can visually explore your clusters and how they interact in this pane. See the topic Cluster Graphs for more information.
  • Data pane. You can explore and review the text contained within documents and records that correspond to selections in the Cluster Definitions dialog box. See the topic Cluster Definitions for more information.