Component overview

The system components collect data from throughout your enterprise; parse, analyze, and extract meaning from the information; and create a text index that users can query.

A collection represents the set of sources that users can search and mine with a single query. When you create a collection, you specify which sources you want to include and configure options for how users can query the indexed data.

You can create multiple collections, and each collection can contain data from various data sources. For example, you might create a collection that includes documents from IBM® DB2® and IBM Content Manager Enterprise Edition databases, or a collection that includes documents from IBM FileNet® P8 object stores and Microsoft SharePoint repositories. When users query a collection, the results can include documents from any of the data sources in that collection.
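As a rough illustration of this idea (not the product's actual API or data model), the following Python sketch treats a collection as a set of documents tagged with the data source they came from. The class names, the `add` and `query` methods, and the sample documents are all hypothetical and exist only to show that a single query can return results from more than one source.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    source: str  # name of the data source the document came from
    title: str
    text: str


@dataclass
class Collection:
    """Hypothetical stand-in for a collection that spans several data sources."""
    name: str
    documents: list[Document] = field(default_factory=list)

    def add(self, source: str, title: str, text: str) -> None:
        self.documents.append(Document(source, title, text))

    def query(self, term: str) -> list[Document]:
        # Naive case-insensitive substring match; a real system queries an index.
        term = term.lower()
        return [d for d in self.documents if term in d.text.lower()]


collection = Collection("contracts")
collection.add("DB2", "Invoice 1", "Payment terms for the supplier contract")
collection.add("SharePoint", "Memo", "Supplier contract renewal notes")
collection.add("SharePoint", "Lunch menu", "Soup and salad")

hits = collection.query("contract")
sources = sorted({d.source for d in hits})
print(sources)
```

In this sketch the one query for "contract" matches documents that originated in both sources, which mirrors the behavior described above.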

To help you get started quickly, Watson Explorer Content Analytics provides several feature packages. Each package contains predefined configuration settings, typically designed for a specific purpose or industry. When you create a collection by selecting a feature package, the settings are automatically applied. If the package includes sample data, the parsing and indexing processes begin automatically.

The type of collection that you create determines which functions are available for configuring the collection:
Enterprise search collection
These collections support search and retrieval functions, including the ability to browse and narrow results by selecting facets, sort documents by relevance or date, preview documents in the search results, and view thumbnail images of certain types of documents. You can choose to enable some analytic features for searching these collections, such as the ability to see correlation scores and how results flow along a timeline chart.
Content analytics collection
In addition to search, these collections support content mining functions, such as the ability to explore correlations, deviations, and trends in your data. You can also export analysis results to data warehouse or business intelligence applications, and generate reports that can be saved in comma-separated value (CSV) format or opened with IBM Cognos® Business Intelligence tools.

Creating and administering a collection involves the following activities:
Collecting data
The crawler components collect documents from data sources, either on a continual basis or according to a schedule that you specify. Frequent crawling helps keep the index current so that users have access to the latest information. In addition to crawling data sources, you can add content to a collection by importing CSV files.
Analyzing data
The analytics pipeline extracts text from documents, performs linguistic analysis, finds meaningful words and phrases, extracts entities, and performs custom analysis on each document. The detailed content analysis provides facets of data that can be used for exploring the content.
Indexing data
The index components add data from new and changed documents to the index. The index components also perform global analysis of the documents in a collection to determine correlation scoring or to detect duplicate and near-duplicate documents. In a content analytics collection, a separate index can be created for facets. You can also create an overlay index to exclude words and phrases from the search results.
Searching and mining content
An enterprise search application provides an interactive graphical interface for finding and retrieving specific documents. The content analytics miner provides an interactive graphical interface for exploring analyzed content to discover relationships and anomalies.
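
The four activities above can be pictured as a small pipeline. The following Python sketch is a toy model under loud assumptions: it is not how the product is implemented, and every function name (`collect`, `analyze`, `build_index`, `search`) and the sample CSV data are invented for illustration. It reads documents from CSV text, splits each into word tokens, builds an inverted index while skipping exact duplicates, and then answers a single-term query.

```python
import csv
import io
import re
from collections import defaultdict

# Invented sample data; document 3 repeats document 1 on purpose.
CSV_DATA = """id,text
1,Brake noise reported after service
2,Customer praised the quick service
3,Brake noise reported after service
"""


def collect(csv_text):
    """Collect: read documents from CSV text (stand-in for a crawler or CSV import)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def analyze(doc):
    """Analyze: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", doc["text"].lower())


def build_index(docs):
    """Index: map each token to document ids, skipping exact duplicate texts."""
    index = defaultdict(set)
    seen = set()
    duplicates = []
    for doc in docs:
        tokens = analyze(doc)
        key = tuple(tokens)
        if key in seen:
            duplicates.append(doc["id"])
            continue
        seen.add(key)
        for token in tokens:
            index[token].add(doc["id"])
    return index, duplicates


def search(index, term):
    """Search: return the sorted ids of documents that contain the term."""
    return sorted(index.get(term.lower(), set()))


docs = collect(CSV_DATA)
index, duplicates = build_index(docs)
print(search(index, "service"))
print(duplicates)
```

Real linguistic analysis, entity extraction, facet generation, and near-duplicate detection are far more involved; the sketch only shows how the output of each stage feeds the next.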