Open, scalable analytics pipeline

As content is crawled, documents and records are fed into the analytics pipeline of Watson Content Analytics.

Each item is converted to text and sent through a series of processing and analysis steps. Each step annotates the content item with additional information, and together the steps clean the item, clarify it, and extract meaning from it.
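The flow described above can be pictured as a list of steps applied in order to one content item, with each step either cleaning the text or attaching annotations to it. The following Python sketch is illustrative only and is not the Watson Content Analytics or UIMA API; the names ContentItem, Annotation, normalize_whitespace, tokenize, and run_pipeline are hypothetical.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Annotation:
    kind: str        # annotation type, for example "token"
    begin: int       # start offset in the item text
    end: int         # end offset in the item text (exclusive)
    value: str = ""  # covered text or a derived value


@dataclass
class ContentItem:
    text: str
    annotations: List[Annotation] = field(default_factory=list)


# A step is any callable that updates a content item in place.
Step = Callable[[ContentItem], None]


def normalize_whitespace(item: ContentItem) -> None:
    """Cleaning step: collapse whitespace runs before any offsets are recorded."""
    item.text = " ".join(item.text.split())


def tokenize(item: ContentItem) -> None:
    """Analysis step: add one "token" annotation per word."""
    for match in re.finditer(r"\w+", item.text):
        item.annotations.append(
            Annotation("token", match.start(), match.end(), match.group())
        )


def run_pipeline(item: ContentItem, steps: List[Step]) -> ContentItem:
    """Apply each step in order; later steps see the results of earlier ones."""
    for step in steps:
        step(item)
    return item


item = run_pipeline(
    ContentItem("  The   printer jams \n often. "),
    [normalize_whitespace, tokenize],
)
tokens = [a.value for a in item.annotations if a.kind == "token"]
```

The order of the steps matters in this sketch: the cleaning step runs first so that the offsets recorded by the tokenizer refer to the final text.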

As provided, Watson Content Analytics includes a set of annotators that annotate content and extract meaning from it. These annotators:

Detect the source language and character encoding of the content
Tokenize the text into words
Identify the parts of speech of different words
Find meaningful word phrases, such as noun phrases or adjective-noun pairs
Extract named entities from the text automatically, such as people, locations, and organizations
Automatically categorize and classify the content items through IBM® Content Classification
Find custom patterns in the text through customer-defined regular expressions
Search for relevant custom-defined dictionary terms, such as product names or brands
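The last two capabilities in the list, custom regular expressions and custom dictionaries, can be sketched in a few lines of Python. This is a conceptual illustration using the standard re module, not the way the product is configured; the function names, the order_id pattern, and the "Acme Router" product name are all made up for the example.

```python
import re


def find_patterns(text, patterns):
    """Return (label, begin, end, matched_text) for every regex match."""
    hits = []
    for label, pattern in patterns.items():
        for match in re.finditer(pattern, text):
            hits.append((label, match.start(), match.end(), match.group()))
    return hits


def find_dictionary_terms(text, terms):
    """Case-insensitive, whole-word lookup of each dictionary term."""
    hits = []
    for term in terms:
        # re.escape keeps characters in the term from being read as regex syntax.
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((term, match.start(), match.end(), match.group()))
    return hits


text = "Order AB-1234 included an Acme Router; the acme router arrived late."

# Hypothetical customer-defined pattern: two capital letters, a hyphen, four digits.
pattern_hits = find_patterns(text, {"order_id": r"\b[A-Z]{2}-\d{4}\b"})

# Hypothetical product-name dictionary with a single entry.
term_hits = find_dictionary_terms(text, ["Acme Router"])
```

Each hit keeps its character offsets, so the match can be traced back to the exact position in the original text.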

Because Watson Content Analytics implements the open Unstructured Information Management Architecture (UIMA) framework, you can fully customize the processing of content and take advantage of existing UIMA annotators, both open source and commercial, to enhance your automated content analysis capabilities.

To develop custom annotators, you can use IBM Watson Content Analytics Studio as an alternative to manually developing annotators with the UIMA software development kit (SDK). IBM Watson Content Analytics Studio is a separately installable component of Watson Content Analytics.

Deep, automated analysis of content can be a computing-intensive operation. To support fast and efficient analysis of large amounts of content, Watson Content Analytics provides a highly scalable implementation of the UIMA specification, which allows you to distribute content analysis across multiple machines.
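Because each content item can be analyzed independently of the others, the work divides naturally among workers. The Python sketch below shows that idea on a single machine with a thread pool from the standard library, as a stand-in for distribution across machines; it does not reflect how Watson Content Analytics schedules work, and analyze and analyze_all are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def analyze(text):
    """Stand-in for a full analysis pipeline: here, only simple counts."""
    return {"length": len(text), "words": len(text.split())}


def analyze_all(documents, workers=4):
    """Analyze documents concurrently; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(analyze, documents))


docs = [
    "first crawled document",
    "second one",
    "a third, slightly longer document",
]
results = analyze_all(docs, workers=2)
```

In a real deployment the workers would be separate machines and the per-item analysis would be far more expensive, but the partitioning principle is the same: no item depends on the result of another.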