When streamlining the ingestion of structured and unstructured data is a priority, the
enrichment capabilities in IBM® Watson® Discovery Service are a major advantage. The reason: its built-in
enrichment algorithms can process your data sources and help find signal at the implicit level as
well as the explicit level.
This means Discovery is not your traditional keyword-based enrichment tool.
Rather, the information you can extract from unstructured text might be something that wasn’t
even referenced explicitly in the text.
Most important, this capability is automated—and built into Discovery. That
means taking advantage of this power doesn’t require you to write code or install the plumbing.
So it takes less time to find signal, saving money and other resources.
As an example, the sentiment analysis API already exists in IBM Watson. It gives you the ability
to find out whether people are talking positively or negatively about people, companies or
some other entity.
Likewise, other powerful APIs are also available in Watson. The concept tagging API, which can
extract an overarching concept that connects explicitly mentioned text, already exists. Watson
also includes the entity tagging API, which can tie pronouns back to something that was
previously referenced in the text to create a connection that would otherwise not exist.
The data crawler functions like a standalone program that uploads all of the files in a directory
or set of directories, whether from a local machine or a network file system. Further, Discovery
includes connectors that talk to different data repositories.
The difficulty is that constructing the plumbing with the API calls and associated coding can be
time-consuming and costly. And the learning curve associated with building the code
infrastructure may be daunting. Discovery comes with that plumbing
already built, so these enrichments are ready to use.
Additional enrichment tools
But Discovery Service enrichment capabilities are not limited to the automatic natural
language processing on your documents. As your solution grows, you can use the Natural Language
Understanding service available in IBM Bluemix to enrich content that's outside of your data pipeline.
You can also improve the accuracy and specificity of your enrichments using custom models created from
IBM Watson Knowledge Studio. So if you want to train your own custom model based on a specific domain
such as legal or finance, now you can.
You would just use Watson Knowledge Studio to create custom models, then publish them to Discovery
so they can act on the ingested data. The same custom models can be applied to Natural Language
Understanding enrichments as well, so this capability improves both enrichment functions.
Other data processing capabilities
Like a print preview on a Microsoft Word document, the preview API returns the results of the
enrichment to the user rather than sending them to the search index. This gives you a sandbox to evaluate an
enrichment and test different configurations. Once you are satisfied with the setup, you can
switch to the regular ingestion API and send the results to the search index.
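The difference between previewing and ingesting can be sketched in a few lines of Python. This is a toy stand-in, not the Discovery API: `enrich`, `preview`, `ingest` and the `config` dictionary are hypothetical names, and the "enrichment" is a trivial keyword check used only so there is something to inspect.

```python
def enrich(text, config):
    """Toy enrichment (hypothetical): list the configured keywords found in the text."""
    words = text.lower().split()
    found = [k for k in config["keywords"] if k in words]
    return {"text": text, "keywords": found}

def preview(text, config):
    """Return the enriched result to the caller; nothing is stored."""
    return enrich(text, config)

def ingest(text, config, index):
    """Enrich the text and append the result to the search index."""
    index.append(enrich(text, config))

index = []                                  # stands in for the search index
config = {"keywords": ["cotton", "red"]}    # illustrative configuration

# Try a configuration in the sandbox first; the index is untouched.
result = preview("Red cotton scarf", config)

# Once satisfied, send the same document through ingestion.
ingest("Red cotton scarf", config, index)
```

The only difference between the two paths is where the enriched document ends up, which is what makes the preview path safe for experimenting with configurations.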
With the enrichments in place, an ingested document can be normalized. As part of
normalization, the data structure in Discovery gives you flexibility as to how
you’d like to represent the data prior to it being indexed.
This flexibility exists because Discovery can represent hierarchical relationships such
as parent-child or sibling-sibling in the data structure. This isn’t possible in a traditional flat-file
structure. And this capability allows query options that otherwise couldn’t be executed.
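To make the contrast with a flat structure concrete, here is a minimal Python sketch. The document shape and field names (`product`, `variants`, `color`, `in_stock`) are hypothetical and are not Discovery's actual schema. The point is only that nesting keeps each child's fields together, so a query can select a parent by the properties of a single child.

```python
# Hypothetical nested record: one parent product with child variants.
nested = {
    "product": "rain jacket",
    "variants": [
        {"color": "red", "in_stock": False},
        {"color": "blue", "in_stock": True},
    ],
}

# The same data flattened: values survive, but which color goes with
# which stock status is no longer recorded.
flat = {
    "product": "rain jacket",
    "colors": {"red", "blue"},
    "stock_flags": {False, True},
}

def has_variant(doc, color, in_stock):
    """True if one child variant matches both conditions at once."""
    return any(
        v["color"] == color and v["in_stock"] == in_stock
        for v in doc["variants"]
    )

def flat_guess(doc, color, in_stock):
    """Best a flat record can do: test each field independently."""
    return color in doc["colors"] and in_stock in doc["stock_flags"]

# "Is there a red variant in stock?"
nested_answer = has_variant(nested, "red", True)
flat_answer = flat_guess(flat, "red", True)
```

Here `nested_answer` is `False`, which is correct, while `flat_answer` is `True`, a false positive, because the flat record cannot tell that the in-stock variant is the blue one.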
So let’s revisit our fictitious online reseller to see what this means in practice. Discovery
has already ingested multiple Blue Snail Style data sources, including the highly
descriptive product listings from the Blue Snail Style website and the customer reviews.
Now the IT staff begins playing with the previously ingested unstructured data sources using
the built-in enrichment algorithms, and using the preview API to evaluate the results. And
Discovery allows them to see relationships between the different product
listings that pure text-based processing couldn’t. For instance:
- The online catalog reveals 81 percent of the company’s products are machine washable.
- The hierarchical data structure finds that the product offerings consist of 49 percent
men’s clothing, 32 percent women’s clothing and 19 percent unisex products.
- The customer reviews reveal the highest-ranked products come in red.
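Breakdowns like the ones above come down to counting field values across enriched records. The following Python sketch shows the arithmetic only. The sample records, the field names (`category`, `machine_washable`) and the `percent_share` helper are hypothetical, and the values are illustrative rather than Blue Snail Style data.

```python
from collections import Counter

# Hypothetical enriched product records (illustrative values only).
products = [
    {"name": "oxford shirt", "category": "men", "machine_washable": True},
    {"name": "wool blazer", "category": "men", "machine_washable": False},
    {"name": "wrap dress", "category": "women", "machine_washable": True},
    {"name": "rain jacket", "category": "unisex", "machine_washable": True},
]

def percent_share(records, field):
    """Return {value: percent of records having that value} for one field."""
    counts = Counter(r[field] for r in records)
    total = len(records)
    return {value: 100 * n / total for value, n in counts.items()}

category_share = percent_share(products, "category")
washable_share = percent_share(products, "machine_washable")
```

With this sample, `category_share` is 50 percent men's, 25 percent women's and 25 percent unisex, and 75 percent of the products are machine washable.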