RDF stores
You can configure Watson™ Explorer Content Analytics to store data as triples in a Resource Description Framework (RDF) store. Storing triples can improve the analysis of large text documents by capturing the relationships between multiple resources.
After the triples are in the RDF store, you can search or analyze them by running a SPARQL query. You can also create facets that are based on these triples by creating an analytics facet dictionary in Content Analytics Studio and exporting it to your Watson Explorer Content Analytics collection. If you want to use the data in another application or add the data to the RDF store of another collection, you can download the RDF content by using Content Analytics Studio.
For example, suppose that within millions of online documents, you are interested in only the sections that contain a noun, followed by a verb, followed by another noun. You create an annotator to identify all text that conforms to this noun-verb-noun format, but you then want to search for specific information within the matched patterns. Although the annotator identifies and annotates the documents that contain the pattern, restricting a search to only the noun-verb-noun text is not straightforward. In such a case, you can add all the noun-verb-noun values to the RDF store and then search the triples by using the REST API.
Prerequisites for using Db2 as the RDF store
- Db2 files
  - antlr-3.3-java.jar
  - rdfstore.jar
  - wala.jar
  - db2jcc4.jar
  - db2jcc_license_cu.jar
  - db2rdf.jar
- Apache Commons file
  - commons-logging-1.1.1.jar
- Apache Jena 2.7.3 files
  - jena-arq-2.9.3.jar
  - jena-core-2.7.3.jar
  - jena-iri-0.9.3.jar
Adding data to the RDF store
To extract triples to the RDF store, you must first annotate the data that you want to store and create features for the subject, predicate, and object. For example, you define a parsing rule in Content Analytics Studio to annotate a noun-verb-noun pattern with the UIMA type myTriples, and create separate features noun1, verb, and noun2. You then export the annotator to Watson Explorer Content Analytics and associate it with your collection so that the noun-verb-noun pattern is annotated in every crawled document. Then, you define the triple pattern value to extract to the RDF store, such as by specifying noun1 as the subject feature, verb as the predicate feature, and noun2 as the object feature. After you restart the parse and index services and crawl the documents, triples that match the specified pattern are stored in the RDF store.
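The mapping from annotation features to a triple can be pictured with a minimal Python sketch. This is only a conceptual illustration, not product code: the Triple class, the annotation_to_triple function, and the sample feature values are hypothetical, and only the feature names noun1, verb, and noun2 come from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """A subject-predicate-object statement, as stored in an RDF store."""
    subject: str
    predicate: str
    object: str


def annotation_to_triple(features, subject_feature="noun1",
                         predicate_feature="verb", object_feature="noun2"):
    """Pick the configured features out of one annotation (hypothetical helper)."""
    return Triple(
        subject=features[subject_feature],
        predicate=features[predicate_feature],
        object=features[object_feature],
    )


# Hypothetical feature values for one myTriples annotation.
annotation = {"noun1": "dog", "verb": "chases", "noun2": "cat"}
triple = annotation_to_triple(annotation)
print(triple)
```

Each annotated occurrence of the pattern yields one such triple, with the subject, predicate, and object taken from whichever features you assign to those roles.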
If you use the embedded triplestore database, you can also add content to the RDF store by using Content Analytics Studio to upload RDF files. If you use a Db2 RDF store, you can add content to the RDF store by using the Db2 loadrdfstore command.
Expressions for refining the stored values
If needed, you can define regular expressions to refine the values that are added to the RDF store. For example, in the feature values that are generated by the custom text analysis rules, extracted entities are concatenated. You can specify a regular expression to extract separate values from the concatenated value. For example, if the text analysis rules generate a feature value in the format noun-verb-noun, you can specify a regular expression to extract only the verb as the predicate value.
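As a rough illustration of this kind of refinement, the following Python sketch extracts only the verb from a concatenated noun-verb-noun value. It uses Python's re module; the regular expression syntax that the product accepts is not described here, so treat the pattern, the extract_verb function, and the sample value as assumptions.

```python
import re
from typing import Optional

# Assumed format of the concatenated feature value: noun-verb-noun,
# with a hyphen as the separator and no hyphens inside the parts.
TRIPLE_PATTERN = re.compile(r"^(?P<noun1>[^-]+)-(?P<verb>[^-]+)-(?P<noun2>[^-]+)$")


def extract_verb(feature_value: str) -> Optional[str]:
    """Return the middle (verb) part, or None if the value does not match."""
    match = TRIPLE_PATTERN.match(feature_value)
    return match.group("verb") if match else None


print(extract_verb("dog-chases-cat"))  # chases
```

A value that does not follow the assumed three-part format yields None rather than a partial match, which keeps malformed values out of the predicate position.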
Specifying additional matching criteria
In some cases, you might want to specify criteria to determine which triple pattern values are to be added to the RDF store. For example, you defined multiple text analysis rules but you want to store the output of only a specific rule. Because the annotations that are generated by all the text analysis rules have the same UIMA type, you specify criteria to store only the annotations that are generated by the specific rule according to its category, such as $.myword. In this case, specify com.ibm.takmi.nlp.annotation_type.ContiguousContext:category for the feature to evaluate, and specify $.myword as the regular expression pattern to match.
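The effect of such a criterion can be sketched in Python as a filter over annotations. This is a conceptual sketch only: the select_annotations function and the sample annotations are hypothetical, and the product's own matching semantics are not described here. Because $ and . are metacharacters in Python's re module, the sketch escapes the category value so that it is compared literally.

```python
import re

CATEGORY_FEATURE = "com.ibm.takmi.nlp.annotation_type.ContiguousContext:category"


def select_annotations(annotations, feature, pattern):
    """Keep annotations whose feature value fully matches the pattern."""
    compiled = re.compile(pattern)
    return [a for a in annotations if compiled.fullmatch(a.get(feature, ""))]


# Hypothetical annotations produced by two different text analysis rules.
annotations = [
    {CATEGORY_FEATURE: "$.myword", "noun1": "dog", "verb": "chases", "noun2": "cat"},
    {CATEGORY_FEATURE: "$.otherword", "noun1": "sun", "verb": "warms", "noun2": "earth"},
]

# re.escape makes "$" and "." literal in Python's regex dialect.
kept = select_annotations(annotations, CATEGORY_FEATURE, re.escape("$.myword"))
print(kept)
```

Only the annotation whose category is $.myword survives the filter, so only its values would go on to be stored as a triple.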
Text search
If you enable text search for the RDF store, you can search the parsed and indexed RDF content. You can run a text search by using the REST API to submit a SPARQL query with the search predicate in the following format:
?Variable <http://www.ibm.com/wca/rdf/function#search> (Literal_Search_Keyword Number_of_Results)
For example, you submit the SPARQL query SELECT * WHERE {?X <http://www.ibm.com/wca/rdf/function#search> ('white' 10)}. This query returns up to 10 resources whose associated literals include the term white. The resources are returned in the variable X.
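The following Python sketch shows one way to assemble a query in this format and URL-encode it for submission. The REST endpoint and its parameter names are not given here, so the sketch stops at building the query string. The build_search_query function is hypothetical, and the quote escaping assumes standard SPARQL string-literal rules.

```python
from urllib.parse import quote

SEARCH_PREDICATE = "http://www.ibm.com/wca/rdf/function#search"


def build_search_query(keyword: str, max_results: int, variable: str = "X") -> str:
    """Build a SPARQL query that uses the search predicate (hypothetical helper)."""
    # Escape backslashes and single quotes so the keyword stays inside the literal.
    escaped = keyword.replace("\\", "\\\\").replace("'", "\\'")
    return (
        f"SELECT * WHERE {{?{variable} <{SEARCH_PREDICATE}> "
        f"('{escaped}' {max_results})}}"
    )


query = build_search_query("white", 10)
print(query)
# SELECT * WHERE {?X <http://www.ibm.com/wca/rdf/function#search> ('white' 10)}

# URL-encode the query before placing it in an HTTP request parameter.
encoded_query = quote(query)
print(encoded_query)
```

Building the query through a helper keeps the keyword and the result limit as the only moving parts, which makes it harder to produce a malformed argument list.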
If text search is not enabled, the search predicate cannot be used in SPARQL queries, and SPARQL queries can be used for exact matching only. For example, a query for ibm wat does not return IBM Watson if text search is not enabled.
When you enable text search, you must select a collection in which to index the RDF data. For improved search performance, select an enterprise search collection that does not contain any non-RDF content and for which no crawlers are defined.
Statistical analysis of links
If you enable statistical analysis of links, you can submit SPARQL queries with the linkAnalysis predicate to analyze the statistical weight of links in the stored RDF graph. The predicate uses the following format:
?Variable <http://www.ibm.com/wca/rdf/function#linkAnalysis> (Resource Predicate Predicate Number_of_Results)
This query discovers other resources that are linked to the specified resource. The first predicate is a predicate of the first hop from the origin resource and the second predicate is a predicate of the second hop.
For example, you submit the SPARQL query SELECT * WHERE {?X <http://www.ibm.com/wca/rdf/function#linkAnalysis> (<http://www.example.ibm.com/rdf#gold> <http://www.example.ibm.com/rdf#predicate1> <http://www.example.ibm.com/rdf#predicate2> 10)}. This query traces links of the specified predicates from the origin resource gold up to two hops. All paths are weighted with a calculated score and the 10 most relevant results are returned in the variable X. If gold and diamond are frequently found together in the documents, the query returns diamond as a result.
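A linkAnalysis query can be assembled in the same way as any other SPARQL string. The Python sketch below is a minimal illustration: the build_link_analysis_query function is hypothetical, and the resource and predicate URIs are the example values used above.

```python
LINK_ANALYSIS_PREDICATE = "http://www.ibm.com/wca/rdf/function#linkAnalysis"


def build_link_analysis_query(origin: str, first_hop: str, second_hop: str,
                              max_results: int, variable: str = "X") -> str:
    """Build a SPARQL query that uses the linkAnalysis predicate (hypothetical helper)."""
    return (
        f"SELECT * WHERE {{?{variable} <{LINK_ANALYSIS_PREDICATE}> "
        f"(<{origin}> <{first_hop}> <{second_hop}> {max_results})}}"
    )


link_query = build_link_analysis_query(
    origin="http://www.example.ibm.com/rdf#gold",
    first_hop="http://www.example.ibm.com/rdf#predicate1",
    second_hop="http://www.example.ibm.com/rdf#predicate2",
    max_results=10,
)
print(link_query)
```

The argument order mirrors the format: the origin resource, the predicate for the first hop, the predicate for the second hop, and then the number of results.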