RDF stores

You can configure Watson™ Explorer Content Analytics to store data as triples in a Resource Description Framework (RDF) store. Storing triples can improve the analysis of large text documents by capturing the relationships between multiple resources.

RDF provides a flexible method to decompose knowledge into smaller parts, called triples, with rules about the relationships between those parts. In Watson Explorer Content Analytics, you can store the triples in the embedded triplestore database, or you can configure the system to store the triples in a Db2 10.5 database.

After the triples are in the RDF store, you can search or analyze them by running a SPARQL query. You can also create facets that are based on these triples by creating an analytics facet dictionary in Content Analytics Studio and exporting it to your Watson Explorer Content Analytics collection. If you want to use the data in another application or add the data to the RDF store of another collection, you can download the RDF content by using Content Analytics Studio.

For example, within millions of online documents, you are interested in only the sections that contain a noun, followed by a verb, followed by another noun. You create an annotator to identify all text that conforms to this noun-verb-noun format. But within these noun-verb-noun patterns, you want to search for specific information. Although the annotator identifies and annotates the documents that contain the pattern, searching only the annotated noun-verb-noun values is not straightforward. In such a case, you can add all the noun-verb-noun values to the RDF store and then search the triples by using the REST API.
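As a sketch of that last step, the following Python snippet builds a SPARQL query over the extracted triples and the request URL to submit it through a REST API. The endpoint path, port, and parameter names here are assumptions for illustration, not the documented API.

```python
# Hypothetical sketch: build a SPARQL request against the RDF store.
# The endpoint path and parameter names are assumptions, not the
# documented Watson Explorer Content Analytics REST API.
import urllib.parse

def build_sparql_request(base_url, collection_id, sparql_query):
    """Build a request URL that submits a SPARQL query for a collection."""
    params = urllib.parse.urlencode({
        "collection": collection_id,
        "query": sparql_query,
    })
    return f"{base_url}/api/v10/rdf/search?{params}"  # path is illustrative

# Find every stored triple whose predicate (the verb) is "acquired".
query = 'SELECT ?s ?o WHERE {?s <http://example.com/verb#acquired> ?o}'
url = build_sparql_request("http://server:8390", "col_12345", query)
```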

Prerequisites for using Db2 as the RDF store

To use Db2 10.5 as the RDF store for Watson Explorer Content Analytics, you must first create the Db2 RDF store. Then, copy the following files to the ES_INSTALL_ROOT/lib/rdf directory:
  • Db2 files
    • antlr-3.3-java.jar
    • rdfstore.jar
    • wala.jar
    • db2jcc4.jar
    • db2jcc_license_cu.jar
    • db2rdf.jar
  • Apache Commons file
    • commons-logging-1.1.1.jar
  • Apache Jena 2.7.3 files
    • jena-arq-2.9.3.jar
    • jena-core-2.7.3.jar
    • jena-iri-0.9.3.jar

Adding data to the RDF store

To extract triples to the RDF store, you must first annotate the data that you want to store and create features for the subject, predicate, and object. For example, you define a parsing rule in Content Analytics Studio to annotate a noun-verb-noun pattern with the UIMA type myTriples, and create the separate features noun1, verb, and noun2. You then export the annotator to Watson Explorer Content Analytics and associate it with your collection so that the noun-verb-noun pattern is annotated in every crawled document. Then, you define the triple pattern value to extract to the RDF store, for example by specifying noun1 as the subject feature, verb as the predicate feature, and noun2 as the object feature. After you restart the parse and index services and crawl the documents, triples that match the specified pattern are stored in the RDF store.
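The mapping step can be pictured as follows. This is a minimal sketch, not the product's internal extraction code: the annotation is shown as a plain dictionary, and the feature names come from the example above.

```python
# Minimal sketch of the extraction step: given a UIMA-style annotation of
# type "myTriples" with features noun1, verb, and noun2, build an RDF
# triple. Representing the annotation as a dict is an assumption for
# illustration; the real extraction is configured in the product.
def annotation_to_triple(annotation,
                         subject_feature="noun1",
                         predicate_feature="verb",
                         object_feature="noun2"):
    """Map the configured features of one annotation to (s, p, o)."""
    return (annotation[subject_feature],
            annotation[predicate_feature],
            annotation[object_feature])

ann = {"noun1": "IBM", "verb": "acquired", "noun2": "SoftLayer"}
triple = annotation_to_triple(ann)
```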

Restriction: You cannot extract triples to the RDF store if your system includes multiple document processing servers.

If you use the embedded triplestore database, you can also add content to the RDF store by using Content Analytics Studio to upload RDF files. If you use a Db2 RDF store, you can add content to the RDF store by using the Db2 loadrdfstore command.

Tip: If many triples are extracted to an RDF store, more time might be required to store the data and run SPARQL queries. To optimize performance, check the query performance before you put large amounts of data in the RDF store.

Expressions for refining the stored values

If needed, you can define regular expressions to refine the values that are added to the RDF store. For example, the feature values that are generated by custom text analysis rules might concatenate the extracted entities, and you can specify a regular expression to extract separate values from the concatenated value. If the text analysis rules generate a feature value in the format noun-verb-noun, you might specify a regular expression that extracts only the verb as the predicate value.
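A sketch of that refinement in Python: the hyphen-delimited value format and the capture-group layout are assumptions for illustration.

```python
import re

# Sketch of refining a concatenated feature value with a regular
# expression. The "noun-verb-noun" value format and the single capture
# group for the middle segment are assumptions for illustration.
PATTERN = re.compile(r"^[^-]+-([^-]+)-[^-]+$")

def extract_predicate(concatenated_value):
    """Return only the middle (verb) segment of a noun-verb-noun value."""
    match = PATTERN.match(concatenated_value)
    return match.group(1) if match else None
```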

Specifying additional matching criteria

In some cases, you might want to specify criteria that determine which triple pattern values are added to the RDF store. For example, you defined multiple text analysis rules, but you want to store the output of only one specific rule. Because the annotations that are generated by all the text analysis rules have the same UIMA type, you specify criteria to store only the annotations that are generated by that rule, identified by its category, such as $.myword. In this case, specify com.ibm.takmi.nlp.annotation_type.ContiguousContext:category as the feature to evaluate, and specify $.myword as the regular expression pattern to match.
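The filtering can be pictured with this sketch. Treating $.myword as a literal category value (escaped before compiling the regex) is an assumption, as is the dictionary representation of the annotation.

```python
import re

# Sketch of the matching criteria described above: keep only annotations
# whose category feature matches a pattern. Escaping "$.myword" so that
# it is matched literally is an assumption for illustration.
CATEGORY_FEATURE = (
    "com.ibm.takmi.nlp.annotation_type.ContiguousContext:category")
CATEGORY_PATTERN = re.compile(re.escape("$.myword"))

def should_store(annotation):
    """Store the annotation only if its category matches the pattern."""
    category = annotation.get(CATEGORY_FEATURE, "")
    return CATEGORY_PATTERN.search(category) is not None
```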

Text search

If you enable text search for the RDF store, you can search the parsed and indexed RDF content. You can run a text search by using the REST API to submit a SPARQL query with the search predicate in the following format:

?Variable <http://www.ibm.com/wca/rdf/function#search> (Literal_Search_Keyword Number_of_Results)

For example, you submit the SPARQL query SELECT * WHERE {?X <http://www.ibm.com/wca/rdf/function#search> ('white' 10)}. This query returns up to 10 resources whose associated literals include the term white. The resources are returned in a variable X.
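Building such a query programmatically can be sketched as follows. The search predicate URI and the query shape come from the text above; the helper function name is illustrative.

```python
# Sketch of constructing the text search query shown above. The predicate
# URI and query shape come from the documentation; the helper name is
# illustrative.
SEARCH_PREDICATE = "<http://www.ibm.com/wca/rdf/function#search>"

def build_text_search_query(keyword, max_results):
    """Return a SPARQL query that finds resources whose literals
    include the keyword, returning at most max_results resources."""
    return (f"SELECT * WHERE {{?X {SEARCH_PREDICATE} "
            f"('{keyword}' {max_results})}}")
```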

If text search is not enabled, the search predicate cannot be used in SPARQL queries, and SPARQL queries can be used for exact matching only. For example, a query for ibm wat does not return IBM Watson if text search is not enabled.

When you enable text search, you must select a collection in which to index the RDF data. For improved search performance, select an enterprise search collection that does not contain any non-RDF content and for which no crawlers are defined.

Tip: If a text search collection is configured, each time that you import an RDF file from Content Analytics Studio, the entire store is scanned and all triples are sent to the search collection to be indexed. If you need to import an RDF file multiple times, you can reduce the processing time by disabling the text search collection setting before the first import and re-enabling it before the final import.

Statistical analysis of links

If you enable statistical analysis of links, you can submit SPARQL queries with the linkAnalysis predicate to analyze the statistical weight of links in the stored RDF graph.

?Variable <http://www.ibm.com/wca/rdf/function#linkAnalysis> (Resource Predicate Predicate Number_of_Results)

This query discovers other resources that are linked to the specified resource. The first predicate applies to the first hop from the origin resource, and the second predicate applies to the second hop.

For example, you submit the SPARQL query SELECT * WHERE {?X <http://www.ibm.com/wca/rdf/function#linkAnalysis> (<http://www.example.ibm.com/rdf#gold> <http://www.example.ibm.com/rdf#predicate1> <http://www.example.ibm.com/rdf#predicate2> 10)}. This query traces links of the specified predicates from the origin resource gold up to two hops. All paths are weighted with a calculated score and the 10 most relevant results are returned in the variable X. If gold and diamond are frequently found together in the documents, the query returns diamond as a result.
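Building the example linkAnalysis query can be sketched the same way as the text search query. The predicate URI and argument order come from the text above; the helper name is illustrative.

```python
# Sketch of constructing the two-hop linkAnalysis query shown above. The
# predicate URI and argument order come from the documentation; the
# helper name is illustrative.
LINK_ANALYSIS = "<http://www.ibm.com/wca/rdf/function#linkAnalysis>"

def build_link_analysis_query(resource, predicate1, predicate2,
                              max_results):
    """Return a SPARQL query that weighs two-hop links from resource."""
    return (f"SELECT * WHERE {{?X {LINK_ANALYSIS} "
            f"({resource} {predicate1} {predicate2} {max_results})}}")
```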