You can configure Watson Explorer Content Analytics to store data as triples in a Resource Description Framework (RDF) store. Storing triples can improve the analysis of large text documents by capturing the relationships between multiple resources.
After the triples are in the RDF store, you can search or analyze them by running a SPARQL query. You can also create facets that are based on these triples by creating an analytics facet dictionary in Content Analytics Studio and exporting it to your Watson Explorer Content Analytics collection. If you want to use the data in another application or add it to the RDF store of another collection, you can download the RDF content by using Content Analytics Studio.
For example, within millions of online documents, you are interested only in sections that contain a noun, followed by a verb, followed by another noun. You create an annotator to identify all text that conforms to this noun-verb-noun format. But within these noun-verb-noun patterns, you want to search for specific information. Although the annotator identifies and annotates the documents that contain the pattern, searching only within the noun-verb-noun patterns is not straightforward. In such a case, you can add all the noun-verb-noun values to the RDF store and then search the triples by using the REST API.
To extract triples to the RDF store, you must first annotate the data that you want to store and create features for the subject, predicate, and object. For example, you define a parsing rule in Content Analytics Studio to annotate a noun-verb-noun pattern with the UIMA type myTriples, and create separate features noun1, verb, and noun2. You then export the annotator to Watson Explorer Content Analytics and associate it with your collection so that the noun-verb-noun pattern is annotated in every crawled document. Then, you define the triple pattern value to extract to the RDF store, such as by specifying noun1 as the subject feature, verb as the predicate feature, and noun2 as the object feature. After you restart parse and index services and crawl the documents, triples that match the specified pattern are stored in the RDF store.
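The mapping from annotation features to a stored triple can be sketched as follows. This is an illustration only: the annotation dictionary and the make_triple helper are hypothetical, not part of the Watson Explorer Content Analytics API; the feature names noun1, verb, and noun2 come from the example above.

```python
# Sketch: how subject, predicate, and object features of one annotation
# map to an (s, p, o) triple. The helper and sample data are illustrative.

def make_triple(annotation, subject_feature, predicate_feature, object_feature):
    """Build an (s, p, o) triple from the features of one annotation."""
    return (annotation[subject_feature],
            annotation[predicate_feature],
            annotation[object_feature])

# One myTriples annotation produced by the noun-verb-noun parsing rule
annotation = {"noun1": "IBM", "verb": "acquires", "noun2": "SoftLayer"}

triple = make_triple(annotation, "noun1", "verb", "noun2")
print(triple)  # ('IBM', 'acquires', 'SoftLayer')
```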
If you use the embedded triplestore database, you can also add content to the RDF store by using Content Analytics Studio to upload RDF files. If you use a DB2 RDF store, you can add content to the RDF store by using the DB2 loadrdfstore command.
If needed, you can define regular expressions to refine the values that are added to the RDF store. For example, the feature values that are generated by custom text analysis rules might concatenate several extracted entities. You can specify a regular expression to extract separate values from the concatenated value. For example, if the text analysis rules generate a feature value in the format noun-verb-noun, you can specify a regular expression to extract only the verb as the predicate value.
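As a sketch of that refinement, the following Python snippet uses a capturing group to pull only the middle word out of a concatenated noun-verb-noun value. The sample value is invented, and the exact regular-expression dialect that the product accepts may differ from Python's re module:

```python
import re

# Concatenated feature value produced by a text analysis rule (illustrative)
value = "IBM acquires SoftLayer"

# Capture the middle word of a noun-verb-noun value as the predicate
match = re.match(r"^\S+\s+(\S+)\s+\S+$", value)
if match:
    predicate = match.group(1)
    print(predicate)  # acquires
```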
In some cases, you might want to specify criteria to determine which triple pattern values are to be added to the RDF store. For example, you defined multiple text analysis rules but you want to store the output of only a specific rule. Because the annotations that are generated by all the text analysis rules have the same UIMA type, you specify criteria to store only the annotations that are generated by the specific rule according to its category, such as $.myword. In this case, specify com.ibm.takmi.nlp.annotation_type.ContiguousContext:category for the feature to evaluate, and specify $.myword as the regular expression pattern to match.
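The effect of such a filter can be sketched in Python. The annotation data is invented for illustration, and the metacharacters in the pattern are escaped here so that it matches the literal category string $.myword; how the product itself interprets the pattern may differ:

```python
import re

# Annotations with the same UIMA type but different categories (illustrative)
annotations = [
    {"category": "$.myword", "noun1": "IBM", "verb": "acquires", "noun2": "SoftLayer"},
    {"category": "$.other",  "noun1": "cat", "verb": "chases",   "noun2": "mouse"},
]

# Keep only annotations whose category matches the configured pattern.
# '$' and '.' are escaped so the expression matches the literal string.
pattern = re.compile(r"\$\.myword")

selected = [a for a in annotations if pattern.search(a["category"])]
print([a["verb"] for a in selected])  # ['acquires']
```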
If you enable text search for the RDF store, you can search the parsed and indexed RDF content. You can run a text search by using the REST API to submit a SPARQL query with the search predicate in the following format:
?Variable <http://www.ibm.com/wca/rdf/function#search> (Literal_Search_Keyword Number_of_Results)
For example, you submit the SPARQL query SELECT * WHERE {?X <http://www.ibm.com/wca/rdf/function#search> ('white' 10)}. This query returns up to 10 resources whose associated literals include the term white. The resources are returned in a variable X.
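A client that builds this kind of text-search query might look like the following Python sketch. The search predicate URI comes from the format above; the REST endpoint and its parameter name in the comment are hypothetical placeholders, so consult the REST API reference of your installation for the actual values:

```python
# Build a SPARQL text-search query for the RDF store. The predicate URI
# is from the documented search-predicate format; the REST endpoint shown
# in the comment below is a hypothetical placeholder.

SEARCH_PREDICATE = "<http://www.ibm.com/wca/rdf/function#search>"

def build_text_search_query(keyword, max_results):
    """Return a SPARQL query that finds resources whose literals contain keyword."""
    return ("SELECT * WHERE {?X %s ('%s' %d)}"
            % (SEARCH_PREDICATE, keyword, max_results))

query = build_text_search_query("white", 10)
print(query)
# SELECT * WHERE {?X <http://www.ibm.com/wca/rdf/function#search> ('white' 10)}

# Submitting the query would then be an HTTP request, for example:
#   import urllib.request, urllib.parse
#   url = "http://server:port/api/..."  # hypothetical endpoint path
#   data = urllib.parse.urlencode({"query": query}).encode()
#   urllib.request.urlopen(url, data)
```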
If text search is not enabled, the search predicate cannot be used in SPARQL queries, and SPARQL queries can be used for exact matching only. For example, a query for ibm wat does not return IBM Watson if text search is not enabled.
When you enable text search, you must select a collection in which to index the RDF data. For improved search performance, select an enterprise search collection that does not contain any non-RDF content and for which no crawlers are defined.
If you enable statistical analysis of links, you can submit SPARQL queries with the linkAnalysis predicate to analyze the statistical weight of links in the stored RDF graph. The predicate has the following format:
?Variable <http://www.ibm.com/wca/rdf/function#linkAnalysis> (Resource Predicate Predicate Number_of_Results)
A query with this predicate discovers other resources that are linked to the specified resource. The first predicate is a predicate of the first hop from the origin resource, and the second predicate is a predicate of the second hop.
For example, you submit the SPARQL query SELECT * WHERE {?X <http://www.ibm.com/wca/rdf/function#linkAnalysis> (<http://www.example.ibm.com/rdf#gold> <http://www.example.ibm.com/rdf#predicate1> <http://www.example.ibm.com/rdf#predicate2> 10)}. This query traces links of the specified predicates from the origin resource gold up to two hops. All paths are weighted with a calculated score and the 10 most relevant results are returned in the variable X. If gold and diamond are frequently found together in the documents, the query returns diamond as a result.
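The linkAnalysis query from this example can be assembled the same way. The predicate URI and the example resource URIs come from the text above; the builder function itself is an illustrative helper, not part of any product API:

```python
# Build a SPARQL linkAnalysis query. The predicate URI is from the
# documented format; the helper function is illustrative only.

LINK_ANALYSIS = "<http://www.ibm.com/wca/rdf/function#linkAnalysis>"

def build_link_analysis_query(resource, predicate1, predicate2, max_results):
    """Return a SPARQL query that traces two-hop links from resource."""
    return ("SELECT * WHERE {?X %s (%s %s %s %d)}"
            % (LINK_ANALYSIS, resource, predicate1, predicate2, max_results))

query = build_link_analysis_query(
    "<http://www.example.ibm.com/rdf#gold>",
    "<http://www.example.ibm.com/rdf#predicate1>",
    "<http://www.example.ibm.com/rdf#predicate2>",
    10,
)
print(query)
```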