Custom global analysis

In addition to the default global analysis tasks that occur during the indexing process, you can configure Watson Explorer Content Analytics to run custom global analysis tasks.

Restriction: Custom global analysis is available only for collections that use IBM® InfoSphere® BigInsights. Jaql must be installed on the InfoSphere BigInsights server.

You can use custom global analysis to obtain information by examining the entire document set, as opposed to examining each document individually. For example, consider the following sample use cases:

For each document, count the number of times it is cited by another document: Suppose that your collection consists of documents, such as journal articles and patents, that cite other documents, and you want to calculate how many times each document is cited by other documents. You use a custom annotator during document processing to extract information about the documents that each document cites. You then configure a custom global analysis task to count the citations for each document across the entire document set. This value is then saved in a new field of each cited document. In the content analytics miner, a user can then sort the documents by this value to determine which documents are cited most often by other documents in the collection.

Count the number of named entities that were extracted from documents in a collection: During the document processing, Watson Explorer Content Analytics extracts named entities from each document. You can configure a custom global analysis task to count how many times each named entity occurs in all the documents of the collection. This value is then saved in a file so that another application can use the data.
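
The logic behind both use cases can be sketched outside the product. The following Python example is an illustration only: an actual custom global analysis task is written in Jaql, and the record layout and the field names id, cites, and entities are assumptions made for this sketch. It also assumes that self-citations are ignored and that each citing document is counted once per cited document.

```python
from collections import Counter

# Hypothetical per-document records, shaped like the output that a custom
# annotator might produce. The field names are illustrative assumptions.
documents = [
    {"id": "doc1", "cites": ["doc2", "doc3"], "entities": ["IBM", "Jaql"]},
    {"id": "doc2", "cites": ["doc3"], "entities": ["IBM"]},
    {"id": "doc3", "cites": [], "entities": ["Jaql", "IBM"]},
]


def count_citations(docs):
    """Return {document id: number of other documents in the set that cite it}."""
    counts = Counter()
    for doc in docs:
        # set() counts each citing document once per cited document.
        for cited_id in set(doc["cites"]):
            if cited_id != doc["id"]:  # assumption: ignore self-citations
                counts[cited_id] += 1
    # Every document receives a value, including documents that are never cited.
    return {doc["id"]: counts[doc["id"]] for doc in docs}


def count_entities(docs):
    """Return {named entity: total occurrences across the whole document set}."""
    counts = Counter()
    for doc in docs:
        counts.update(doc["entities"])
    return dict(counts)


citation_counts = count_citations(documents)
entity_counts = count_entities(documents)
```

Both functions need the whole document set before they can produce a result for any single document, which is why this kind of processing belongs in global analysis rather than in per-document processing.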

The custom global analysis logic is implemented by creating a Jaql (Query Language for JSON) script. The inputs for the script are the fields, facets, and text that are extracted from the content during the document processing stage. The output from the script can be stored as document fields or facets in the Watson Explorer Content Analytics index. Your Jaql script can also save the output to a file or in some other format so that another application can use the data.

The Jaql script must be included in an archive file that has the .zip file extension. In addition to the Jaql script, the archive file must contain the custom global analysis configuration file (install.jaql) and any other files that are needed by the custom Jaql script.
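
As one possible way to assemble such an archive, the following Python sketch packages a directory into a .zip file and checks that install.jaql is present. The function name build_archive and the directory layout are hypothetical. The sketch assumes that install.jaql belongs at the top level of the archive, which this topic does not state explicitly.

```python
import zipfile
from pathlib import Path

CONFIG_NAME = "install.jaql"  # configuration file name given in this topic


def build_archive(source_dir, archive_path):
    """Package every file under source_dir into a .zip archive.

    Assumption: entries are stored relative to source_dir, so install.jaql
    ends up at the top level of the archive.
    Returns the sorted list of entry names that were written.
    """
    source_dir = Path(source_dir)
    archive_path = Path(archive_path)
    if archive_path.suffix != ".zip":
        raise ValueError("the archive must have the .zip file extension")
    if not (source_dir / CONFIG_NAME).is_file():
        raise FileNotFoundError(f"{CONFIG_NAME} is missing from {source_dir}")

    # Collect the files first and skip the archive itself in case it
    # already exists inside source_dir.
    files = sorted(
        p for p in source_dir.rglob("*")
        if p.is_file() and p.resolve() != archive_path.resolve()
    )
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in files:
            archive.write(path, arcname=path.relative_to(source_dir).as_posix())
    return [p.relative_to(source_dir).as_posix() for p in files]
```

For example, build_archive("my_analysis", "my_analysis.zip") would package a hypothetical my_analysis directory that contains install.jaql, the custom Jaql script, and any supporting files.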

After you develop a custom analysis script, configure a custom global analysis task for a collection in the administration console to specify which fields and facets to pass to the script for analysis. In the Parse and Index pane of the administration console, click Configure > Global processing > Custom global analysis and click the Add icon.

By default, your custom global analysis task automatically runs after indexing finishes. Alternatively, you can specify a schedule to configure how often the task runs, such as every day or every six hours.

For example, suppose that your crawlers are scheduled to run several times a day. After a crawler finishes, the indexing process starts automatically. After the indexing process finishes, the global analysis task starts automatically by default. Indexing and global analysis are sequential operations and cannot run at the same time: during global analysis, no documents can be indexed. If your custom global analysis task takes a long time to run, documents continue to be crawled on schedule but are not indexed while the task is running. In such a case, if you want documents to be indexed regularly even though newly indexed documents do not yet have the results of global analysis, you can disable the automatic run. Instead, schedule global analysis to run once a day so that the documents eventually receive the global analysis results.
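
The trade-off described above can be estimated with simple arithmetic. The following Python sketch is a rough model, not a description of the product scheduler: it assumes that every global analysis run takes the same amount of time and that indexing is blocked for exactly that long. The function name and the figures are hypothetical.

```python
def blocked_hours_per_day(analysis_runs_per_day, analysis_hours):
    """Estimate the hours per day during which indexing cannot run.

    Simplifying assumption: each global analysis run blocks indexing for
    analysis_hours, and the total cannot exceed the 24 hours in a day.
    """
    return min(24.0, analysis_runs_per_day * analysis_hours)


# Hypothetical figures: six crawl-and-index cycles per day, and a custom
# global analysis task that takes 1.5 hours per run.
automatic = blocked_hours_per_day(6, 1.5)  # analysis runs after every indexing run
scheduled = blocked_hours_per_day(1, 1.5)  # analysis is scheduled once a day
```

With these figures, the automatic run blocks indexing for 9 hours a day, whereas a once-a-day schedule blocks it for 1.5 hours. The cost of the schedule is that newly indexed documents can wait up to a day for their global analysis results.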

Because the input and output data need to be stored for the custom global analysis task, additional disk space is needed. The amount of additional disk space that is required depends on which data is used and generated by the custom global analysis task. The input data is stored on each InfoSphere BigInsights node and the output is stored on the Watson Explorer Content Analytics master server.