Configuring a UIMA pipeline

You can configure a UIMA pipeline file to specify the sequence of linguistic resources that are used to analyze documents.

About this task

The UIMA pipeline configuration file is a placeholder for the linguistic resources and configuration parameters that are used to annotate documents. The resources are included in various stages of the pipeline, which run consecutively and interact to analyze documents and generate annotations.
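
Behind the scenes, a UIMA pipeline is executed as a standard UIMA analysis engine. As a rough illustration of how the stages share a single CAS, the following is a minimal sketch that runs an exported pipeline descriptor with the core Apache UIMA Java API; the descriptor file name myPipeline.xml is a hypothetical placeholder, and this is not Content Analytics Studio-specific code.

  import org.apache.uima.UIMAFramework;
  import org.apache.uima.analysis_engine.AnalysisEngine;
  import org.apache.uima.cas.CAS;
  import org.apache.uima.cas.text.AnnotationFS;
  import org.apache.uima.resource.ResourceSpecifier;
  import org.apache.uima.util.XMLInputSource;

  public class RunPipeline {
      public static void main(String[] args) throws Exception {
          // Parse the pipeline descriptor (hypothetical file name).
          XMLInputSource in = new XMLInputSource("myPipeline.xml");
          ResourceSpecifier spec =
              UIMAFramework.getXMLParser().parseResourceSpecifier(in);

          // Instantiate the analysis engine that represents the pipeline.
          AnalysisEngine ae = UIMAFramework.produceAnalysisEngine(spec);

          // All stages read and write annotations in a shared CAS.
          CAS cas = ae.newCAS();
          cas.setDocumentText("IBM opened a new laboratory in Tokyo.");
          ae.process(cas);

          // Print the annotations that the stages generated.
          for (AnnotationFS a : cas.getAnnotationIndex()) {
              System.out.println(a.getType().getName() + ": " + a.getCoveredText());
          }
          ae.destroy();
      }
  }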

For each stage in the pipeline, you can view a list of the UIMA types that are required as input to that stage and the UIMA types that are generated by the stage.
Tip: If a stage requires an input type that is not generated by an earlier stage in the pipeline, that input type is displayed with a warning icon. Right-click the missing type and click Find to view a list of the Content Analytics Studio resources in your project that generate the missing type. Then, add one of those resources to the pipeline.
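
In standard UIMA, each analysis engine declares its input and output types as capabilities in its descriptor, and the same metadata is available programmatically. As a sketch (assuming an AnalysisEngine instance such as the one in the example above), this prints the types that an engine requires and generates:

  import org.apache.uima.analysis_engine.AnalysisEngine;
  import org.apache.uima.resource.metadata.Capability;
  import org.apache.uima.resource.metadata.TypeOrFeature;

  public class PrintCapabilities {
      // Print the UIMA types that an engine declares as required inputs
      // and generated outputs; the pipeline editor surfaces this metadata.
      static void print(AnalysisEngine ae) {
          for (Capability cap : ae.getAnalysisEngineMetaData().getCapabilities()) {
              for (TypeOrFeature in : cap.getInputs()) {
                  System.out.println("requires:  " + in.getName());
              }
              for (TypeOrFeature out : cap.getOutputs()) {
                  System.out.println("generates: " + out.getName());
              }
          }
      }
  }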

Configuring a UIMA pipeline is an iterative process. As you create more resources, such as new custom dictionaries and parsing rules databases, you must go back and edit the UIMA pipeline configuration file to include these resources as part of the analysis process.

Procedure

To configure a UIMA pipeline:

  1. In the Studio Explorer view, right-click the Configuration/Annotators directory in your project and click New > UIMA Pipeline Configuration.
    Tip: If your UIMA pipeline includes only dictionaries, or you plan to include parsing rules but have not yet created them, clear the Include parsing rules stage in the pipeline check box. You can add a parsing rules stage later.
  2. Configure the stages of the UIMA pipeline:
    1. In the UIMA Pipeline Stages list, click Document Language and specify a method for identifying the language of each document.
      If all documents are in the same language, you can manually specify that language. (The first sketch after this procedure shows the equivalent setting at the UIMA API level.)
      Tip: If you accept the default option to automatically determine the document language, edit the Acceptable Languages list to specify the languages for which you expect to have documents. Specifying the list of possible languages helps to ensure that Content Analytics Studio identifies the correct language for each document.
    2. Click Lexical Analysis and specify a list of resources such as lexical dictionaries, character rules dictionaries, and custom dictionaries for each language in which you expect to have documents.
      You can also specify which break rules to use for splitting a document into paragraphs, sentences, and tokens (the second sketch after this procedure lists the resulting annotations).
    3. If your pipeline includes a parsing rules stage, click Parsing Rules and specify a list of parsing rule files for each language in which you expect to have documents.
      Tip: If you specify multiple parsing rule files, the order in which you list the files affects the order in which the rules are processed. That is, rules in the first file are processed first, followed by the rules in the second file. If the rules in a file depend on annotations that are created by rules in a different file, ensure that the files are listed in the correct order.
    4. Optional: Add and configure additional pipeline stages.
      For example, you can add a PEAR stage to include annotators that are packaged as a PEAR file. You can also add a semantic analysis stage to find connections between annotations that are identified in the document. You can add a condition or switch stage to run an annotator stage only under certain conditions, such as running different lexical analysis stages with particular sets of dictionaries depending on the source of the document.
    5. Click Clean Up and select the annotation types to exclude from the final output (the last sketch after this procedure shows the equivalent CAS operation).
      Tip: If you want to remove some intermediary types from the final output but still view them in the Content Analytics Studio annotation editor, select the Show removed types in the annotation editor check box. For example, you might want to view these intermediary types so that you can use them as inputs to a parsing rule.
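
The following sketches illustrate, at the level of the core Apache UIMA Java API, what some of the stages that you configure in step 2 amount to. They continue the hypothetical example from the beginning of this topic and are not Content Analytics Studio-specific code.

For the Document Language stage, the detected (or manually specified) language is recorded as the CAS document language, which later stages can read to choose language-specific resources:

  import org.apache.uima.cas.CAS;

  public class DocumentLanguage {
      // If every document is in the same language, the manual setting is
      // equivalent to recording that language on the CAS up front.
      static void fixLanguage(CAS cas) {
          cas.setDocumentLanguage("en");  // ISO language code
      }

      // Later stages read the same value to select language-specific
      // dictionaries and parsing rule files.
      static String languageOf(CAS cas) {
          return cas.getDocumentLanguage();
      }
  }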
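
For the Lexical Analysis stage, the paragraph, sentence, and token boundaries that the break rules produce are materialized as annotations in the CAS. This sketch lists the annotations of one such type; the type name is a placeholder, so substitute the names from your project's type system:

  import org.apache.uima.cas.CAS;
  import org.apache.uima.cas.Type;
  import org.apache.uima.cas.text.AnnotationFS;

  public class ListAnnotations {
      // List every annotation of one type, for example the tokens that
      // lexical analysis produced. Call with a real type name; the value
      // "com.example.Token" would be a hypothetical placeholder.
      static void list(CAS cas, String typeName) {
          Type type = cas.getTypeSystem().getType(typeName);
          for (AnnotationFS a : cas.getAnnotationIndex(type)) {
              System.out.println(a.getBegin() + "-" + a.getEnd()
                  + ": " + a.getCoveredText());
          }
      }
  }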
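
For the Clean Up stage, excluding a type from the final output corresponds to removing its feature structures from the CAS indexes, roughly as follows (again with a placeholder type name):

  import java.util.ArrayList;
  import java.util.List;
  import org.apache.uima.cas.CAS;
  import org.apache.uima.cas.Type;
  import org.apache.uima.cas.text.AnnotationFS;

  public class CleanUp {
      // Remove every annotation of the given type from the CAS indexes so
      // that it no longer appears in the output. Copy to a list first,
      // because the index must not change while it is being iterated.
      static void removeType(CAS cas, String typeName) {
          Type type = cas.getTypeSystem().getType(typeName);
          List<AnnotationFS> toRemove = new ArrayList<>();
          for (AnnotationFS a : cas.getAnnotationIndex(type)) {
              toRemove.add(a);
          }
          for (AnnotationFS a : toRemove) {
              cas.removeFsFromIndexes(a);
          }
      }
  }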