Configuring character rules graphically

You can configure character rule expressions based on sample text that contains a particular character sequence.

About this task

Before you create character rules, you must create a character rules database and include the compiled dictionary file in the lexical analysis stage of your UIMA pipeline. You can then analyze sample text that contains the patterns to use as the basis for the rules.

After the sample text is analyzed by Content Analytics Studio, the pattern of character classes that represents the selected text is displayed in a tree format. You can modify the character sequence that the rule matches, for example by generalizing the pattern to match similar sequences of characters, and then define one or more annotations to create when matching text is found in a document. You can also create features for the annotations. After you add the rule to the database, rebuild the character rules file.

Alternatively, you can manually add character class elements to a character rule by using the Add Character option. Using this approach, you can create a character rule without dragging any sample text. For example, this approach might be an easier way to create character rules when the exact pattern that you want to match is not available in the text.
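
To picture how sample text breaks down into a sequence of character classes, the following Python sketch groups consecutive characters by Unicode general category. It is an illustration only: the function name character_class_runs is hypothetical, and the use of general categories such as Nd (decimal digit) and Pd (dash punctuation) is an assumption; the classes and tree layout that Content Analytics Studio displays might differ.

```python
import unicodedata
from itertools import groupby


def character_class_runs(text):
    """Group consecutive characters that share a Unicode general category.

    Returns a list of (category, run) tuples in document order.
    This is a conceptual stand-in, not the Studio's internal representation.
    """
    runs = []
    for category, chars in groupby(text, key=unicodedata.category):
        runs.append((category, "".join(chars)))
    return runs


if __name__ == "__main__":
    # Print one line per run of same-class characters.
    for category, run in character_class_runs("1-800-426-4968"):
        print(f"{category}: {run}")
```

Each run in the result corresponds roughly to one node that you could generalize or make optional when you refine a rule.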

Procedure

To create character rules:

  1. Create a character rules database.
    In the Studio Explorer view, right-click the Resources/Character Rules directory in your project and click New > Character Rules Database.
  2. Include the character rules file in your UIMA pipeline configuration:
    1. From the Configuration/Annotators directory, open the ANNOCONFIG file for your pipeline.
    2. Select the Lexical Analysis stage, select the appropriate language, and add the new character rules DIC file to the list of dictionaries.
  3. Run an initial analysis to tokenize one or more documents that contain sample character sequences on which to base the character rules.
    From the Documents directory, open a document. Right-click the document in the editor view and click Analyze Document. Ensure that you select the UIMA pipeline to which you added the new character rules file.
  4. From the Resources/Character Rules directory, open the new character rule database by double-clicking it.
  5. Add rules to the database by using the Create Character Rules view:
    1. Define the character sequence for the rule to match.
      Drag the sample character sequence from your annotated document to the Selection tab, where the text is displayed as a tree of Unicode character class nodes. You can then refine the match criteria by configuring the nodes of the tree, for example by generalizing the pattern so that it matches similar occurrences of the same concept.
      For example, you can create a rule to match United States telephone numbers by dragging the sequence 1-800-426-4968 from an annotated document to the Selection tab. To generalize this rule so that it also matches phone numbers without an area code, specify that the area code prefix is optional by right-clicking the nodes that represent the area code characters and setting the Group option to Occurring zero or one time.
    2. Specify how to annotate text that matches the specified character sequence.
      On the Annotation tab, right-click any node, click Insert Annotation, and specify a name for the new annotation, such as USPhoneNumber.
    3. Optional: Create features for the new annotations.
      For example, for a United States phone number annotation you might create a feature for the area code. In the annotations tree on the Annotation tab, select the node that represents the area code and drag the node under the Features node of the USPhoneNumber annotation that you created.
    4. Optional: Specify the rule set on the Selection tab.
      Rule sets group related rules together.
    5. Add the rule to the database by clicking the Add/Save the current rule icon in the Create Character Rules view.
  6. Rebuild the character rules file by clicking the Build icon.
  7. Test the rule by reviewing the updated annotations in your sample document.
    In the Outline view for the annotated document, verify that the new annotations are now displayed. If the rule did not identify all instances of the character sequence in the document, refine the criteria that you specified for the rule in the Create Character Rules view.
    Tip: To temporarily disable a rule, select the Properties > Omit rule from build option in the Create Character Rules view. By omitting rules from the build, you can compare the results of similar rules and determine which rule best identifies the text, without deleting and re-creating the rules for each test.
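
The telephone number example in the procedure can be approximated outside the product with a regular expression. The following Python sketch is not how Content Analytics Studio stores or compiles character rules; it only mirrors the logic of the example: a leading group that occurs zero or one time, an annotation named USPhoneNumber, and a feature for the area code. The names annotate and area_code are hypothetical.

```python
import re

# Hypothetical regular-expression stand-in for the generalized rule.
US_PHONE_NUMBER = re.compile(
    r"(?<!\d)"                        # do not start inside a longer digit run
    r"(?:\d-(?P<area_code>\d{3})-)?"  # optional group: occurs zero or one time
    r"\d{3}-\d{4}"                    # local number, for example 426-4968
    r"(?!\d)"                         # do not stop inside a longer digit run
)


def annotate(text):
    """Return one dictionary per match, loosely modeled on an annotation.

    The area_code entry plays the role of a feature; it is None when the
    optional group did not take part in the match.
    """
    return [
        {
            "type": "USPhoneNumber",
            "begin": match.start(),
            "end": match.end(),
            "covered_text": match.group(0),
            "area_code": match.group("area_code"),
        }
        for match in US_PHONE_NUMBER.finditer(text)
    ]


if __name__ == "__main__":
    sample = "Call 1-800-426-4968 or 426-4968 today."
    for annotation in annotate(sample):
        print(annotation)
```

Running a sketch like this against the same sample text can help you decide which nodes to generalize before you test the rule in the Studio, in the same way that step 7 checks whether every instance in the document was identified.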

What to do next

Whenever you add or change character rules, you must rebuild the character rules file from the database before your pipeline can use the updated rules to analyze documents.