Parsing rules

Parsing rules identify patterns of text that represent concepts of interest, such as person names or corporate takeovers. For example, a rule that identifies corporate takeovers might match the text IBM acquired Lotus Development.

Each parsing rule defines a particular sequence of one or more annotations that are generated by previous stages of the UIMA pipeline, such as the following types of annotations:

  • Tokens such as words, punctuation, or numbers
  • Terms that are defined in a custom dictionary
  • Annotations that are created by a character rule or another parsing rule

You can create parsing rules based on sample text that contains the particular sequence of annotations that you want to identify in your documents. After the sample text is analyzed by Content Analytics Studio, the pattern of annotations that represent the selected text is displayed in a tree format. You can then generalize the pattern to match similar occurrences of the same concept and define one or more annotations to create when matching text is found in the document.

For example, you might create a rule to identify a person name by dragging the text Sir Winston Churchill from an annotated document to the parsing rules editor. Content Analytics Studio analyzes the annotations in the text and might find the following sequence:

  • A Title annotation that is generated by matching the word Sir in a dictionary that contains a list of titles that are used to identify people, such as Mr, Mrs, and Dr.
  • A FirstName annotation that is generated by matching the word Winston in a dictionary that contains a list of given names.
  • A Token annotation for the word Churchill, which begins with an uppercase character and is a proper noun.

You can generalize this rule by specifying that the title is optional, so the rule matches both Sir Winston Churchill and Winston Churchill. You can then specify that the matched text is to be marked as a Person annotation.
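
The generalized rule can be pictured as a match over a sequence of annotations. The following Python sketch is only a conceptual illustration: Content Analytics Studio rules are built in the parsing rules editor, not in code, and the Annotation class, its proper_noun flag, and the match_person function are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical stand-in for an annotation from an earlier pipeline stage."""
    type: str                  # for example "Title", "FirstName", or "Token"
    text: str                  # the text that the annotation covers
    proper_noun: bool = False  # assumed part-of-speech flag for tokens


def match_person(annotations, start=0):
    """Match an optional Title, a FirstName, and a capitalized proper-noun Token.

    Returns a Person annotation that covers the matched text, or None.
    """
    i = start
    matched = []

    # The Title is optional: consume it only if it is present.
    if i < len(annotations) and annotations[i].type == "Title":
        matched.append(annotations[i])
        i += 1

    # The FirstName is required.
    if i >= len(annotations) or annotations[i].type != "FirstName":
        return None
    matched.append(annotations[i])
    i += 1

    # The final Token is required, must start with an uppercase character,
    # and must be flagged as a proper noun.
    if i >= len(annotations):
        return None
    last = annotations[i]
    if last.type != "Token" or not last.text[:1].isupper() or not last.proper_noun:
        return None
    matched.append(last)

    return Annotation("Person", " ".join(a.text for a in matched))


sample = [
    Annotation("Title", "Sir"),
    Annotation("FirstName", "Winston"),
    Annotation("Token", "Churchill", proper_noun=True),
]
with_title = match_person(sample)         # covers "Sir Winston Churchill"
without_title = match_person(sample[1:])  # covers "Winston Churchill"
```

Because the Title step is skipped when no Title annotation is present, the same sketch matches both forms of the name, which mirrors marking the title as optional in the rule.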

You can also create features for the annotations, such as a surname feature for the Person annotation. In addition, you can create normalized features whose values are based on the values of other existing features on the annotation. For example, you can create a normalized feature to convert the values of certain features to lowercase letters.
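
As a rough illustration of a normalized feature, the following Python sketch derives a lowercase value from an existing feature. The dictionary representation, the add_normalized_feature function, and the surname_normalized feature name are assumptions made for this example and are not part of the product.

```python
def add_normalized_feature(features, source, target):
    """Return a copy of features in which target holds the lowercase value of source."""
    result = dict(features)  # copy so that the original features are not modified
    result[target] = features[source].lower()
    return result


# A Person annotation with a surname feature, modeled as a plain dictionary.
person = {"surname": "Churchill"}
person = add_normalized_feature(person, "surname", "surname_normalized")
# person is now {"surname": "Churchill", "surname_normalized": "churchill"}
```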

Before you create parsing rules, determine the types of concepts that you want to identify and the intermediary elements, such as indicators and triggers, that can help identify the concepts. Then, create dictionaries to contain those terms. For example, if you want to create a rule to identify company names, you might first create a dictionary of company indicators such as Co and Inc.
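
To show why such a dictionary is useful, the following Python sketch uses a small set of company indicators to find candidate company names in a list of tokens. It is a simplified illustration, not the product's dictionary format: the find_company_names function and the company names in the sample are made up for this example.

```python
# Dictionary of company indicators, as suggested in the text.
COMPANY_INDICATORS = {"Co", "Inc"}


def find_company_names(tokens):
    """Return each capitalized token that is directly followed by a company indicator."""
    names = []
    for i in range(1, len(tokens)):
        previous, current = tokens[i - 1], tokens[i]
        # The indicator acts as the trigger; the preceding capitalized
        # token is treated as the company name.
        if current in COMPANY_INDICATORS and previous[:1].isupper():
            names.append(f"{previous} {current}")
    return names


tokens = ["Acme", "Inc", "acquired", "Widget", "Co"]  # hypothetical sample text
companies = find_company_names(tokens)
# companies is ["Acme Inc", "Widget Co"]
```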

Parsing rules are stored in a parsing rules database. This database is then built into a parsing rules JAR file that can be used in the parsing rules stage of a UIMA pipeline to analyze text and annotate items of interest.

Tips: To make the rules easier to maintain:
  • Group similar parsing rules in the same database.
  • Prefer groups of simple parsing rules, which are easier to manage and maintain. If the annotation sequence for a rule becomes too complex to manipulate, consider creating a separate parsing rule in the same database to handle alternative cases. You can define multiple rules that create annotations with the same name.
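
The second tip can be pictured as several short rules that all create the same annotation, in place of one rule with many optional branches. The following Python sketch is a conceptual illustration only; the RULES list and the apply_rules function are hypothetical and do not reflect how the parsing rules database stores rules.

```python
# Each entry pairs the annotation to create with one simple annotation sequence.
# All three rules create a Person annotation.
RULES = [
    ("Person", ["Title", "FirstName", "Token"]),
    ("Person", ["FirstName", "Token"]),
    ("Person", ["Title", "Token"]),
]


def apply_rules(annotation_types):
    """Return the annotation name of the first rule whose sequence matches exactly."""
    for name, sequence in RULES:
        if sequence == annotation_types:
            return name
    return None


full_name = apply_rules(["Title", "FirstName", "Token"])  # "Person"
title_only = apply_rules(["Title", "Token"])              # "Person"
no_match = apply_rules(["Token"])                         # None
```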