Character rules
Character rules identify sequences of characters that represent particular entities in your text, such as telephone numbers, email addresses, or product identifiers.
For example, you might want to identify United States telephone numbers such as 704-501-1500. But because phone numbers can be written in other ways, such as (704) 501-1500, writing a regular expression to find all of the possible variations can be challenging.
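To illustrate why the variations add up, here is a minimal Python sketch of a regular expression that accepts the two formats mentioned above. The pattern, the `US_PHONE` name, and the `find_phone_numbers` helper are assumptions made for illustration; they are not Content Analytics Studio syntax, and the pattern covers only a few common separators.

```python
import re

# Illustrative pattern only (an assumption, not a Content Analytics Studio rule).
# It accepts an area code written as "(704) " or "704-", followed by the
# exchange and line number separated by "-", ".", or whitespace.
US_PHONE = re.compile(
    r"""
    (?<!\d)                         # must not be preceded by another digit
    (?:\(\d{3}\)\s?|\d{3}[-.\s])    # area code, with or without parentheses
    \d{3}[-.\s]                     # three-digit exchange plus separator
    \d{4}                           # four-digit line number
    (?!\d)                          # must not be followed by another digit
    """,
    re.VERBOSE,
)


def find_phone_numbers(text: str) -> list[str]:
    """Return every substring of text that matches the sketch pattern."""
    return [match.group(0) for match in US_PHONE.finditer(text)]
```

Even this small pattern needs an alternation and two lookarounds, and it still ignores forms such as a leading country code or an extension.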
Character rules graphical editor
By using the Content Analytics Studio character rules editor, you can generate these character rule expressions graphically by basing the rules on sample text that contains the character sequences. After Content Analytics Studio analyzes the sample text, the pattern of character classes that represents the selected text is displayed in a tree format. You can then modify the pattern to match similar sequences of characters and define one or more annotations to create when matching text is found in the document. You can also create features for the annotations. For a United States phone number annotation, you might create a feature for the area code, which is the first three digits of the telephone number.
- Group similar character rules in the same database.
- Keep individual rules simple, because groups of simple character rules are easier to manage and maintain. If the character sequence tree for a rule becomes too complex to manipulate, consider creating a separate character rule in the same database to handle alternative cases. You can define multiple rules that create annotations with the same name.
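The idea of splitting one complex rule into several simple rules that create annotations with the same name can be sketched in Python as follows. This is an analogy under stated assumptions, not how Content Analytics Studio stores or runs character rules: the `Annotation` class, the `RULES` list, the `USPhoneNumber` annotation name, and the `areaCode` feature name are all hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class Annotation:
    """Hypothetical stand-in for an annotation with named features."""
    name: str
    text: str
    features: dict


# Two simple rules instead of one complex rule. Both create an annotation
# named "USPhoneNumber" and capture the area code as a feature.
RULES = [
    ("USPhoneNumber", re.compile(r"(?<!\d)(?P<areaCode>\d{3})-\d{3}-\d{4}(?!\d)")),
    ("USPhoneNumber", re.compile(r"\((?P<areaCode>\d{3})\) ?\d{3}-\d{4}(?!\d)")),
]


def annotate(text):
    """Apply every rule and return the annotations in document order."""
    found = []
    for name, pattern in RULES:
        for match in pattern.finditer(text):
            annotation = Annotation(name, match.group(0), match.groupdict())
            found.append((match.start(), annotation))
    return [annotation for _, annotation in sorted(found, key=lambda pair: pair[0])]
```

Each rule handles one written form, so adding support for another format means adding one more short rule rather than extending an already complex pattern.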
Character rules files
Alternatively, you can manually configure character rules by defining variables and regular expressions in CHARRULES files. This approach is for experienced users who want to create complex character rules.