Regular Expression Definitions

When extracting nonlinguistic entities, you may want to edit or add to the regular expression definitions that are used to identify those entities. This is done in the Regular Expression Definitions section in the Advanced Resources tab. See the topic About Advanced Resources for more information.

The file is broken up into distinct sections. The first section is called [macros]. In addition to that section, an additional section can exist for each nonlinguistic entity. You can add sections to this file. Within each section, rules are numbered (regexp1, regexp2, and so on). These rules must be numbered sequentially from 1–n. Any break in numbering will cause the processing of this file to be suspended altogether.

In certain cases, an entity is language dependent. An entity is considered to be language dependent if it takes a value other than 0 for the language parameter in the configuration file. See the topic Configuration for more information. When an entity is language dependent, the language must be used to prefix the section name, such as [english/PhoneNumber]. That section would contain rules that apply only to English phone numbers when the PhoneNumber entity is given a value of 2 for the language.

Important! If you make changes to this file or any other in the editor and the extraction engine no longer works as desired, use the Reset to Original option on the toolbar to reset the file to the original shipped content. This file requires a certain level of familiarity with regular expressions. If you require additional assistance in this area, please contact IBM® Corp. for help.

Special Characters . [] {} () \ * + ? | ^ $

All characters match themselves except for the following special characters, which are used for a specific purpose in expressions: .[{()\*+?|^$ To use these characters as such, they must be preceded by a backslash (\) in the definition.

For example, if you were trying to extract Web addresses, the full stop character is an essential part of the entity, so you must escape it with a backslash:

     www\.[a-z]+\.[a-z]+
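The effect of escaping the full stop can be checked with a minimal sketch. It uses Python's re module as a stand-in for the extraction engine, which is an assumption: the product's engine may differ in details. The sample strings are made up for illustration.

```python
import re

# Escaped dots match only a literal full stop.
escaped = re.compile(r"www\.[a-z]+\.[a-z]+")
# Unescaped dots match any character, so this pattern is too permissive.
unescaped = re.compile(r"www.[a-z]+.[a-z]+")

good = "www.example.com"   # hypothetical Web address
bad = "wwwxexampleycom"    # not a Web address at all

escaped_good = escaped.fullmatch(good) is not None
escaped_bad = escaped.fullmatch(bad) is not None
unescaped_bad = unescaped.fullmatch(bad) is not None

print(escaped_good, escaped_bad, unescaped_bad)
```

Only the escaped definition rejects the string that contains no full stops.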

Repetition Operators and Quantifiers ? + * {}

To enable the definitions to be more flexible, you can use several repetition operators that are standard in regular expressions: * ? + and {}

  • Asterisk * indicates that there are zero or more of the preceding character or group. For example, ab*c matches "ac", "abc", "abbbc", and so on.
  • Plus sign + indicates that there is one or more of the preceding character or group. For example, ab+c matches "abc", "abbc", "abbbc", but not "ac".
  • Question mark ? indicates that there is zero or one of the preceding character or group. For example, modell?ing matches both "modeling" and "modelling".
  • Braces {} limit the repetition by indicating its bounds. For example:

    [0-9]{n} matches a digit repeated exactly n times. For example, [0-9]{4} will match “1998”, but neither “33” nor “19983”.

    [0-9]{n,} matches a digit repeated n or more times. For example, [0-9]{3,} will match “199” or “1998”, but not “19”.

    [0-9]{n,m} matches a digit repeated between n and m times inclusive. For example, [0-9]{3,5} will match “199”, “1998” or “19983”, but neither “19” nor “199835”.
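The repetition operators described above can be tried out with a short sketch. Python's re module is used here as an assumed stand-in for the extraction engine, and the helper name matches is hypothetical. Each string is tested as a whole, which is how the examples in this section are meant to be read.

```python
import re

def matches(pattern, text):
    """Return True when the whole of `text` matches `pattern`."""
    return re.fullmatch(pattern, text) is not None

# (pattern, candidate string, expected result according to this section)
cases = [
    ("ab*c", "ac", True),
    ("ab*c", "abbbc", True),
    ("ab+c", "abc", True),
    ("ab+c", "ac", False),
    ("modell?ing", "modeling", True),
    ("modell?ing", "modelling", True),
    ("[0-9]{4}", "1998", True),
    ("[0-9]{4}", "33", False),
    ("[0-9]{4}", "19983", False),
    ("[0-9]{3,}", "1998", True),
    ("[0-9]{3,}", "19", False),
    ("[0-9]{3,5}", "19983", True),
    ("[0-9]{3,5}", "199835", False),
]

# Collect every case where the observed result differs from the expected one.
mismatches = [(p, t) for p, t, expected in cases if matches(p, t) != expected]
print(mismatches)
```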

Optional Spaces and Hyphens

In some cases, you want to include an optional space in a definition. For example, if you wanted to extract currencies such as "uruguayan pesos", "uruguayan peso", "uruguay pesos", "uruguay peso", "pesos" or "peso", you would need to deal with the fact that there may be two words separated by a space. In this case, the definition should be written as (uruguayan |uruguay )?pesos?. Since uruguayan or uruguay is followed by a space when used with pesos/peso, the optional space must be defined within the optional sequence (uruguayan |uruguay ). If the space were outside the optional sequence, as in (uruguayan|uruguay)? pesos?, the definition would not match “pesos” or “peso” on their own, since the space would be required.
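The difference between the two definitions can be seen in a small sketch, assuming Python's re module behaves like the extraction engine for these patterns (the product's engine may differ).

```python
import re

# Space inside the optional group: the whole prefix, space included, is optional.
correct = re.compile(r"(uruguayan |uruguay )?pesos?")
# Space outside the optional group: the space is always required.
wrong = re.compile(r"(uruguayan|uruguay)? pesos?")

phrases = [
    "uruguayan pesos", "uruguayan peso",
    "uruguay pesos", "uruguay peso",
    "pesos", "peso",
]

matched_by_correct = [p for p in phrases if correct.fullmatch(p)]
matched_by_wrong = [p for p in phrases if wrong.fullmatch(p)]

print(matched_by_correct)
print(matched_by_wrong)
```

The second definition misses the bare forms "pesos" and "peso" because they do not begin with a space.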

If you are looking for a series of characters that includes a hyphen (-) in a list, then the hyphen must be defined last. For example, if you are looking for a comma (,) or a hyphen (-), use [,-] and never [-,].

Order of Strings in Lists and Macros

You should always define the longest sequence before a shorter one; otherwise the longer sequence will never be matched, since the match will occur on the shorter one. For example, if you were looking for the strings “billion” or “bill”, then “billion” must be defined before “bill”: write (billion|bill) and not (bill|billion). This also applies to macros, since macros are lists of strings.
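A brief sketch of why the order matters, using Python's re module as an assumed stand-in for the extraction engine. Python tries alternatives from left to right and stops at the first one that succeeds, which mirrors the behavior described here.

```python
import re

text = "billion"

# Longest alternative first: the whole word is captured.
longest_first = re.match(r"(billion|bill)", text).group(0)
# Shortest alternative first: the match stops after "bill".
shortest_first = re.match(r"(bill|billion)", text).group(0)

print(longest_first)
print(shortest_first)
```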

Order of Rules in the Definition Section

Define one rule per line. Within each section, rules are numbered (regexp1, regexp2, and so on). These rules must be numbered sequentially from 1–n. Any break in numbering will cause the processing of this file to be suspended altogether. To disable an entry, place a # symbol at the beginning of each line used to define the regular expression. To enable an entry, remove the # character before that line.
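Because a break in numbering is easy to introduce when a rule is disabled, a quick check can help before saving. The following is a minimal sketch, not part of the product: the function name sections_with_gaps and the sample content are hypothetical, and it assumes that a line starting with # no longer counts toward the sequence, which this topic does not state explicitly.

```python
import re

RULE_LINE = re.compile(r"regexp(\d+)=")
SECTION_LINE = re.compile(r"\[(.+)\]")

def sections_with_gaps(text):
    """Return the names of sections whose rules are not numbered 1..n."""
    numbers = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line, comment, or disabled rule
        header = SECTION_LINE.fullmatch(line)
        if header:
            section = header.group(1)
            numbers[section] = []
            continue
        rule = RULE_LINE.match(line)
        if rule and section is not None:
            numbers[section].append(int(rule.group(1)))
    return [
        name for name, found in numbers.items()
        if sorted(found) != list(range(1, len(found) + 1))
    ]

# Hypothetical file content: disabling regexp2 leaves a gap before regexp3.
sample = """
[macros]
MONTH=(jan|feb)

[Date]
regexp1=$(MONTH),? [0-9]{4}
#regexp2=$(MONTH)
regexp3=[0-9]{4}
"""

broken = sections_with_gaps(sample)
print(broken)
```

In this sample, renumbering regexp3 to regexp2 would restore a sequential series.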

In each section, the most specific rules must be defined before the most general ones to ensure proper processing. For example, if you were looking for a date in the form “month year” and in the form “month”, then the “month year” rule must be defined before the “month” rule. Here is how it should be defined:

     #@# January 1932
     regexp1=$(MONTH),? [0-9]{4}

     #@# January
     regexp2=$(MONTH)

and not

     #@# January
     regexp1=$(MONTH)

     #@# January 1932
     regexp2=$(MONTH),? [0-9]{4}
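The effect of rule order can be illustrated with a sketch that assumes the first rule to match wins, which is how this section describes the processing. Python's re module stands in for the extraction engine, the MONTH pattern is a shortened stand-in for the $(MONTH) macro, and first_rule_match is a hypothetical helper.

```python
import re

# Shortened, hypothetical stand-in for the $(MONTH) macro.
MONTH = r"(january|february|march)"

def first_rule_match(rules, text):
    """Return the text captured by the first rule that matches at the start."""
    for pattern in rules:
        found = re.match(pattern, text, flags=re.IGNORECASE)
        if found:
            return found.group(0)
    return None

specific_first = [MONTH + r",? [0-9]{4}", MONTH]
general_first = [MONTH, MONTH + r",? [0-9]{4}"]

with_year = first_rule_match(specific_first, "January 1932")
without_year = first_rule_match(general_first, "January 1932")

print(with_year)
print(without_year)
```

With the general rule first, the year is left out of the extracted text.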

Using Macros in Rules

Whenever a specific sequence is used in several rules, you can use a macro. Then, if you need to change the definition of this sequence, you will need to change it only once, and not in all the rules referring to it. For example, assuming you had the following macro:

     MONTH=((january|february|march|april|june|july|august|september|october|
     november|december)|(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\.)?)

Whenever you refer to the name of the macro, it must be enclosed in $(), such as: regexp1=$(MONTH)

All macros must be defined in the [macros] section.
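Macro references can be thought of as simple text substitution. The sketch below illustrates that idea only: the expand function is hypothetical, Python's re module stands in for the extraction engine, and case-insensitive matching is an assumption made so that "January" matches the lowercase macro.

```python
import re

# The MONTH macro from this section, joined onto one line.
macros = {
    "MONTH": r"((january|february|march|april|june|july|august|september|october|"
             r"november|december)|(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)(\.)?)",
}

def expand(rule, macros):
    """Replace every $(NAME) reference with the definition of that macro."""
    return re.sub(r"\$\((\w+)\)", lambda m: macros[m.group(1)], rule)

month_year = expand("$(MONTH),? [0-9]{4}", macros)
month_only = expand("$(MONTH)", macros)

def whole_match(pattern, text):
    """Return True when the whole of `text` matches `pattern`, ignoring case."""
    return re.fullmatch(pattern, text, flags=re.IGNORECASE) is not None

print(whole_match(month_year, "January 1932"))
print(whole_match(month_year, "jan. 1932"))
print(whole_match(month_only, "May"))
```

Changing the MONTH entry once changes every rule that refers to it, which is the point of using a macro.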