Detecting entities with regular expressions

As with dictionaries, you can detect entities by matching regular expression (regex) patterns.

Unlike dictionaries, regular expressions are not provided in files. They are defined in memory, within a regex configuration. You can use multiple regex configurations in the same extraction.

Regexes that you define with Watson Natural Language Processing can use token boundaries. This way, you can ensure that your regular expression matches only on whole tokens, spanning one or more of them. This is a clear advantage over simpler regular expression engines, especially when you work with a language whose words are not separated by whitespace, such as Chinese.
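The effect of token boundaries can be illustrated with Python's standard `re` module. This is only a simplified sketch, not how RBR is implemented: the whitespace tokenizer and the helper names `token_spans` and `match_on_token_boundaries` are assumptions made for the illustration, and the helper merely filters ordinary regex matches rather than searching per token window.

```python
import re

def token_spans(text):
    """Return (begin, end) spans of whitespace-separated tokens (a stand-in tokenizer)."""
    return [m.span() for m in re.finditer(r"\S+", text)]

def match_on_token_boundaries(pattern, text, min_tokens=1, max_tokens=None):
    """Keep only regex matches that cover between min_tokens and max_tokens whole tokens."""
    if max_tokens is None:
        max_tokens = min_tokens
    spans = token_spans(text)
    starts = {begin: i for i, (begin, _) in enumerate(spans)}
    ends = {end: i for i, (_, end) in enumerate(spans)}
    results = []
    for m in re.finditer(pattern, text):
        begin, end = m.span()
        # The match must start at a token start and end at a token end.
        if begin in starts and end in ends:
            n_tokens = ends[end] - starts[begin] + 1
            if min_tokens <= n_tokens <= max_tokens:
                results.append((begin, end, m.group()))
    return results

text = "IBM built Watson"
unrestricted = [m.group() for m in re.finditer(r"[A-Z]+", text)]
aligned = match_on_token_boundaries(r"[A-Z]+", text, min_tokens=1, max_tokens=1)
print(unrestricted)  # ['IBM', 'W']
print(aligned)       # [(0, 3, 'IBM')]
```

Without the boundary check, the pattern also matches the capital `W` inside `Watson`. With it, only the whole token `IBM` is kept.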

Regular expressions are processed by a dedicated component called Rule-Based Runtime, or RBR for short.

Creating regex configurations

Begin by creating a module directory inside your notebook. This is a directory inside the notebook file system that is used temporarily to store the files created by the RBR training. This module directory can be the same directory that you created and used for dictionary-based entity extraction. Dictionaries and regular expressions can be used in the same training run.

To create the module directory in your notebook, enter the following in a code cell. Note that the name of the module directory can't contain a dash (-).

import os
import watson_nlp
module_folder = "NLP_RBR_Module_2"
os.makedirs(module_folder, exist_ok=True)

A regex configuration is a Python dictionary with the following attributes:

Available attributes in regex configurations with their values, descriptions of use and indication if required or not
Attribute	Value	Description	Required
`name`	string	The name of the regular expression. Matches of the regular expression in the input text are tagged with this name in the output.	Yes
`regexes`	list (strings of Perl-based regex patterns)	The regex patterns to match. The list must be non-empty. Multiple regexes can be provided.	Yes
`flags`	Delimited string of valid flags	Flags such as UNICODE or CASE_INSENSITIVE control the matching. Can also be a combination of flags. For the supported flags, see Pattern (Java Platform SE 8).	No (defaults to DOTALL)
`token_boundary.min`	int	`token_boundary` indicates whether to match the regular expression only on token boundaries. Specified as a dict object with `min` and `max` attributes.	No (returns the longest non-overlapping match at each character position in the input text)
`token_boundary.max`	int	`max` is an optional attribute for `token_boundary` and needed when the boundary needs to extend for a range (between `min` and `max` tokens). `token_boundary.max` needs to be `>= token_boundary.min`	No (if `token_boundary` is specified, the `min` attribute can be specified alone)
`groups`	list (string labels for matching groups)	The position of each label in the list corresponds to a group in the pattern: index 0 labels the entire match, and indexes 1 and higher label the capture groups in order. For example, `regex: (a)(b)` on `ab` with `groups: ['full', 'first', 'second']` yields `full: ab, first: a, second: b`.	No (defaults to labeling the full match)
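The `groups` example from the table can be reproduced with Python's standard `re` module. This is an illustration of the index-to-label mapping only. The helper `label_groups` is hypothetical and is not part of Watson NLP, and Python's regex dialect may differ in details from the one that RBR uses.

```python
import re

def label_groups(pattern, text, labels):
    """Map labels[0] to the whole match and labels[i] to capture group i."""
    m = re.search(pattern, text)
    if m is None:
        return None
    return {label: m.group(i) for i, label in enumerate(labels)}

labeled = label_groups(r"(a)(b)", "ab", ["full", "first", "second"])
print(labeled)  # {'full': 'ab', 'first': 'a', 'second': 'b'}
```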

The regex configurations can be loaded using the following helper methods:

To load a single regex configuration, use watson_nlp.toolkit.RegexConfig.load(<regex configuration>)
To load multiple regex configurations, use watson_nlp.toolkit.RegexConfig.load_all([<regex configuration>, ...])
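Before loading a configuration, it can be useful to check it against the attribute table. The following sketch is one possible reading of those rules in plain Python. The helper `check_regex_config` is hypothetical and not part of Watson NLP, and the library may apply different or stricter validation.

```python
def check_regex_config(config):
    """Return a list of problems found in a regex configuration dict (empty if none)."""
    problems = []
    if not isinstance(config.get("name"), str):
        problems.append("'name' is required and must be a string")
    regexes = config.get("regexes")
    if (not isinstance(regexes, list) or not regexes
            or not all(isinstance(r, str) for r in regexes)):
        problems.append("'regexes' is required and must be a non-empty list of strings")
    boundary = config.get("token_boundary")
    if boundary is not None:
        if not isinstance(boundary, dict) or not isinstance(boundary.get("min"), int):
            problems.append("'token_boundary' must be a dict with an int 'min'")
        else:
            upper = boundary.get("max")
            # 'max' is optional, but if present it must be >= 'min'.
            if upper is not None and (not isinstance(upper, int) or upper < boundary["min"]):
                problems.append("'token_boundary.max' must be an int >= 'token_boundary.min'")
    groups = config.get("groups")
    if groups is not None and not (isinstance(groups, list)
                                   and all(isinstance(g, str) for g in groups)):
        problems.append("'groups' must be a list of strings")
    return problems

good = {"name": "acronyms", "regexes": ["([A-Z]+)"], "groups": ["acronym"],
        "token_boundary": {"min": 1, "max": 1}}
bad = {"name": "broken", "regexes": [], "token_boundary": {"min": 2, "max": 1}}
print(check_regex_config(good))       # []
print(len(check_regex_config(bad)))   # 2
```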

Code sample

This sample shows you how to load two different regex configurations. The first configuration detects person names. It uses the groups attribute to allow easy access to the full, first and last name at a later stage.

The second configuration detects acronyms as a sequence of all-uppercase characters. By using the token_boundary attribute, it prevents matches in words that contain both uppercase and lowercase characters.

from watson_nlp.toolkit.rule_utils import RegexConfig

# Load some regex configs, for instance to match full names or acronyms
regexes = RegexConfig.load_all([
    {
        'name': 'full names',
        'regexes': ['([A-Z][a-z]*) ([A-Z][a-z]*)'],
        'groups': ['full name', 'first name', 'last name']
    },
    {
        'name': 'acronyms',
        'regexes': ['([A-Z]+)'],
        'groups': ['acronym'],
        'token_boundary': {
            'min': 1,
            'max': 1
        }
    }
])

Training a model that contains regular expressions

After you have loaded the regex configurations, create an RBR model using the RBR.train() method. In the method, specify:

The module directory
The language of the text
The regex configurations to use

This is the same method that is used to train RBR with dictionary-based extraction. You can pass the dictionary configuration in the same method call.

Code sample

# Train the RBR model
custom_regex_block = watson_nlp.resources.feature_extractor.RBR.train(module_path=module_folder, language='en', regexes=regexes)

Applying the model on new data

After you have trained the model, apply it to new data by using the run() method, as you would with any of the existing pre-trained blocks.

Code sample

custom_regex_block.run('Bruce Wayne works for NASA')

Output of the code sample:

{(0, 11): ['regex::full names'], (0, 5): ['regex::full names'], (6, 11): ['regex::full names'], (22, 26): ['regex::acronyms']}
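The keys in this output are character offsets into the input text. As a minimal sketch, assuming the result behaves like the dictionary printed above (the actual return type of run() may be a richer object), the offsets can be resolved back to the matched text by slicing:

```python
text = "Bruce Wayne works for NASA"

# Copied from the printed output above; treated here as a plain dict.
result = {
    (0, 11): ["regex::full names"],
    (0, 5): ["regex::full names"],
    (6, 11): ["regex::full names"],
    (22, 26): ["regex::acronyms"],
}

# Resolve each (begin, end) span to the text it covers, ordered by position.
rows = sorted((begin, end, text[begin:end], labels)
              for (begin, end), labels in result.items())
for begin, end, matched, labels in rows:
    print(begin, end, repr(matched), labels)
```

The four spans cover `Bruce`, `Bruce Wayne`, `Wayne`, and `NASA`. The end offset is exclusive, as in Python slicing.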

To show the matching subgroups or the matched text:

import json
# Get the raw response including matching groups
full_regex_result = custom_regex_block.executor.get_raw_response('Bruce Wayne works for NASA', language='en')
print(json.dumps(full_regex_result, indent=2))

Output of the code sample:

{
  "annotations": {
    "View_full names": [
      {
        "label": "regex::full names",
        "fullname": {
          "location": {
            "begin": 0,
            "end": 11
          },
          "text": "Bruce Wayne"
        },
        "firstname": {
          "location": {
            "begin": 0,
            "end": 5
          },
          "text": "Bruce"
        },
        "lastname": {
          "location": {
            "begin": 6,
            "end": 11
          },
          "text": "Wayne"
        }
      }
    ],
    "View_acronyms": [
      {
        "label": "regex::acronyms",
        "acronym": {
          "location": {
            "begin": 22,
            "end": 26
          },
          "text": "NASA"
        }
      }
    ]
  },
...
}
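The nested response can be flattened into one row per labeled group for easier inspection. The following sketch works on a dictionary shaped like the `annotations` part of the output above. The helper `flatten_annotations` is hypothetical, and the sketch assumes that every group entry carries `location` and `text` keys as shown. The part of the response that is elided above is not modeled.

```python
# A dict shaped like the "annotations" part of the raw response shown above.
raw = {
    "annotations": {
        "View_full names": [{
            "label": "regex::full names",
            "fullname": {"location": {"begin": 0, "end": 11}, "text": "Bruce Wayne"},
            "firstname": {"location": {"begin": 0, "end": 5}, "text": "Bruce"},
            "lastname": {"location": {"begin": 6, "end": 11}, "text": "Wayne"},
        }],
        "View_acronyms": [{
            "label": "regex::acronyms",
            "acronym": {"location": {"begin": 22, "end": 26}, "text": "NASA"},
        }],
    }
}

def flatten_annotations(raw_response):
    """Return (label, group name, begin, end, text) tuples for every labeled group."""
    rows = []
    for matches in raw_response.get("annotations", {}).values():
        for match in matches:
            label = match.get("label")
            for group_name, value in match.items():
                # Skip scalar fields such as "label"; keep only group entries.
                if isinstance(value, dict) and "location" in value:
                    loc = value["location"]
                    rows.append((label, group_name, loc["begin"], loc["end"], value["text"]))
    return rows

flat = flatten_annotations(raw)
for row in flat:
    print(row)
```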
