Data mining — Regular Expressions editor

With the Regular Expressions editor, you can edit rule files that are created in the Design Studio. You can use these rule files with the Regular Expression Lookup operator. The Regular Expression Lookup operator uses a built-in regular-expression annotator to find concepts like phone numbers or addresses. The annotator is based on regular expressions that describe the patterns that you are looking for in specified text columns.

The Regular Expression editor provides an Explorer section and a View section. The View section shows the properties of the item that is selected in the Explorer section.

The following figure shows the Regular Expression editor with the feature area_code selected in the Types section.

Figure 1. The Regular Expression editor

The graphic above shows the Regular Expression editor with the Types section and the Features section.

Explorer section

The Explorer section shows the specified types, their features, and their rules.

A type describes a concept like phone number, area code, or IP address. You can map a concept to one or more rules. The rules specify the regular-expression pattern.

A concept can have one or more subpatterns. You can specify these subpatterns as features. For example, the concept phone_number might include the subpatterns country_code and area_code. If a regular expression matches, you can set the value for one or more features to a subset of this match to enable further processing.
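The relationship between types, features, and rules can be pictured as a small data model. The following Python sketch is only an illustration: the class names ConceptType, Feature, and Rule and the sample pattern are hypothetical, and the sketch does not reflect the actual format of a rule file.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    # A named subpattern of a concept, for example area_code.
    name: str
    data_type: str = "String"  # String, Integer, or Float


@dataclass
class Rule:
    # A regular-expression pattern together with its match strategy.
    pattern: str
    match_strategy: str = "match all occurrences"


@dataclass
class ConceptType:
    # A concept such as phone_number, with its features and rules.
    name: str
    features: list = field(default_factory=list)
    rules: list = field(default_factory=list)


# Hypothetical example: one type with two features and one rule.
phone_number = ConceptType(
    name="phone_number",
    features=[Feature("country_code"), Feature("area_code")],
    rules=[Rule(r"(\d+)-(\d+)-(\d+)")],
)
```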

In the Explorer section, you can add, remove, or rename types, features, or rules. For example, to rename a type, you can press F2 or right-click the type to be renamed and select Rename from the popup menu.

View section

The View section shows the properties of the item that is selected in the Explorer section. Depending on the selected item, the following properties are provided:

Type properties

The following type properties are available:

Type Name: The name of the defined type.
The name must not include blanks or special characters.
Description: Optional: A description that provides additional information about the defined type.

Feature properties

The following feature properties are provided:

Feature name

The name of the feature.

Data type

The data type of the feature: String, Integer, or Float.

If you specify a data type other than String, you must ensure that the subpattern that is assigned to this feature matches a string value that can be converted to that data type. For example, for the Integer data type, the matching string must contain only an integer number.
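The conversion requirement can be checked outside the editor with a few lines of Python. This is a minimal sketch under the assumption that the conversion behaves like an ordinary string-to-number conversion; the helper names convert_feature_value and is_convertible are hypothetical, and the conversion rules of the product might be stricter or more lenient than Python's.

```python
def convert_feature_value(matched, data_type):
    """Convert a matched substring to the data type of the feature."""
    if data_type == "Integer":
        return int(matched)    # raises ValueError for "089-1" or ""
    if data_type == "Float":
        return float(matched)  # raises ValueError for "abc"
    return matched             # String: keep the matched text unchanged


def is_convertible(matched, data_type):
    """Return True if the matched substring fits the data type."""
    try:
        convert_feature_value(matched, data_type)
        return True
    except ValueError:
        return False
```

Note that in this sketch a numeric conversion drops leading zeros (the string 089 becomes the number 89), which is one reason to keep features such as area codes as String.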

Default value: String

Description

Optional: A description that provides additional information about the feature.

Rule properties

The following rule properties are provided:

Rule

To define a rule, you must specify a regular-expression pattern and a match strategy.

The following figure shows the properties of a selected rule. You can modify the sections for the rule mapping details, the test, and the feature mapping. You can also manually change the size of the sections or collapse sections to gain more space for the other sections.

Figure 2. The Regular Expression editor

The figure above shows the Regular Expression editor.

The match strategy specifies whether the built-in regular-expression annotator searches for the first occurrence of a document subsequence that matches the regular-expression pattern, for all occurrences of such a subsequence, or for exact matches only.

The following match strategies are provided:

match first occurrence: Stops after the first occurrence of a document subsequence that matches the pattern.
For example, given the pattern Phone: \d+ in the text Phone: 01234, ..., Phone: 56789, only the string Phone: 01234 is found.
match all occurrences: Finds all occurrences of a document subsequence that match the pattern.
For example, given the pattern Phone: \d+ in the text Phone: 01234, ..., Phone: 56789, the strings Phone: 01234 and Phone: 56789 are found.
match complete text: Finds a match only if the complete text matches the pattern exactly.
For example, given the pattern foo, the text foobar is not matched.

Default value: match all occurrences
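As a rough illustration, the three match strategies behave like the search, finditer, and fullmatch functions of Python's re module. This topic does not describe how the annotator is implemented, so the sketch only mimics the documented behavior; the function find_matches and the shortened sample text are assumptions made for the example.

```python
import re


def find_matches(pattern, text, strategy="match all occurrences"):
    """Return the matched strings for the given match strategy."""
    regex = re.compile(pattern)
    if strategy == "match first occurrence":
        match = regex.search(text)      # stops at the first hit
        return [match.group(0)] if match else []
    if strategy == "match all occurrences":
        return [match.group(0) for match in regex.finditer(text)]
    if strategy == "match complete text":
        match = regex.fullmatch(text)   # the whole text must match
        return [match.group(0)] if match else []
    raise ValueError("unknown match strategy: " + strategy)


sample = "Phone: 01234, Phone: 56789"
first = find_matches(r"Phone: \d+", sample, "match first occurrence")
every = find_matches(r"Phone: \d+", sample, "match all occurrences")
exact = find_matches("foo", "foobar", "match complete text")
```

Here first contains only Phone: 01234, every contains both phone strings, and exact is empty because foobar is longer than the pattern foo allows.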

If you are not familiar with regular expressions, you can use the Regular Expression Builder to create regular expressions. To open the Regular Expression Builder, click Regular Expression Builder.

You can also type regular-expression patterns that match the concepts that you are searching for directly into the text box. By default, the regular-expression constructs in the pattern are highlighted. You can disable the highlighting by right-clicking a highlighted construct and selecting Highlight regular expression constructs. Disabling the highlighting applies only to the current session.

Moving the cursor over a highlighted regular-expression construct displays more information about this construct.

When you are modifying an existing rule, you can remove or restore constructs by right-clicking the modified rule and selecting Undo or Redo from the popup menu. You can also press CTRL+Z to remove a construct or CTRL+Y to restore it.

Test Rule

With the Test Rule check box, you can enable or disable the test function.

In the Input text field, you can type the text that includes the document subsequence that you want to find with the specified regular-expression pattern.

The Matched field shows each match that is found, together with its starting index, ending index, and subpatterns.
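The kind of information that the Matched field reports can be reproduced with Python's re module. This is a sketch only: the function name preview_matches, the sample pattern, and the sample input are hypothetical, and the index convention of the editor might differ from Python's, where the ending index points one position past the last matched character.

```python
import re


def preview_matches(pattern, input_text):
    """List every match with its start index, end index, and subpatterns."""
    results = []
    for match in re.finditer(pattern, input_text):
        results.append({
            "match": match.group(0),
            "start": match.start(),
            "end": match.end(),                   # exclusive in Python
            "subpatterns": list(match.groups()),  # groups 1, 2, ...
        })
    return results


found = preview_matches(r"(\d+)-(\d+)", "Call 030-12345 now")
```

For this input, found holds a single entry for the match 030-12345 with start index 5, end index 14, and the subpatterns 030 and 12345.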

Features

The table in the Features section shows the features and their properties. For each feature, you can specify the value to be used if the current regular-expression pattern matches.

From the drop-down list, you can select a simple subpattern reference or a fixed value. If the range type of the feature is String, you can also select a mix of fixed values and subpattern references.

The subpattern identifies a capturing group within the regular-expression pattern of the current rule in terms of a Java capturing group. You can assign a single number from 0 to 9. 0 denotes the whole subsequence, 1 denotes the first match group, 2 denotes the second match group, and so on.

Examples for feature values:

Subpattern matching

You can map a feature to a capturing group (subpattern) of a pattern.

To map to the entire regular expression, the value $0 is used. To map to the first subpattern, the value $1 is used, and so on.

Depending on the regular expression that is used in the subpattern, the feature can have the data type Integer, Float, or String. For example, if you want to map to the first subpattern of the rule (\d*)-(\d+)-(\d+), you use $1, which stands for (\d*). Because the subpattern (\d*) matches only digits, the data type can be Integer or String.
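The mapping from a subpattern reference to a capturing group can be tried out as follows. This is a minimal Python sketch, not the annotator's own code; the helper resolve_reference and the sample phone strings are hypothetical.

```python
import re

RULE = r"(\d*)-(\d+)-(\d+)"


def resolve_reference(reference, match):
    """Resolve a reference such as $0 or $1 against a match object."""
    if not re.fullmatch(r"\$[0-9]", reference):
        raise ValueError("expected $0 to $9, got: " + reference)
    return match.group(int(reference[1]))


match = re.search(RULE, "0049-30-12345")
whole = resolve_reference("$0", match)   # the entire match
first = resolve_reference("$1", match)   # the text matched by (\d*)

# (\d*) also matches zero digits, so $1 can be an empty string.
no_prefix = re.search(RULE, "-30-12345")
empty = resolve_reference("$1", no_prefix)
```

Because (\d*) can match an empty string, which cannot be converted to a number, a subpattern such as (\d+) may be the safer choice when the feature has the data type Integer.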

Fixed Values

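If you create a separate rule for each country, the pattern of each rule contains the country code as a literal. For example, the pattern for Germany looks like this: (0049)-(\d+)-(\d+). In this rule, you can set the feature to a fixed value, for example, Germany. A fixed value forces the data type to String.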
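The idea of one rule per country, each with its own fixed value, can be sketched in Python as follows. The list COUNTRY_RULES, the function country_for, and the second rule for France are assumptions added for illustration; only the pattern for Germany comes from this topic.

```python
import re

# Each entry pairs a rule pattern with the fixed value of its feature.
COUNTRY_RULES = [
    (r"(0049)-(\d+)-(\d+)", "Germany"),
    (r"(0033)-(\d+)-(\d+)", "France"),  # hypothetical additional rule
]


def country_for(text):
    """Return the fixed value of the first rule whose pattern matches."""
    for pattern, country in COUNTRY_RULES:
        if re.search(pattern, text):
            return country
    return None  # no rule matched


german = country_for("0049-30-12345")
unknown = country_for("0001-2-3")
```

The fixed value does not depend on what the subpatterns captured; it is the same for every match of the rule.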

Mixed usage

You can also combine a fixed value with one or more subpattern references in the value of a feature, for example, by setting the value of the feature to country code: $1. The feature must have the range type String. At run time, $1 is replaced with the result of the first subpattern.
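The run-time replacement can be imitated with a small template expansion. The following Python sketch assumes that every $n in the value (n from 0 to 9) is replaced by the text of capturing group n; the function expand_feature_value and the sample input are hypothetical and do not come from the product.

```python
import re


def expand_feature_value(template, match):
    """Replace each $n in the template with the text of capturing group n."""
    def substitute(reference):
        group_number = int(reference.group(1))
        # A group that did not take part in the match yields an empty string.
        return match.group(group_number) or ""
    return re.sub(r"\$([0-9])", substitute, template)


match = re.search(r"(0049)-(\d+)-(\d+)", "0049-30-12345")
value = expand_feature_value("country code: $1", match)
```

With this input, value is the string country code: 0049.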