Matches regular expression check

This data rule definition logic checks to see if your data matches a regular expression.

Syntax

Positive:

source-data  MATCHES_REGEX pattern-string

Negative:

source data  NOT MATCHES_REGEX pattern-string

Description

You can use a regular expression to search for a particular character, or a character string in the beginning, middle, or end of a word. You can also use regular expressions to find a particular variable, a range of characters, or specific kinds of characters, such as printable or unprintable characters. When positive, it finds rows for which the source data matches the specified expression. When negative, it finds rows for which the source data does not match the specified expression.

Any lowercase character that you specify in a pattern string searches for that specific lowercase character. Any uppercase character that you specify in a pattern string indicates that particular uppercase letter. Any specific digit you specify, from zero to nine, indicates the particular digit that you specified in the pattern string. You can use these operators in combination with other valid characters to build very complex expressions.

The following table provides examples of regular expression operators that you can use to construct your regular expression data rule definition logic:
Table 1. Examples of regular expression operators
Operator Function
. (period) Indicates a match for a single character.
$ (dollar sign) Indicates the end of a line.
^ (caret) Indicates that the pattern string starts at the beginning of a line.
[uppercase character] Indicates a match for a specific uppercase character.
[lowercase character] Indicates a match for a specific lowercase character.
[digit from 0–9] Indicates a match for a specific single digit.
\d [a particular digit from 0–9] Indicates a match for a specific single digit.
\b Indicates a match at a word boundary.
\s Indicates a match for any whitespace character.
\S Indicates a match for any non-whitespace character.
\w Indicates a match for any single alphanumeric character.
[ ] Indicates the start for the character class definition that you specify.
[^X] Indicates a match for the start of a line, followed by the specific characters that you specify.
X? Indicates a match for no or one occurrence of the qualifier. It also extends the meaning of the qualifier.
X* Indicates a match for a qualifier of zero or more.
X+ Indicates a match for one or more qualifiers.
{X, Y} Indicates a match between X and Y number of occurrences.
(X|Y) + Indicates the start of an alternative branch.
? Indicates an optional match.

Regular expression support

Additional regular expression information is provided by the Perl Compatible Regular Expression (PCRE) library package, which is open source software, written by Philip Hazel, and copyright by the University of Cambridge, England. The PCRE documentation can be found at http://regexkit.sourceforge.net/Documentation/pcre/pcrepattern.html.

Notes

  • You can use a expression that is made up of scalar functions, arithmetic operations and variables as a source data. If you do this, make sure that each expression evaluates to a character or string data type.
  • To perform a simple format check, you can use the matches format check.
  • To prevent incorrect evaluation results of data that is padded with the null character (ASCII '00') apply the appropriate trim function (trim, ltrim, or rtrim) in the expression.

Examples

Postal Codes
  • You are searching for all US postal codes in a column:
    '\b[0-9]{5}(?:-[0-9]{4})?\b'
  • You are searching for all Canadian postal codes in a column:
    '\b[ABCEGHJKLMNPRSTVXY][0-9][A-Z] [0-9][A-Z][0-9]\b'  
  • You are searching for all UK postal codes in a column:
    '\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b'
  • You are searching for all US postal codes in a column that start with '02', such as 02112, 02113, 02114, and so on:
    '\b02[0-9]{3}(?:-[0-9]{4})?\b'
Phone numbers
  • You are searching for phone numbers in the standard North American format. You want to account for variations in the format such as 123-456-7890, (123)-456-7890, 123 456 7890 , 123.456.7890 and so on:
    '\(?[0-9]{3}\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}'
  • You want to search for phone numbers with the US area code of 212:
    \(?212\)?[-. ]?[0-9]{3}[-. ]?[0-9]{4}'
IP addresses
  • You want to find IP addresses that match 0.0.0.0 to 999.999.999.999:
    '\b([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\b'
    or
    '\b(?:[0-9]{1,3}\.){3}[0-9]{1,3}\b'
  • You want to verify that all values in a column have the format of an IP address:
    ipaddress matches_regex '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}'
Specific words
  • You are searching for a word:
    '\b%word%\b'
  • You are searching for words that occur more than twice in a column:
    '\b(\w+)\s+\1\b'
  • You are searching for pairs of words from the following list: firstword, secondword, thirdword. You want to find the words that are separated by at least one word, but up to five words:
    '\b(firstword|secondword|thirdword)(?:\W+\w+){1,5}\W+
    (firstword|secondword|thirdword)\b'
  • You are searching from one of the following words: firstword, secondword, or thirdword.
    '\b(?:firstword|secondword|thirdword)\b'
  • You are searching for a word that does not begin with uppercase characters:
    '\b(a-z)+(a-z,A-Z)*\b'
Others
  • Find names at the beginning of a line, which start with "Smi,":
    firstnamecolumn matches_regex '^Smi'
  • Find all file names that begin with an uppercase A, followed by any number of characters ending with ".txt":
  • filenamecolumn matches_regex '^A.*\.txt$'
  • You are searching for an employee named Kristie Jones who has accounts on various applications, in which she has used the first names Kristie, Kristin, and Kristen as her first name:
    firstname matches_regex ('Kristie'|'Kristin'|'Kristen')[ ]'Jones'