ReplaceRegex function

Applies to: TBM Studio 12.0, 12.1

Use the ReplaceRegex function to cleanse data by replacing text using regular expressions.

Where to use

This function can be used in:
  • Data sets
  • Formula columns in report tables (Though this is not recommended. Use inference instead.)
  • Dynamic text

Syntax

ReplaceRegex(column_name,match_expression,replacement_expression)

Arguments

column_name

The name of the column containing the text to be replaced.

match_expression

The regular expression that will identify the text to be replaced.

replacement_expression

The regular expression used to create the replacement text.

Return type

String

Example

Assume the following syntax:

ReplaceRegex([Source Column],[Regex Statement],[Capture Group])

If the regex statement finds a match in the source column, then the formula will return the value of the capture group. If no match is found, it will return the value of the source column. The regex flavor that is supported in the function is Java.

Here's a specific example:

=ReplaceRegex(Journal Entry
      Description,"^.*(INVOICE).{0,8}?(\d{10}).*","$2")

This formula will look for the word "INVOICE" (in all caps) in the column Journal Entry Description. If it finds that string, it will return the first 10-digit number it finds within 8 characters of the end of the word. You could use this formula to parse 10-digit invoice numbers out of a free text journal entry description. In this example, the first capture group ($1) is "INVOICE" and the second capture group ($2) is the ten digit number (\d{10}).

Here's an example of its output:

Now, let's break down what's going on in the regular expression of this example:
  • The final .* matches any character, any number of times.
  • Again, the parenthesis specify this number is a capture group. In this case, it's capture group is 2.
  • 10} means that the prior rule must apply to 10 characters in a row. So, \d{10} means "match 10 numbers in a row."
  • The \d means to match any number.
  • The question mark (?) means that the prior match should match the smallest number of characters possible. For example, if you have an 11-digit number, the first 10 digits will be matched by the next term, not the last 10. Removing the ? will match the last 10 instead.
  • The period (.) means the same as before. However, the {0,8} means that it matches as few as 0 and as many as 8 characters.
  • The parenthesis that invoice is wrapped in specifies that it's value should be captured. The leftmost value wrapped in parenthesis is capture group 1. Each additional capture group increments the number by 1, from left to right.
  • The word "INVOICE" means match the exact word INVOICE. It's case sensitive.
  • The asterisk (*) means that the preceding criteria can match any number of characters. So, period asterisk (.*) means match any number of any character.
  • The period (.) is a wild card that will match any characters.
  • The caret (^) means that our regular expression must start matching at the beginning of the value in the Journal Entry Description.

Notice the output of the row:

MY MAIN INVOICE # IS 6872954678, BUT WE ALSO HAVE A MASTER INVOICE OF 1234567890

This output is matching the second invoice number, not the first. This is because of the very first period Asterisk (.*) in the regular expression. This term will match as many characters as it can until it finds one that it cannot match. If we wanted it to instead match the smallest number of characters possible, we could add a ? to our regular expression giving:

=ReplaceRegex(Journal Entry
      Description,"^.*?(INVOICE).{0,8}?(\d{10}).*","$2")

For more information on regex syntax or to test your own regex statements, visit http://www.rexegg.com/regex-quickstart.html and regex101.com.