IBM Support

Regular Expressions

News


Abstract

Regular expression support has been added to Db2 for i.

Content


Regular expression support has been added to Db2 for i.  Consult the SQL Reference for complete details.  An overview of the support appears below.


Regular Expressions
Regular expressions have been around for a long time as a way to find certain patterns within a string. Regular expression patterns are expressed using a combination of characters and predefined meta characters.

A common task is trying to find all strings within a larger string that look like a phone number. The regular expression pattern '[\d]{7}' will find all instances of 7 digits in a row. What this pattern is saying is 'find the character 0 or 1 or 2 or 3... or 9 a total of 7 times sequentially'. \d is short hand for any digit character and is differentiated from the 'd' character by \, the escape character. The first pattern could be modified to look for phone numbers with hyphens by using '\d{3}-\d{4}'. This pattern reads 'find 3 digits followed by a dash followed by 4 more digits'.

With the new regular expression functions, one could count how many phone numbers are found, return all instances of a phone number, replace all phone numbers with another, or completely remove all phone numbers. And that's just one simple use case. This query support could be used to search for names, addresses, social security numbers, credit card numbers, file names, file types, dates, variable declarations, and anything else that follows a known pattern.

All of the regular expression functions work with a wide variety of data types.

 



REGEXP_COUNT
REGEXP_COUNT is used to find out how many times a pattern appears in a string.
   

This example determines how many error messages appear in a string:
values REGEXP_COUNT('ERROR -213 ERROR -5921','ERROR -\d+') → 2.

 


REGEXP_LIKE
REGEXP_LIKE is a predicate and as such is used in the WHERE clause of your query to reduce the number of rows selected.

A simple example:

select 1 from sysibm.sysdummy1 where REGEXP_LIKE('foobar','foo')

Another example shows how to search for rows including people whose birthday is in June or July
(assuming YMD format for the date):
select * from PeopleList where REGEXP_LIKE(PeopleList.BirthdayDate,'\d\d\d\d-0[67]-\d\d')

 



REGEXP_INSTR
REGEXP_INSTR is useful for getting the position of patterns in a string. Depending on the parameters, it can return the beginning position or the position after the end of the matched pattern. This follows standard substring rules where the end parameter is the exclusive end of the string rather than the inclusive end. The values returned use a 1 based array system not a 0 based array system.

By default it returns the starting position:
values REGEXP_INSTR('skcor usi', 'c..') → 3
values REGEXP_INSTR('Vanilla Ice was the best artist to come out of the 80s','ice',1,1,1,'i') → 12 because it is 1 based and because parameter 5 indicated that it should return the position after the end of the match.

 



REGEXP_REPLACE
REGEXP_REPLACE is useful not only for its ability to change information, but for the ability to remove it.
Back to the phone example, it can be used to cover sensitive information.
Values REGEXP_REPLACE('My name is Steve and my phone number is 555-5555', '\d{3}-\d{4}','xxx-xxxx') → 'My name is Steve and my phone number is xxx-xxxx'
Or it can be used to remove unneeded or erroneous information.
values REGEXP_REPLACE('MyTextDoc.exe','\....','') → MyTextDoc

 



REGEXP_SUBSTR
REGEXP_SUBSTR returns, as you would guess, a substring of the matched pattern. An example of this is trying to retrieve a website from a string.
values REGEXP_SUBSTR('Like, the best website ever is myfacespacebook.gov. I post all of my pictures of clouds that look like things there. Hey, have you heard of the cloud? Is it a real cloud? What happens to your pictures if it rains?', '(\w+\.)+((org)|(com)|(gov))')

With REGEXP_SUBSTR, you could easily mine the string to find any websites referenced. It would be just as easy to look for email addresses.

 


There are many other ways to tweak regular expressions to get what you're looking for. Starting position, parameter flags such as 'i' for case insensitive, and all of the shorthand characters are available to help make the best match in the easiest way.

values REGEXP_INSTR('echoECHOecho', 'echo', 3, 1, 0, 'i') → 5
The explanation for this is simple. It starts a position 3. Then it finds the first echo, ECHO, or anything between because 'i' is used as a flag which meaning case insensitive. 'c' is the default value, meaning case sensitive. If that were the case, we would get 9 for a result. We want the starting position as signified by the 5th parameter, 0. After it is run, 5 is the position returned.

Regular expressions are extremely useful and powerful. Data analysis and analytics are made much simpler by honing in on important information. Searching, counting, and many string operations are simplified without enormous overhead or performance costs.

In this final example, we see a way to use everything together. Here is a way to get all of the phone numbers from a string. First you find out the total number of occurrences. Then you use the number in a loop invariant and as the occurrence parameter in REGEXP_SUBSTR.

CREATE VARIABLE Loop INT DEFAULT 1;
CREATE VARIABLE NumTimes INT;
SET NUMTIMES = REGEXP_COUNT(SourceString, RegexPhoneNumPattern);

WHILE Loop <= NumTimes DO
  INSERT INTO ResultsTable VALUES   
    REGEXP_SUBSTR(SourceString,RegexPhoneNumPattern,1,LOOP);
END WHILE;

[{"Business Unit":{"code":"BU054","label":"Systems w\/TPS"},"Product":{"code":"SWG60","label":"IBM i"},"Component":"","Platform":[{"code":"PF012","label":"IBM i"}],"Version":"All Versions","Edition":"","Line of Business":{"code":"LOB08","label":"Cognitive Systems"}}]

Document Information

Modified date:
13 January 2020

UID

ibm11164718