Contents


A Java solution to use regular expressions and pattern matching in DB2 for Linux, UNIX, and Windows 9.7

Comments

DB2 for Linux, UNIX, and Windows 9.7 and regular expressions

In programming, a regular expression is written in a formal language, also known as regex or regexp. It is a formal pattern that a specific program looks for in text or a string of characters using a recognition process called pattern matching.

You might want to use pattern matching and regular expressions to perform basic functions like:

  • checking to see whether a given sequence would look like a given pattern
  • replacing subsequences with alternative subsequences if you have a specific pattern in a sequence
  • matching a specific occurrence of a subsequence in a given input sequence
  • matching a specific occurrence and position of a subsequence in a given input sequence

DB2 for Linux, UNIX, and Windows does not natively support the use of regular expression functions. If you come across an instance where you need to use regular expression functions, then you will likely have to create user-defined functions to handle your requirement.

This article shows you how to write your own Java user-defined functions to get the benefits of the java.util.regex package which offers atomic features around pattern matching. This package offers an engine that performs operations on character sequences by interpreting patterns, a pattern being a compiled expression of a regular expression. Java patterns are in conformance with Level 1 of Unicode Technical Standard #18: Unicode Regular Expression Guidelines, plus RL2.1 Canonical Equivalents.

Solutions discussed in this article do not provide support for collating sequences, extended grapheme clusters, or tailored grapheme clusters.

This article includes the following sections:

  • Java regex functions that propose specifications for the Java functions.
  • Matching modes, which describe the different types of pattern matching that can be specified through the function's parameter MODE.
  • Example Java code, which shows an example of Java code that could be used to implement regex functions, as well as the related code for DB2 user defined functions.
  • Enablement, which explains how to deal with the zip file available for download from this article, which gives you a complete example that you can reuse.
  • Example SQL statements, which formulates some SQL queries using DB2 Java user-defined functions that are discussed in this article.

Java regex functions

First function

INTEGER REGEXP_LIKE (SOURCE VARCHAR(3000),
                     REGEX VARCHAR(512),
                     MODE VARCHAR(3) )

This function compiles the given regular expression into a pattern with the given mode and attempts to match the given input against it.

Parameters:

  • source: The input string the regular expression is matched against.
  • regex: The regular expression.
  • mode: Match modes:
    • CASE_INSENSITVE_MODE
    • DOTALL_MODE
    • MULTI_LINE_MODE

Return value:

  • Value is True if the source matches against the regular expression, otherwise false is returned.
  • Specified mode should be correct, otherwise false is returned.
  • If the regular expression is not well formed, then false is returned.
  • If the source string is null or empty, then 0 is returned.

Second function

VARCHAR(3000) REGEXP_REPLACE (SOURCE VARCHAR(3000), 
					REGEX VARCHAR(512), 
					REPLACEMENT VARCHAR(3000), 
					POSITION INTEGER,
					OCCURRENCE INTEGER,
					MODES VARCHAR(3))

This function searches for given patterns in a string and replaces them with a new value.

Parameters:

  • source: The string to search in.
  • pattern: The pattern to search for.
  • Replacement: The string that will replace the found string.

    The replacement string may contain references to subsequences captured during the previous match.

    Each occurrence of $g will be replaced by the result of evaluating group(g). The first number after the $ is always treated as part of the group reference. Subsequent numbers are incorporated into g if they would form a legal group reference. Only the numerals 0 through 9 are considered as potential components of the group reference. If the second group matched the string foo, for example, then passing the replacement string $2bar would cause foobar to be appended to the string buffer.

    A dollar sign ($) may be included as a literal in the replacement string by preceding it with a backslash (\$).

  • startPosition: Where in the search string you wish to start search from. This value should be greater or equal to 1. Any value lower than 1 would result in this method to return null. The position of the first character is 1.
  • occurrence: The occurrence parameter specifies which match to replace. A value of 0 means replace all values. If the number specified is greater than zero, then only the occurrence found is replaced. If there are fewer matches found than required, nothing is done to the source.
  • mode: Match modes.

Return value:

This is the string that has the appropriate regular expressions matches replaced with the replacement value. It returns null if the source string is null or empty.

Third function

VARCHAR(3000) REGEXP_SUBSTR (	SOURCE VARCHAR(3000),
					REGEX VARCHAR(512),
					POSITION INTEGER,
					OCCURRENCE INTEGER,
					MODES VARCHAR(3) )

This function attempts to match and return a specific occurrence of a pattern that is defined by the regular expression against the source string.

Parameters:

  • source: The string to search in.
  • pattern: The pattern to search for.
  • startPosition: Where in the search string you wish to start a search from. This value should be greater or equal to 1. Any value lower than 1 would result in this method to return null. The position of the first character is 1.
  • occurrence: The occurrence parameter specifies which match to get a value of. A value of 1 implies to get the first match, and a higher number means that an attempt to match the regex starting at the end of the previous match is done until all matches specified are found, and then only the last match is returned. If there are fewer matches found than required, then null is returned.
  • mode: Match modes.

Return value:

Returns the subsequence of the previous match. It returns a null if no match is found, or if the source string is null or empty. Position 0 is the first character in the source string.

Fourth function

INTEGER REGEXP_INSTR (	SOURCE VARCHAR(3000),
	                  REGEX VARCHAR(512),
				POSITION INTEGER,
				OCCURRENCE INTEGER,
				ROPT INTEGER,
				MODES VARCHAR(3))

This function attempts to match a specific occurrence of the pattern that is defined by the regular expression against the source string. It then returns the position in the source string of either the first character of the matched occurrence, or the first character after the matched occurrence.

Parameters:

  • source: The string to search in.
  • pattern: The pattern to search for.
  • startPosition: Where in the search string you wish to start search from. This value should be greater or equal to 1. Any value lower than 1 would result in this method to return 0. The position of the first character is 1.
  • occurrence: The occurrence parameter specifies which match to get. A value of 1 implies to get the first match. A higher number means that an attempt to match the regex starting at the end of the previous match is done until all matches specified are found. Only the last match is returned. If there are fewer matches found than required, 0 is returned.
  • returnOption: Set returnOption to 0 to get the position of the first character in match. If set to 1 the position of the first character after the match is returned. Any other value will result in pattern not being found and 0 will be returned.
  • Mode: Match modes.

Return value:

This is the character matched position. Position 1 is the first character in the source string. It returns 0 if the match cannot be found or -1 if source string is null or empty.

Matching modes

The mode parameter that each of the four regex functions accepts should be a string of up to three characters, out of four possible modes. The mode i turns on case insensitive matching, while inm turns on those three options. Modes i and c are mutually exclusive. If you omit this parameter or pass an empty string, the default matching modes are used.

  • Mode i: Turn on case insensitive matching. The default is case sensitive.
  • Mode c: Turn on case sensitive matching.
  • Mode n: Make the dot match any character, including new lines. By default, the dot matches any character except new lines.
  • Mode m: Make the caret (^) and dollar sign ($)match at the start and end of each line, for example, after and before line breaks embedded in the source string. By default, these only match at the very start and the very end of the string.

If none of the above modes is specified, the value returned will be equal to UNIX_LINES, which is always on, and means that only the \n line terminator is recognized in the behavior of ., ^, and $.

The mode specified using the given parameters described above should respect the following pattern: ^([nmi]{0,3})|([nmc]{0,3})$.

If the specified mode is incorrect, any of the regex functions will return an error.

Implementation

Java code

Regexp Java class
package com.ibm.avalanche.udf.regex;

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class Regexp
{
   private static final String CASE_INSENSITVE_MODE = "i";
   private static final String CASE_SENSITVE_MODE = "c";
   private static final String DOTALL_MODE = "n";
   private static final String MULTI_LINE_MODE = "m";

   public static int regexpInstr(String source, String pattern, int startPosition, 
   int occurrence, int returnOption, String mode) {...}
 
   public static String regexpSubstr(String source, String pattern, int startPosition, 
   int occurrence, String mode) {...}

   public static String regexpReplace(String source, String pattern, String replacement, 
   int startPosition, int occurrence, String mode) {...}

   public static int regexpLike(String source, String regex, String mode) {...}

   public static int setMode(String mode) {…}
}
Function regexpLike
public static int regexpLike(String source, String regex, String mode)
{
   int flags = 0;

   if (  source == null || source.length()<1)
   {
      return 0;
   }

   if ((flags = setMode(mode)) == -1)
   {
      return 0;
   }
   try
   {
      Matcher matcher = Pattern.compile(regex, flags).matcher(source);

      if (matcher.find())
      {
         return 1;
      }
   }
   catch (PatternSyntaxException e)
   {
      return 0;
   }

      return 0;
}
Function regexpReplace
public static String regexpReplace(String source, String pattern, String replacement, 
int startPosition, int occurrence, String mode)
{
   int flags = -1;

   if (  source == null || source.length()<1)
   {
      return null;
   }

   if ((replacement == null) || (startPosition-- <= 0) || ((flags = setMode(mode)) == -1))
   {
      return source;
   }

   Pattern pat;
   try
   {
      pat = Pattern.compile(pattern, flags);
   }
   catch (PatternSyntaxException e)
   {
      return source;
   }

   Matcher matcher = pat.matcher(source);
   int count = 0;
   StringBuffer sb = new StringBuffer();

   if (matcher.find(startPosition))
   {
      if (occurrence == 0)
      {
         matcher.appendReplacement(sb, replacement);
         while (matcher.find())
         {
            matcher.appendReplacement(sb, replacement);
         }
      }
      else
      {
         do
         {
            if (++count == occurrence)
            {
               matcher.appendReplacement(sb, replacement);
               break;
            }
         } while (matcher.find());
      }
      matcher.appendTail(sb);

      return sb.toString();
   }

   return source;
}
Function regexpSubstr
public static String regexpSubstr(String source, String pattern, int startPosition, 
int occurrence, String mode)
{
   int flags = -1;

   if (  source == null || source.length()<1)
   {
      return null;
   }

   if ((startPosition-- <= 0) || ((flags = setMode(mode)) == -1))
   {
      return null;
   }

   Pattern pat;
   try
   {
      pat = Pattern.compile(pattern, flags);
   }
   catch (PatternSyntaxException e)
   {
      return null;
   }

   Matcher matcher = pat.matcher(source);
   String out = null;
   int count = 0;

   if (matcher.find(startPosition))
   {
      out = matcher.group();
      count++;
      while ((count < occurrence) && (matcher.find()))
      {
         out = matcher.group();
         count++;
      }
   }

   return (out = (count != occurrence) ? null : out);
}
Function regexpInstr
public static int regexpInstr(String source, String pattern, int startPosition, 
int occurrence, int returnOption, String mode)
{
   int flags = -1;

   if (  source == null || source.length()<1)
   {
      return -1;
   }

   if ((returnOption != 1) && (returnOption != 0) || (startPosition-- <= 0) || 
   ((flags = setMode(mode)) == -1))
   {
      return 0;
   }

   Pattern pat;
   try
   {
      pat = Pattern.compile(pattern, flags);
   }
   catch (PatternSyntaxException e)
   {
      return 0;
   }

   Matcher matcher = pat.matcher(source);
   int out = 0;
   int count = 0;

   if (matcher.find(startPosition))
   {
      out = (returnOption == 0 ? matcher.start() : matcher.end()) + 1;
      count++;
      while ((count < occurrence) && (matcher.find()))
      {
         out = (returnOption == 0 ? matcher.start() : matcher.end()) + 1;
         count++;
      }
   }

   return (out = (count != occurrence) ? 0 : out);
}
Function setModes
public static int setMode(String mode)
{
   int flag = Pattern.UNIX_LINES;

   try
   {
      if (!Pattern.matches("^([nmi]{0,3})|([nmc]{0,3})$", mode))
      {
         return -1;
      }

      flag |= (mode.contains(CASE_INSENSITVE_MODE) ? Pattern.CASE_INSENSITIVE : 0);
      flag |= (mode.contains(DOTALL_MODE) ? Pattern.DOTALL : 0);
      flag |= (mode.contains(MULTI_LINE_MODE) ? Pattern.MULTILINE : 0);

      // Meaningless from the code point of view but is made to insure consistency with 
      Oracle modes. This // method has to provide the CASE_SENSITIVE_MODE mode for the 
      'c' option.
      flag |= (mode.contains(CASE_SENSITVE_MODE) ? 0 : 0);
   }
   catch (PatternSyntaxException e)
   {
      return -1;
   }
   catch (NullPointerException ne)
   {
     return -1;
   }

   return flag;
}

User defined functions

Deploy JAR file in DB2
CALL SQLJ.INSTALL_JAR('file:C:\avalanche\db2_regex\lib\db2_regex.jar', db2_regex);
@
User define function REGEXP_LIKE
CREATE OR REPLACE FUNCTION REGEXP_LIKE(SOURCE VARCHAR(3000), REGEX VARCHAR(512), 
MODE VARCHAR(3))

RETURNS INTEGER
FENCED 
NOT DETERMINISTIC
NO SQL
LANGUAGE JAVA
PARAMETER STYLE JAVA
EXTERNAL NAME 'db2_regex:com.ibm.avalanche.udf.regex.Regexp.regexpLike'
NO EXTERNAL ACTION
@
User define function REGEXP_REPLACE
CREATE OR REPLACE FUNCTION REGEXP_REPLACE(SOURCE VARCHAR(3000), REGEX VARCHAR(512), 
REPLACEMENT VARCHAR(3000), POSITION INTEGER, OCCURRENCE INTEGER, MODES VARCHAR(3))

RETURNS VARCHAR(3000)
FENCED 
NOT DETERMINISTIC
NO SQL
LANGUAGE JAVA
PARAMETER STYLE JAVA
EXTERNAL NAME 'db2_regex:com.ibm.avalanche.udf.regex.Regexp!regexpReplace'
NO EXTERNAL ACTION
@
User define function REGEXP_SUBSTR
CREATE OR REPLACE FUNCTION REGEXP_SUBSTR(SOURCE VARCHAR(3000), REGEX VARCHAR(512), 
POSITION INTEGER, OCCURRENCE INTEGER, MODES VARCHAR(3))

RETURNS VARCHAR(3000)
FENCED 
NOT DETERMINISTIC
NO SQL
LANGUAGE JAVA
PARAMETER STYLE JAVA
EXTERNAL NAME 'db2_regex:com.ibm.avalanche.udf.regex.Regexp!regexpSubstr'
NO EXTERNAL ACTION
@
User define function REGEXP_INSTR
CREATE OR REPLACE FUNCTION REGEXP_INSTR(SOURCE VARCHAR(3000), REGEX VARCHAR(512), 
POSITION INTEGER, OCCURRENCE INTEGER, ROPT INTEGER, MODES VARCHAR(3))

RETURNS INTEGER
FENCED 
NOT DETERMINISTIC
NO SQL
LANGUAGE JAVA
PARAMETER STYLE JAVA
EXTERNAL NAME 'db2_regex:com.ibm.avalanche.udf.regex.Regexp!regexpInstr'
NO EXTERNAL ACTION
@

Enablement

The following steps show you how to download and try out the sample code.

  1. Save attached db2-regex.zip file to a new folder on the hard disk, for example C:\avalanche.
  2. Extract all files from the zip file to C:\avalanche\db2_regex.
  3. Open C:\avalanche\db2_regex\scripts\sql\db2_regex_functions.sql.
  4. Edit the script to suit your environment. On the second line, change the directory name. If you are using UNIX, the syntax should be something like:
    CALL SQLJ.INSTALL_JAR('file:/tmp/avalanche/db2_regex/lib/db2_regex.jar', db2_regex)
  5. In a DB2 command line processor window, run the following command lines:
    db2 connect to <my_db> user <uid> using <pwd>
    db2 set current schema='REGEXP'
    db2 -td@ -vf "C:\avalanche\db2_regex\scripts\sql\db2_regex_functions.sql"

If you have already deployed the package and you want to deploy it again, for example, if you have modified the JAVA code and you want to try the updated code, then do the following steps:

  1. Extract the zip file to C:\avalanche\db2_regex.
  2. From a DB2 command window, execute the following commands:
    db2 connect to <my_db> user <uid> using <pwd> db2 set current schema='REGEXP' db2 drop function REGEXP_LIKE db2 drop function REGEXP_REPLACE db2 drop function REGEXP_SUBSTR db2 drop function REGEXP_INSTR db2 call SQLJ.REMOVE_JAR(db2_regex) db2stop force db2start db2 connect to <my_db> user <uid> using <pwd> db2 set current schema='REGEXP' db2 -td@ -f C:\avalanche\db2_regex\scripts\sql\db2_regex_functions.sql

You should check to ensure that no error was raised at any point.

Example of SQL statements using regex functions

select ID from REGEXP.REGEXP_STRINGS where 
REGEXP_LIKE(STRING, '^.EF[ ]+SAVEALIAS[ ]+[0-9]+', 'c') > 0

select ID from REGEXP.REGEXP_STRINGS where 
REGEXP_REPLACE(STRING, '^.EF[]+SAVEALIAS[ ]+[0-9]+', 'XX', 1, 1, 'c')='XX'

select ID from REGEXP.REGEXP_STRINGS where 
REGEXP_SUBSTR(STRING, '^.EF[ ]+SAVEALIAS[ ]+[0-9]+', 1, 1, 'c')='DEF SAVEALIAS 2210'

select ID from REGEXP.REGEXP_STRINGS where 
REGEXP_INSTR(STRING, '^.EF[ ]+SAVEALIAS[ ]+[0-9]+',1, 1, 1, 'c') > 0

Conclusion

With this article, you have learned how to create DB2 Java user-defined functions to handle regular expression functions in your SQL statements.

You saw four examples of functions, to which you can extend, modify, and add new functions to help you better match your own requirements.

And you can use the available download to get a full example of the code, which is a sample and available as is. Please, use it freely, modify it, and adapt it for your personal use, but make sure to test and validate it before you use it in any live system.


Downloadable resources


Related topics


Comments

Sign in or register to add and subscribe to comments.

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=580349
ArticleTitle=A Java solution to use regular expressions and pattern matching in DB2 for Linux, UNIX, and Windows 9.7
publish-date=11182010