Bringing the Power of Regular Expression Matching to SQL

The ability to find patterns in a string is a common scenario in many text-based applications. This article describes how to extend DB2 to integrate a library for regular expression matching into DB2, thereby making the task much easier and more efficient.

Share:

Knut StolzeIBM Germany

Knut Stolze started his work with DB2 when he joined IBM as a visiting scientist at the Silicon Valley Lab where he worked on the DB2 Image Extender. He moved on to the DB2 Spatial Extender Version 8 and was responsible for several enhancements to improve the usability, the performance, and the standard-conformance of the Extender for more than two years. Currently, he works as a teaching assistant at the University of Jena, Germany, and continues his work for IBM in the area of federated databases. He can be reached through the newsgroups comp.databases.ibm-db2 and ibm.software.db2.udb.spatial or at stolze@de.ibm.com.



23 January 2003

Also available in Japanese

© 2003 International Business Machines Corporation. All rights reserved.

Important: Please read the disclaimer before reading this article.

Introduction

IBM DB2 e-kit for Database Professionals

Learn how easy it is to get trained and certified for DB2 for Linux, UNIX, and Windows with the IBM DB2 e-kit for Database Professionals. Register now, and expand your skills portfolio, or extend your DBMS vendor support to include DB2.

Finding textual data in a database is a very common scenario in applications. Several full-text-based products are available for IBM® DB2® Universal DatabaseTM, including DB2 Text Information Extender and the DB2 Net Search Extender. However, DB2 Extenders® do not include the functionality to perform regular expression matching, which is used to find and replace patterns in a string. This article describes how you can implement a user-defined function (UDF) that integrates a commonly available library for regular expression matching with DB2. By means of an example, the pcre library (Perl-compatible regular expressions) is used, but it can be replaced with any other library.

The description provided in this article is applicable to both V7 and V8 of DB2 Universal Database for Linux, UNIX®, and Windows®.


Background

This section describes what regular expressions are and why they are useful.

What are regular expressions?

Regular expressions are used to find and replace patterns in strings. The regular expressions are defined using a grammar, and the regular expression engine takes the grammar and compares it with the string. The engine returns an indication of whether the string matches the grammar; that is, if the string contains substrings that can be derived from the grammar. Additionally, the engine can also return the matching substrings. The term "pattern" is used to denote the grammar.

A very basic pattern consists of only a single letter. A string that contain this letter is a "match" when compared with this pattern. For example, if the pattern is "a," then the string "abcd" is a match, but the string "xyz" is not. The power of regular expressions stems from the predefined operators (also called meta-characters) that can express the patterns in a very short space. Depending on the dialect and supported features, different meta-characters might be available. Generally, the following characters are available, among others:

|      - alternatives 
[ ]    - groups 
*      - multiple occurrences (also matches zero occurrences) 
+      - multiple occurrences (at least one) 
?      - optional occurrence 
\\     - backslash

Different systems implement various extensions to the common regular expressions. The regular expressions used in the programming language Perl support further abbreviations. The library that is used in this article implements those extensions. As an excerpt, the following abbreviations, among others, are available in Perl's regular expression language:

\s     - any whitespace character 
\w     - any alphanumeric character 
\d     - any numeric character

Another, more advanced example is the pattern "[A-Z]* = ([0-9]|0x00);". This pattern matches any string that contains a substring that consists of several upper-case letters, followed by a space, an equal sign, another space, and then either a single digit or the string "0x00". The last character of that substring must be a semicolon. Using Perl, this pattern can be expressed as "\w* = (\d|0x00);". Two strings that would match this pattern would be "NM = 0x00;" and "X = 7;". But the string "Z = 123;" does not match because 123 consists of more than one digit.

String matching in DB2

Aside from the Extenders, DB2 allows several functions and operators for text comparisons. But those functions and operators are either restricted in their functionality for pattern matching, or they introduce complexity for the query in which they might be used. Here is a short excerpt of the available functions:

= or <> predicates Compare two strings for equality or non-equality (character-by-character). LIKE predicates Basic pattern matching using wildcards. LOCATE function Find a substring in a string.

Although it is possible to express the pattern "[A-Z]* = ([0-9]|0x00);" using the SQL operators, it can be quite cumbersome. For example, the predicates used in the WHERE clause of the following SELECT statement will match the part after the equal sign in the string "str," as in Listing 1:

Listing 1. Matching a pattern using LIKE
SELECT str 
FROM   strTable 
WHERE ( str LIKE '% = 0;%' OR str LIKE '% = 1;%' OR str LIKE '% = 2;%'  
	OR str LIKE '% = 3;%' OR str LIKE '% = 4;%' OR str LIKE '% = 5;%'  
	OR str LIKE '% = 7;%' OR str LIKE '% = 7;%' OR str LIKE '% = 8;%'  
	OR str LIKE '% = 9;%' OR str LIKE '% = 0x00;%' )

Add to that the complexity of a predicate that can match the "[A-Z]*" sub-pattern, which can be accomplished using a function that iterates over the string and does a character-by-character comparison, and you can see that using built-in functionality is verbose and complex to use.


Example scenario

Let's define the following listing (Listing 2) and insert several rows:

Listing 2. Creating our sample table
CREATE TABLE strTable ( c1 INTEGER, str VARCHAR(500) ); 
INSERT INTO strTable VALUES ( 1, 'some text;' ), 
                            ( 2, 'variable = 1234;' ), 
                            ( 3, 'var2 = ''string variable'';' ), 
                            ( 4, 'xyz = ' ), 
                            ( 5, 'myVar = 0x00;' ), 
                            ( 6, '# comment' ), 
                            ( 7, 'abc = def' );

For all examples below, this listing and its data are used.

SELECT * FROM strTable; 
 
C1          STR 
----------- ------------------------------ 
          1 some text; 
          2 variable = 1234; 
          3 var2 = 'string variable'; 
          4 xyz =  
          5 myVar = 0x00; 
          6 # comment 
          7 abc = def 
 
  7 record(s) selected.

Implementing a pattern-matching function

You can use DB2's extensible mechanisms to use UDFs within a SQL statement to improve the situation quite dramatically. By defining a UDF called regex1 that takes the pattern and the string as input parameters, the WHERE clause in Listing 1 can now be written as shown in Listing 3:

Listing 3. Using the regex UDF to simplify pattern matching
SELECT str 
FROM   strTable 
WHERE regex1('\w* = (\d|0x00);', str) = 1

In this example, the complete pattern is matched using a regular expression with Perl's extensions, not just the partial pattern for which the corresponding LIKE predicates were given in Listing 1. As you can see, it is much easier to write the predicate for this pattern using a function than to express the same semantics using LIKE predicates.

Implementing the UDF

In my example implementation, I chose an existing pattern matching engine, called PCRE (Perl-compatible regular expressions). The engine provides a C API to handle the patterns and perform the matching. The missing piece between that engine and the SQL language used in the queries is the UDF. The UDF consists of two pieces:

  • A CREATE FUNCTION statement that creates (or registers) the function in the database.
  • The body of the function that implements a wrapper around the C API calls for the regular expression matching engine

Listing 4 shows the SQL statement for creating the function.

Listing 4. Registering the regex1 function
CREATE FUNCTION regex1(pattern VARCHAR(2048), string CLOB(10M)) 
    RETURNS INTEGER 
    SPECIFIC regexSimple 
    EXTERNAL NAME 'regexUdf!regexpSimple' 
    LANGUAGE C 
    PARAMETER STYLE DB2SQL 
    DETERMINISTIC 
    NOT FENCED 
    RETURNS NULL ON NULL INPUT 
    NO SQL 
    NO EXTERNAL ACTION 
    ALLOW PARALLEL;

Note: See the DB2 SQL Reference for the detailed meaning of all the clauses. The lengths of the parameters can be adapted to your needs. The values I show here do not imply any recommendations.

The second part consists of a small chunk of C code, which implements the UDF entry point. This entry point is called by DB2 during the query execution for each row to be matched against the pattern. Listing 5 is an example what this code could look like. For the description of the pcre_* function and macros, refer to the documentation of the PCRE library. For compiling the C code and building the shared libraries, refer to the DB2 Application Development Guide.

Listing 5. C code to implement the rege1x UDF entry point
#include <pcre.h> 
#include <sqludf.h> 
 
void regexpSimple( 
    // input parameters 
    SQLUDF_VARCHAR *pattern,      SQLUDF_CLOB *str, 
    // output 
    SQLUDF_INTEGER *match, 
    // null indicators 
    SQLUDF_NULLIND *pattern_ind,  SQLUDF_NULLIND *str_ind, 
    SQLUDF_NULLIND *match_ind, 
    SQLUDF_TRAIL_ARGS) 
{ 
    pcre *re = NULL; 
    const char *error = NULL; 
    int errOffset = 0; 
    int rc = 0; 
 
    // we assume successful return 
    *match_ind = 0; 
 
    // compile the pattern to its internal representation 
    re = pcre_compile(pattern, 0 /* default options */, &error, 
        &errOffset, NULL); 
    if (re == NULL) { 
        snprintf(SQLUDF_MSGTX, 70, "Regexp compilation failed at " 
            "offset %d: %s\n", errOffset, error); 
        strcpy(SQLUDF_STATE, "38900"); 
        (*pcre_free)(re); 
        return; 
    } 
 
    // match the string againts the pattern 
    rc = pcre_exec(re, NULL, str->data, str->length, 0, 
            0 /* default options */, NULL, 0); 
    switch (rc) { 
      case PCRE_ERROR_NOMATCH: 
        *match = 0; 
        break; 
      case PCRE_ERROR_BADOPTION: 
        snprintf(SQLUDF_MSGTX, 70, "An unrecognized bit was set in the " 
            "options argument"); 
        strcpy(SQLUDF_STATE, "38901"); 
        break; 
      case PCRE_ERROR_NOMEMORY: 
        snprintf(SQLUDF_MSGTX, 70, "Not enough memory available."); 
        strcpy(SQLUDF_STATE, "38902"); 
        break; 
      default: 
        if (rc < 0) { 
            snprintf(SQLUDF_MSGTX, 70, "A regexp match error " 
                "occured: %d", rc); 
            strcpy(SQLUDF_STATE, "38903"); 
        } 
        else { 
            *match = 1; 
        } 
        break; 
    } 
 
    // cleanup 
    (*pcre_free)(re); 
    return; 
}

Usage examples

The following query attempts to find all the strings from table strTable that contain comment text. Comments start with a '#', so the pattern is a '#' sign followed by some non-empty text.

SELECT c1, str 
FROM   strTable 
WHERE  regex1('#\s*\w+', str) = 1;

The result contains only the row where c1 = 6.

C1          STR 
----------- ------------------------- 
          6 # comment; 
 
  1 record(s) selected.

In a second example, we try to find those strings that are of the form of an assignment; that is "text = text". To narrow it futher, we are finding only those assignments that have a numerical value on the right-hand side. Hexadecimal representations are treated as valid numerical values.

SELECT c1, str 
FROM   strTable 
WHERE  regex1('\w+\s*=\s*(\d+|0x\d\d)', str) = 1;

Except the two rows where c1 is 2 or 5, no other rows contain an assignment of a numerical value and, thus, do not appear in the result:

C1          STR 
----------- ------------------------- 
          2 variable = 1234; 
          5 myVar = 0x00; 
 
  2 record(s) selected.

Improving performance

Although the above function works as expected, it can be improved to achieve better performance. Note that the faster the execution inside the function is finished, the faster DB2 can process the overall SQL statement.

SQL is designed to process sets of rows, which means that there is usually more than one row to be matched against a pattern. The pattern itself is in most cases constant for the whole SQL statement; that is, it does not change from row to row. The C code in Listing 5 shows that the call to the function pcre_compile(), which converts the given pattern into an internal representation, is performed for each and every row.

DB2 offers a mechanism to carry information from UDF call to UDF call by using a so-called "scratchpad." In addition, you can identify the particular call "type"; that is, it is the first, normal, or the last (final) call that is made to the UDF. With the scratchpad and the call type, it is possible to compile the pattern once and then reuse the internal representation of that compiled pattern for all subsequent calls made to the UDF. At the final call, resources that were allocated during the processing can be released.

Modify the CREATE FUNCTION statement as shown in Listing 6 to tell DB2 to provide a scratchpad and the call type to the external C code:

Listing 6. Adding the scratchpad and call type to the CREATE FUNCTION statement
CREATE FUNCTION regex2(pattern VARCHAR(2048), string CLOB(10M)) 
    RETURNS INTEGER 
    SPECIFIC regexPerf 
    EXTERNAL NAME 'regexUdf!regexpPerf' 
    LANGUAGE C 
    PARAMETER STYLE DB2SQL 
    DETERMINISTIC 
    NOT FENCED 
    RETURNS NULL ON NULL INPUT 
    NO SQL 
    NO EXTERNAL ACTION<font color="0000FF"> 
    SCRATCHPAD 50 
    FINAL CALL</font> 
    ALLOW PARALLEL;

The UDF entry point looks quite different because the logic inside the function must be adapted. The only change for the parameters is to use SQLUDF_TRAIL_ARGS_ALL instead of SQLUDF_TRAIL_ARGS, as shown in Listing 7.

Listing 7. C UDF entry point for regex2
#include <pcre.h> 
#include <sqludf.h> 
 
// data structure mapped on the scratchpad for easier use and access 
// to the objects 
// the size of the scratchpad defined in the CREATE FUNCTION statement 
// must be at least as large as sizeof(scratchPadMapping) 
struct scratchPadMapping { 
    pcre *re; 
    pcre_extra *extra; 
    const char *error; 
    int errOffset; 
}; 
 
void regexpPerf( 
    // input parameters 
    SQLUDF_VARCHAR *pattern,      SQLUDF_CLOB *str, 
    // output 
    SQLUDF_INTEGER *match, 
    // null indicators 
    SQLUDF_NULLIND *pattern_ind,  SQLUDF_NULLIND *str_ind, 
    SQLUDF_NULLIND *match_ind, 
    SQLUDF_TRAIL_ARGS_ALL) // SQLUDF_SCRAT & SQLUDF_CALLT 
{ 
    int rc = 0; 
    struct scratchPadMapping *scratch = NULL; 
 
    // map the buffer of the scratchpad and assume successful return 
    scratch = (struct scratchPadMapping *)SQLUDF_SCRAT->data; 
    *match_ind = 0; 
 
    switch (SQLUDF_CALLT) { 
      case SQLUDF_FIRST_CALL: 
        // initialize data on the scratchpad 
        scratch->re = NULL; 
        scratch->extra = NULL; 
        scratch->error = NULL; 
        scratch->errOffset = 0; 
 
        // compile the pattern (only in the FIRST call 
        scratch->re = pcre_compile(pattern, 0 /* default options */, 
            &scratch->error, &scratch->errOffset, NULL); 
        if (scratch->re == NULL) { 
            snprintf(SQLUDF_MSGTX, 70, "Regexp compilation failed at " 
                "offset %d: %s\n", scratch->errOffset, scratch->error); 
            strcpy(SQLUDF_STATE, "38900"); 
            rc = -1; 
            break; 
        } 
 
        // further analyze the pattern (might return NULL) 
        scratch->extra = pcre_study(scratch->re, 
            0 /* default options */, &scratch->error); 
 
        /* fall through to NORMAL call because DB2 expects a result 
           already in the FIRST call */ 
 
      case SQLUDF_NORMAL_CALL: 
        // match the current string 
        rc = pcre_exec(scratch->re, scratch->extra, str->data, 
              str->length, 0, 0 /* default options */, NULL, 0); 
        switch (rc) { 
          case PCRE_ERROR_NOMATCH: 
            *match = 0; 
            rc = 0; 
            break; 
          case PCRE_ERROR_BADOPTION: 
            snprintf(SQLUDF_MSGTX, 70, "An unrecognized bit was set " 
                "in the options argument"); 
            strcpy(SQLUDF_STATE, "38901"); 
            rc = -1; 
            break; 
          case PCRE_ERROR_NOMEMORY: 
            snprintf(SQLUDF_MSGTX, 70, "Not enough memory available."); 
            strcpy(SQLUDF_STATE, "38902"); 
            rc = -1; 
            break; 
          default: 
            if (rc < 0) { 
                snprintf(SQLUDF_MSGTX, 70, "A regexp match error " 
                    "occured: %d", rc); 
                strcpy(SQLUDF_STATE, "38903"); 
                rc = -1; 
            } 
            else { 
                *match = 1; 
                rc = 0; 
            } 
            break; 
        } 
        break; 
      } 
 
      // cleanup in FINAL call, or if we encountered an error in 
      // the FIRST call (DB2 will make a FINAL call if we encounter 
      // an error in any NORMAL call) 
      if (SQLUDF_CALLT == SQLUDF_FINAL_CALL || 
          (SQLUDF_CALLT == SQLUDF_FIRST_CALL && rc < 0)) { 
          (*pcre_free)(scratch->re); 
          (*pcre_free)(scratch->extra); 
      } 
      return; 
}

To further improve the performance of the function, I added a call to the function pcre_study(), which is provided by the pattern matching engine. That function analyzes the pattern further and stores additional information in a separate structure. This additional information is then used during the actual matching to speed up the processing. Using a very simple pattern and a table of about 4000 rows, I achieved an improvement of 5% in execution time. Of course, the more complex the pattern, the more noticeable the differences will be.

I mentioned above that the implementation assumes that the pattern does not change during the processing from row to row. Of course, you could make a slight adaptation to compile a pattern again if it does change. To do that, it is necessary to keep track of the current (compiled) pattern and to compare it in each call with the provided pattern. The current pattern can also be maintained on the scratchpad. But it must be copied into a separate buffer and cannot be referenced directly via the pointer pattern, because this pointer, or the data referenced by it, might change or become invalid. The respective code changes are left as an exercise for the reader.


Returning matching substrings

Most pattern-matching engine offer a way to return substrings that matched the specified pattern or a part thereof. If you want to use this capability in SQL, you must use a different approach for the implementation of the matching function. A given string might contain more than just one matching substring. For example, when parsing strings like "abc = 123;" or "def = 'some text';" the user might want to retrieve the two substrings, which are separated by the equal sign. You can use the pattern "\w+\s*=\s*(\d+|'[\w\s]*');" to express the syntax rules that apply for the string. Perl-compatible regular expressions let you capture the substrings before and after the equal sign. To that end, the substrings to be captured must be enclosed in parentheses. The second substring is already written that way, but the first substring is not. The final pattern to be used for that purpose is this:

(\w+)\s*=\s*(\d+|'[\w\s]*');

When this pattern is applied to the strings "abc = 123;" or "def = 'some text';", then the "abc" or "def" respectively match with "(\w+)", the spaces and the equal sign is found by "\s*=\s*", and the remaining substring is covered by the alternative "(\d+|'[\w\s*]')". In this alternative, the first option matches any number consisting of at least one digit "\d+" and the second option parses any string consisting of letters and spaces, enclosed in single quotes "'[\w\s]*'".

The requirement on DB2 to do this can be described as returning multiple results for a single call to a UDF. In other words, to return multiple substrings for a single string that is matched against a pattern. DB2's table function are the perfect vehicle for this.

Implementing the table UDF

As before, the function has to be created in the database. The following statement, Listing 8, is used for that:

Listing 8. Registering the table UDF called regex3
CREATE FUNCTION regex3(pattern VARCHAR(2048), string CLOB(10M)) 
    RETURNS TABLE ( position INTEGER, substring VARCHAR(2048) ) 
    SPECIFIC regexSubstr 
    EXTERNAL NAME 'regexUdf!regexpSubstr' 
    LANGUAGE C 
    PARAMETER STYLE DB2SQL 
    DETERMINISTIC 
    NOT FENCED 
    RETURNS NULL ON NULL INPUT 
    NO SQL 
    NO EXTERNAL ACTION 
    SCRATCHPAD 50 
    NO FINAL CALL 
    DISALLOW PARALLEL;

The C code that implements the actual logic of the function is very similar to the code in Listing 7, but it is adapted to the special requirements that must be followed for table functions, as shown in Listing 9.

Listing 9. Implementing the regex3 function to be used with a table function
#include <pcre.h> 
#include <sqludf.h> 
#include <sqlstate.h> 
#include <string.h> 
 
struct scratchPadMapping { 
    pcre *re; 
    pcre_extra *extra; 
    const char *error; 
    int errOffset; 
    int numSubstr; 
    int *substr; 
    int currentSubstr; 
}; 
 
void regexpSubstr( 
    // input parameters 
    SQLUDF_VARCHAR *pattern,      SQLUDF_CLOB *str, 
    // output 
    SQLUDF_INTEGER *pos,          SQLUDF_VARCHAR *substr, 
    // null indicators 
    SQLUDF_NULLIND *pattern_ind,  SQLUDF_NULLIND *str_ind, 
    SQLUDF_NULLIND *pos_ind,      SQLUDF_NULLIND *substr_ind, 
    SQLUDF_TRAIL_ARGS_ALL) // SQLUDF_SCRAT & SQLUDF_CALLT 
{ 
    int rc = 0; 
    size_t length = 0; 
    struct scratchPadMapping *scratch = NULL; 
 
    // map the buffer of the scratchpad and assume NULL return 
    scratch = (struct scratchPadMapping *)SQLUDF_SCRAT->data; 
    *pos_ind = 0; 
    *substr_ind = 0; 
 
    switch (SQLUDF_CALLT) { 
      case SQLUDF_TF_OPEN: 
        // initialize data on the scratchpad 
        scratch->re = NULL; 
        scratch->extra = NULL; 
        scratch->error = NULL; 
        scratch->errOffset = 0; 
        scratch->numSubstr = 0; 
        scratch->substr = NULL; 
        scratch->currentSubstr = 1; // skip the complete match 
 
        // compile the pattern (only in the FIRST call 
        scratch->re = pcre_compile(pattern, 0 /* default options */, 
            &scratch->error, &scratch->errOffset, NULL); 
        if (scratch->re == NULL) { 
            snprintf(SQLUDF_MSGTX, 70, "Regexp compilation failed at " 
                "offset %d: %s\n", scratch->errOffset, scratch->error); 
            strcpy(SQLUDF_STATE, "38900"); 
            rc = -1; 
            break; 
        } 
        // further analyze the pattern (might return NULL) 
        scratch->extra = pcre_study(scratch->re, 
            0 /* default options */, &scratch->error); 
        // determine the number of capturing subpatterns 
        rc = pcre_fullinfo(scratch->re, scratch->extra, 
            PCRE_INFO_CAPTURECOUNT, &scratch->numSubstr); 
        if (rc) { 
            snprintf(SQLUDF_MSGTX, 70, "Could not retrieve info " 
                "on pattern. (rc = %d)", rc); 
            strcpy(SQLUDF_STATE, "38901"); 
            rc = -1; 
            break; 
        } 
        // allocate memory for the substring indices 
        { 
            int size = (scratch->numSubstr+1)*3; 
            scratch->substr = (int *)malloc(size * sizeof(int)); 
            if (!scratch->substr) { 
                snprintf(SQLUDF_MSGTX, 70, "Could allocate memory for " 
                    "substring indices."); 
                strcpy(SQLUDF_STATE, "38902"); 
                rc = -1; 
                break; 
            } 
            memset(scratch->substr, 0, size * sizeof(int)); 
            // match the current string 
            rc = pcre_exec(scratch->re, scratch->extra, str->data, 
            str->length, 0, 0 /* default options */, 
                scratch->substr, size); 
        } 
        switch (rc) { 
          case PCRE_ERROR_BADOPTION: 
            snprintf(SQLUDF_MSGTX, 70, "An unrecognized bit was set " 
                "in the options argument"); 
            strcpy(SQLUDF_STATE, "38903"); 
            rc = -1; 
            break; 
          case PCRE_ERROR_NOMEMORY: 
            snprintf(SQLUDF_MSGTX, 70, "Not enough memory available."); 
            strcpy(SQLUDF_STATE, "38904"); 
            rc = -1; 
            break; 
          case PCRE_ERROR_NOMATCH: 
            scratch->currentSubstr = scratch->numSubstr + 1; 
            rc = 0; 
            break; 
          default: 
            if (rc < 0) { 
                snprintf(SQLUDF_MSGTX, 70, "A regexp match error " 
                    "occured: %d", rc); 
                strcpy(SQLUDF_STATE, "38905"); 
                rc = -1; 
                break; 
            } 
        } 
        break; 
 
      case SQLUDF_TF_FETCH: 
        // skip capturing substrings without a match 
        while (scratch->currentSubstr <= scratch->numSubstr && 
            (scratch->substr[2*scratch->currentSubstr] < 0 || 
                scratch->substr[2*scratch->currentSubstr+1] < 0)) { 
            scratch->currentSubstr++; 
        } 
        // no more data to be returned 
        if (scratch->currentSubstr > scratch->numSubstr) { 
            strcpy(SQLUDF_STATE, SQL_NODATA_EXCEPTION); 
            rc = 0; 
            break; 
        } 
        // get the current substring 
        *pos = scratch->currentSubstr; 
        length = scratch->substr[2*scratch->currentSubstr+1] - 
            scratch->substr[2*scratch->currentSubstr]; 
        strncpy(substr, str->data + scratch->substr[2*scratch->currentSubstr], 
            length); 
        substr[length] = '\0'; 
        scratch->currentSubstr++; 
    } 
 
    // cleanup in CLOSE call, or if we encountered an error in 
    // the OPEN call (DB2 will make a CLOSE call if we encounter 
    // an error in any FETCH call) 
    if (SQLUDF_CALLT == SQLUDF_TF_CLOSE || 
        (SQLUDF_CALLT == SQLUDF_TF_OPEN && rc < 0)) { 
        (*pcre_free)(scratch->re); 
        (*pcre_free)(scratch->extra); 
        free(scratch->substr); 
    } 
    return; 
}

You can also optimize performance as we did for the basic matching function by defining the table function with FINAL CALL. The C code must be modified to handle the SQLUDF_TF_FIRST and SQLUDF_TF_FINAL calls.

Usage examples

A table function can be used in a SELECT statement like the one below:

SELECT c1, str, num, substr 
FROM   strTable, 
       TABLE ( regex3('(\w+)\s*=\s*(\d+|''[\w\s]*'');', str) ) AS sub(num, substr)

The results include only the strings that have a matching pattern. For each string, the first and second captured substring is shown in a separate row.

C1          2                              NUM         4 
----------- ------------------------------ ----------- ----------------------- 
          2 variable = 1234;                         1 variable 
          2 variable = 1234;                         2 1234 
          3 var2 = 'string variable';                1 var2 
          3 var2 = 'string variable';                2 'string variable' 
 
  4 record(s) selected.

The next query returns the pairs of both substrings in separate columns in the result set instead of using separate rows. Thus, a further processing of the strings and their substrings in an SQL statement is easier. The query uses a common table expression (indicated by keyword WITH) to ensure that the evaluation of each string is only done once for the whole query, instead of once for each subselect that is necessary for the intermediate tables s1 and s2.

WITH substrings(c, num, substr) AS 
   ( SELECT c1, num, substr 
     FROM   strTable, 
            TABLE ( regex3('(\w+)\s*=\s*(\d+|''[\w\s]*'');', str) ) 
               AS sub(num, substr) ) 
SELECT t.c1, s1.substr AS variable, s2.substr AS value 
FROM   strTable AS t JOIN substrings AS s1 ON 
          ( t.c1 = s1.c ) JOIN 
       substrings AS s2 ON 
          ( t.c1 = s2.c ) 
WHERE  s1.num = 1 AND s2.num = 2

The same pattern as in the query before was used. Accordingly, the results can be derived from the above table, but this time the variable-value pairs are each in a single row as was requested.

C1          VARIABLE                       VALUE 
----------- ------------------------------ -------------------- 
          2 variable                       1234 
          3 var2                           'string variable' 
 
  2 record(s) selected.

Summary

In this article, I gave a (rather brief) introduction to regular expressions and the string comparison and matching functionality that is available in DB2. I also described why the power of regular expressions can be useful. UDFs can be used to provide regular expressions in DB2 in two forms. In the first form, basic matching, is done by comparing a string with the given pattern. The second form is to implement a table function that extracts the captured substrings from a regular expression and returns those substrings to DB2 so that they can be used in the further processing in the SQL statement. I also gave some hints on how to improve performance.


Disclaimer

This article contains sample code. IBM grants you ("Licensee") a non-exclusive, royalty free, license to use this sample code. However, the sample code is provided as-is and without any warranties, whether EXPRESS OR IMPLIED, INCLUDING ANY IMPLIED WARRANTY OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE OR NON-INFRINGEMENT. IBM AND ITS LICENSORS SHALL NOT BE LIABLE FOR ANY DAMAGES SUFFERED BY LICENSEE THAT RESULT FROM YOUR USE OF THE SOFTWARE. IN NO EVENT WILL IBM OR ITS LICENSORS BE LIABLE FOR ANY LOST REVENUE, PROFIT OR DATA, OR FOR DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL OR PUNITIVE DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF THE USE OF OR INABILITY TO USE SOFTWARE, EVEN IF IBM HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

Resources

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=13981
ArticleTitle=Bringing the Power of Regular Expression Matching to SQL
publish-date=01232003