Contents


Case-insensitive string comparisons with DB2 for Linux, UNIX, and Windows

Various ways to handle SQL strings in a case-insensitive way

Comments

The SQL standard requires that all strings stored in a relational database are treated in a case-sensitive manner when compared to each other. That means the strings 'Some String' and 'SOME STRING' are not identical and, thus, an equality comparison yields False for such strings. While this behavior is intuitively understood, there are situations where an application requires case-insensitive comparisons. Things become even more complicated if the data to be handled already exists in the DB2 database and you do not want to (or cannot) convert it to either upper-case or lower-case.

Remember, DB2 does not provide a global configuration parameter to switch between case-sensitivity and case-insensitivity. It is the responsibility of the programmer to handle this in the application. There are different techniques available. A previous article by Blair Adamache (see Related topics) describes how you can use views and triggers to provide a consistent (upper-case or lower-case only) view on mixed-case data to your application. It also covers how generated columns can be adopted in order to provide index-based access to upper-cased data. Since the previous article is still very much applicable to the current version of DB2, this article builds on the fundamental topics discussed in Blair's article and extends them even further. The focus of the current article is to show different ways to exploit DB2 indexes when querying strings in a case-insensitive manner.

I cover an array of topics related to case-insensitive string comparisons. The article first explains how to employ generated columns to store a copy of a string in upper-case. It goes on to discuss how the storage required for an upper-cased version of the strings can be avoided with DB2's index extensions. User-defined collating sequences can be used if you only want to have case-insensitive sort order and these are discussed thoroughly. In case you have to manage large text data, I cover how to manage CLOBs as well.

Generated columns

DB2 supports generated columns where the values for such columns are created by an SQL expression. The expression must adhere to a few rules that essentially guarantee that the generated values are deterministic and do not rely on other tables or other rows in the same table. Thus, very similar rules as those for check constraints apply.

It is possible to implement mathematical calculations, string operations like concatenation, or many other operations in a generation expression. DB2 evaluates the expression whenever one of the input parameters of the expression changes, for example a value in another column from which the expression derives a new value for the generated column. By combining generated columns with the DB2 built-in functions UPPER, UCASE, LOWER, and LCASE, you can add columns storing a single-case version of the strings. Listing 1 shows a simple example.

Listing 1. Example for a table with a generated column
CREATE TABLE t (
   id          INTEGER  NOT NULL  PRIMARY KEY,
   str         VARCHAR(500),
   ucase_str   VARCHAR(500)  GENERATED ALWAYS AS ( UPPER(str) )
)@

INSERT INTO t(id, str)
VALUES ( 1, 'Some String' )@

SELECT * FROM t@

ID          STR                                  UCASE_STR
----------- ------------------------------------ ------------------------------------
          1 Some String                          SOME STRING

  1 record(s) selected.

A very interesting feature of generated columns is that the DB2 optimizer analyzes if a query contains the generation expression. If that is the case, the optimizer considers a plan where the expression is routed to the generated column. For example, consider the query SELECT id FROM t WHERE UPPER(str) = 'abc'. The DB2 optimizer notices that there is a generated column with the generation expression UPPER(str) in the table, and it reroutes to the generated column UCASE_STR instead of the original column STR with the function applied on top of it. Figure 1 shows a screenshot of the access plan created by DB2 for this query. As you can see in the details for the TBSCAN operator, the query accesses the generated column. The relevant area is marked with yellow background.

Figure 1. Access plan for "SELECT id FROM t WHERE UPPER(str) = 'abc'"
Screenshot showing access plan
Screenshot showing access plan

This example already demonstrates how to use generated columns to query data in a case-insensitive manner. It is not even necessary that the application is aware of the additional column. Furthermore, such generated columns are like any other column, and you can create indexes on them. Any query using the generation expression not only gets the benefits from rerouting, but also from indexing.

Index extensions

DB2 provides an interface that allows you to define your own indexing mechanisms. While this feature was originally implemented to support the DB2 Spatial Extender, it can be used for many other purposes as well. I demonstrate how you can exploit extended indexes to get similar benefits as for indexed generated columns but without the impact that an additional column has: duplicating the data and storing it twice.

These days, the data duplication is not much of an issue if you only consider storage costs. However, any additional column adds to the maximum length of rows, and that length is limited by the page size of the tablespace in which the table is placed. Thus, it may not be possible to add a generated column in the first place, so index extensions can come to the rescue.

The concepts of index extensions were explained in detail in another article (see Related topics). Borrowing from this article, here are the major steps to be taken:

  1. You have to define a so-called key generator function (UDF) that takes as input a value to be stored in the indexed column and returns one or more index entries.
  2. Internally, all the generated index entries are stored in a B-Tree. Thus, the index entries will be sorted, and you should have an appropriate sort order for range queries.
  3. During query time, a range producer function (UDF) is invoked by DB2. That function generates one or more search ranges for scans on the internal B-Tree containing the previously generated index keys.
  4. An index extension object ties key generator, range producer, data type, and any index-specific (constant) parameters together.
  5. A user-defined predicate has to be attached to a UDF. The DB2 optimizer will consider scanning an extended index whenever the user-defined predicate is used in the where-clause of a query.
  6. An extended index has to be created on the columns in question.

Index extensions can only operate on distinct types or structured types. Therefore, the first task is to define such a type. Since the intention is to work with strings, you must define a distinct data type based on the built-in data type VARCHAR as in Listing 2. You can adjust the length for the base data type (here 4000) to your needs. Note that all columns on which you later want to build index must use this data type. If you have larger documents that you want to process, you may want to consider the CLOB data type. Later in the article, I explain how you can use CLOBs and also index them with index extensions.

Listing 2. Creating distinct type
CREATE DISTINCT TYPE String AS VARCHAR(2000) WITH COMPARISONS

The next steps are to implement the key generator and range producer functions as external UDFs. Since indexing is usually a performance-critical part, we wrote those functions in C code. Both functions are table UDFs. You will find the complete source code in the zip file coming with this article (see Download). Here, I only introduce briefly what each function does.

The key generator is tailored for index maintenance tasks and takes a string as input. As a result, it produces a table with a single row and a single column as result, and the one value in that table contains the first bytes of the input string in all upper-case. This result is the value stored in the internal B-Tree. It's length is limited to 500 bytes. As a result, the index usually does not contain the full string. The advantage of abbreviated index keys is that a single index page can store more keys and, thus, provide a higher fan-out, which may yield a lower index tree. But the consequence is that an index scan only returns candidates and not the exact result, and an exact comparison has to be done on those candidates after the index scan.

The range producer functions are required for index searches and they are very similar to the key generator. Each range producer also returns a single row only. But instead of creating a single index key, it creates a search range from its argument. The search range is defined by the start and stop keys (inclusive) with which an index scan on the internal B-Tree will be performed. I implemented the operators for equality comparison and between predicates for demonstration purposes. For equality comparisons, we require point ranges only. In other words, the start and stop keys for the index scan are identical. So we derive those start and stop keys from the search argument by folding (up to) the first 500 bytes of the argument to upper-case and using that value for the start and stop keys. For between comparisons, we already have a given start and stop value and fold those two to upper-case and take the first 500 bytes only. If necessary, you can easily add more range producer functions for other string comparison operators like less-than, greater-than and others. This excercise is left to the interested reader.

The key generator function and all range producer functions need to be compiled and linked into a shared library. You can either use the sample script sqllib/samples/c/bldrtn for that, or you execute the compile and link steps manually. On our Linux system, I used the steps in Listing 3. As a test, you can call the key generator and range producer functions in regular SELECT statements and verify their proper function. After all, those are just like any other table function.

Listing 3. Compiling C UDFs and installing shared library in sqllib/function/ directory
$ gcc -m64 -fpic -I$HOME/sqllib/include -Wall -c index-extension.c
$ gcc -m64 -fpic -shared -o index-extension index-extension.o \
        -Wl,-rpath,$HOME/sqllib/lib -L$HOME/sqllib/lib -ldb2
$ mv index-extension $HOME/sqllib/function

The next step is to define the index extension object. Listing 4 illustrates how this is done. First, the FROM SOURCE KEY clause specifies how index keys are to be created for a value of type String. The second part defines the SEARCH METHODS on those index keys. User-defined predicates attached to UDFs refer to those search methods and cause the respective range producer function to be invoked. DB2 uses the ranges returned by the range producer function, scans the index with those ranges, and returns all identified rows as possible candidates for the user-defined predicate for further evaluation. Note that the set of candidates includes false positives but also all correct results.

Listing 4. Defining index extension for case-insensitive string operations
CREATE INDEX EXTENSION iCaseIndex ( )
   -- index maintenance
   FROM SOURCE KEY ( str String )
      GENERATE KEY USING iCaseKeyGen(str)
   -- index search
   WITH TARGET KEY(keyStr VARCHAR(500))
   SEARCH METHODS
      WHEN iCaseEquals(str String)
         RANGE THROUGH iCaseRangeProd1(str)
      WHEN iCaseBetween(start String, stop String)
         RANGE THROUGH iCaseRangeProd2(start, stop)

The final piece necessary for the infrastructure are the UDFs with the user-defined predicates. Listing 5 summarizes how such a UDF is to be created in your database. The PREDICATES clause defines what should happen if the function is used in a predicate like strcasebetween(indexedColumn, 'startString', 'stopString') = 1. In such a case, DB2 shall consider any indexes on the column identified by parameter str1. If those indexes were created with the index extension iCaseIndex, the search method iCaseBetween will be applicable. In other words, DB2 will use the range producer function iCaseRangeProd2 to generate the ranges for the index scans on the extended index. If no index scan is used (because there is no matching extended index on the column identified by the first parameter, or if the DB2 optimizer does not consider an index scan as beneficial), then DB2 evaluates the RETURN clause for all matching rows. Since the index scan only returns candidates that may match the predicate, the RETURN clause is also evaluated for those candidates to produce the final and correct results.

Listing 5. Creating UDF with user-defined predicates
CREATE FUNCTION strcasebetween(str1 String, strStart String, strStop String)
   RETURNS INTEGER
   SPECIFIC strcasebetween
   LANGUAGE SQL  CONTAINS SQL
   DETERMINISTIC  NO EXTERNAL ACTION
   PREDICATES (
      WHEN = 1
      SEARCH BY INDEX EXTENSION iCaseIndex
         WHEN KEY(str1) USE iCaseBetween(strStart, strStop) )
   RETURN CASE
             WHEN UPPER(VARCHAR(str1)) BETWEEN UPPER(VARCHAR(strStart)) AND
                                               UPPER(VARCHAR(strStop))
             THEN 1
             ELSE 0
          END

Now that the index extension and user-defined predicates are defined, you can use this new functionality. You can create a new table, use the type String for a column in it, insert some values and finally query the table with the functions strcasecmp and strcasebetween. Listing 6 summarizes how to create a simple table, create an extended index and finally use that index in queries that do case-insensitive string comparisons. The results show the first and fourth row, which contain the strings 'abc' and 'ABC' and match the predicate in the where-clause where we search for string 'aBc' in a case-insensitive fashion.

Listing 6. Using extended index for case-insensitive string comparisons
CREATE TABLE t (
   id  INTEGER  NOT NULL  PRIMARY KEY  
          GENERATED ALWAYS AS IDENTITY,
   s   String
)@
INSERT INTO t(s) 
VALUES ('abc'), ('def'), ('abc001'), ('ABC')@

CREATE INDEX idx ON t(s) EXTEND USING iCaseIndex@

SELECT id FROM t 
WHERE strcasecmp(s, String('aBc')) = 1 
         SELECTIVITY 0.00000001@

ID
-----------
          1
          4

  2 record(s) selected.

Figure 2 shows the access plan so that you can verify that the newly created index, IDX, has been chosen to answer the query. The operator EISCAN stands for a scan over an extended index, and it is based on the index named IDX, which was created before with the SQL statements in Listing 6.

Figure 2. Access plan for "SELECT id FROM t WHERE strcasecmp(s, String('aBc')) = 1 SELECTIVITY 0.00000001"
Screenshot showing access plan
Screenshot showing access plan

User-defined collating sequences

DB2 uses collation sequences to determine the order in which strings are sorted. Each character (code point) in a code page gets a specific weight associated with it. The weights are only applicable if string values are used to bring the rows in a result set in a specific order, for example, if an ORDER BY clause occurs in the SQL statement. Collation sequences are composed of an array of 256 weights, one for each code point.

Each DB2 database has a single collation sequence, which can be defined when executing the CREATE DATABASE command. Thus, the collation sequence is set once and remains fixed during the life of a database. You can set the collation sequence with the COLLATE USING clause. Using the DB2 command line does not allow a user-defined collation sequence, however. You have to use the sqlcrea API in order to provide the list of weights to DB2 when creating a new database.

The function sqlcrea takes the name and alias of the database, along with the database path and a so-called database descriptor and territory information. The database descriptor can be used to define various properties for the database like database comment, tablespace definitions and also the collation sequence (amongst others). Taking advantage of this feature is straight forward. Based on the sample C++ program sqllib/samples/cpp/dbcreate.C in the DB2 instance directory, I wrote a small application that creates a new database and uses a collation sequence that treats a letter - either in upper-case or lower-case - with the same weight. Listing 7 shows an excerpt of the code.

Listing 7. Creating databases with user-defined collation sequence
// collation sequence treating letters in upper/lower case with same weight
static const unsigned char caseInsensitiveCollation[] = {
    ...
    // upper-case characters retain their original weight
    0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47,
    0x48, 0x49, 0x4A, 0x4B, 0x4C, 0x4D, 0x4E, 0x4F,
    0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57,
    0x58, 0x59, 0x5A, 0x5B, 0x5C, 0x5D, 0x5E, 0x5F,
    // lower-case characters (0x61..0x7A) have the same weight as upper-case
    // characters (0x41..0x5A)
    0x60, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47,
    0x48, 0x49, 0x4A, 0x4B, 0x4C, 0x4D, 0x4E, 0x4F,
    0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57,
    0x58, 0x59, 0x5A, 0x7B, 0x7C, 0x7D, 0x7E, 0x7F,
    ...
};


int main(int argc, char *argv[])
{
    ...

    // create database with all default parameters, except by using a
    // user-defined collation sequence
    struct sqledbdesc dbDescriptor;
    SQLEDBTERRITORYINFO territoryInfo;

    // initialize DB descriptor
    strcpy(dbDescriptor.sqldbdid, SQLE_DBDESC_2);
    dbDescriptor.sqldbcss = SQL_CS_USER;
    assert(SQL_CS_SZ == sizeof caseInsensitiveCollation);
    memcpy(dbDescriptor.sqldbudc, caseInsensitiveCollation, 
            sizeof caseInsensitiveCollation);
    strcpy(dbDescriptor.sqldbcmt, "db with user-defined collation sequence");
    ...
    strcpy(territoryInfo.sqldbcodeset, "ISO8859-1");
    strcpy(territoryInfo.sqldblocale, "C");

    std::cout << "Creating database with user-defined collation sequence...";
    sqlecrea(dbName, dbName, const_cast<char *>(""), &dbDescriptor,
            &territoryInfo, '\0', NULL, &sqlca);

    ...
}

The complete source code is included in the zip file attached to this article (see Download). You can compile this program with any suitable C++ compiler. I used the GNU compiler gcc on a Linux system to create the program. Following that, I used it to create a database named COLLDB. Listing 8 demonstrates those steps, along with some SQL statements executed against the newly created database to verify that there is no distinction between upper-case and lower-case letters in the sort order.

Listing 8. Sort order in database with user-defined collation sequence
$ g++ -o createDbCollSeq -I$HOME/sqllib/include createDbCollSeq.cpp \
        -L$HOME/sqllib/lib -ldb2
$ ./createDbCollSeq colldb
Attaching to instance 'stolze'...done.
Creating database with user-defined collation sequence...done.
Detaching from instance...done.
$ db2 -td@
db2 => CONNECT TO colldb@

   Database Connection Information

 Database server        = DB2/LINUXX8664 9.1.3
 SQL authorization ID   = STOLZE
 Local database alias   = COLLDB

db2 => CREATE TABLE t (
          id          INTEGER  NOT NULL  PRIMARY KEY,
          str         VARCHAR(50)
       )@
DB20000I  The SQL command completed successfully.

db2 => INSERT INTO t(id, str)
       VALUES ( 1, 'Some String' ),
              ( 2, 'a345' ),
              ( 3, 'A123' ),
              ( 4, 'abc' ),
              ( 5, 'SOME other string' ),
              ( 6, 'A' ),
              ( 7, 'a' )@
DB20000I  The SQL command completed successfully.

db2 => SELECT * FROM t ORDER BY str@

ID          STR
----------- ------------------------------
          6 A
          7 a
          3 A123
          2 a345
          4 abc
          5 SOME other string
          1 Some String

  7 record(s) selected.

db2 => SELECT * FROM t WHERE SUBSTR(str, 1, 1) = 'a'@

ID          STR
----------- --------------------------------------------------
          2 a345
          4 abc
          7 a

  3 record(s) selected.

Executing the same sequence of SQL statements as in Listing 8 against a database that uses the default collation sequence yields a completely different result for the first query. All those rows where the string value begins with lower-case letters (the rows with id 2, 4, and 7) are sorted after all other rows instead of being intermixed. The second query in Listing 8 demonstrates that the collation sequence only has an impact on sort order but not on comparison predicates. String comparisons are done regardless of any weight specifications and only the actual code points are compared. Thus, user-defined collation sequences are not applicable when querying your data in a case-insensitive manner and you have to resort to using generated columns or index extensions as explained before.

Handling of CLOB columns

The first two sections described how you deal with strings that are stored as VARCHARs in your DB2 database. If you want to manage strings that exceed the length limitations inherent to VARCHARs, you have but one option: store those strings as CLOBs. However, CLOBs introduce some other obstacles. First, there is no built-in equality comparison function, so you cannot compare whether the data of two CLOBs is identical. Another issue is that you cannot create an index extensions on CLOBs or distinct types derived from CLOBs. LOBs are always directly read from or written to a disk. Thus, performance of CLOBs tends to be worse than the performance of VARCHAR data, which is cached in the DB2 buffer pools.

Comparison functions on CLOBs can be implemented as external UDFs in a straight-forward manner. Such a UDF takes two CLOB values, compares the length of both LOBs, and uses system functions like memcmp to determine whether the data is also equal. Nevertheless, it still leaves you with the lack of indexing support, especially if you want to do perform case-insensitive string comparisons.

Generated columns for CLOBs

The first approach to create a generated column that stores an upper-case version of the string is only possible if you have a UDF that folds the CLOB data to upper-case. Unfortunately, you cannot create an index on such a column. However, you could extract a portion of the string, store it as VARCHAR, and then index that portion only. This will help you to trim down on query execution times if you first compare this portion for equality, and, if it is equal, compare the whole CLOB. Listing 9 illustrates how such a generated column has to be created.

Listing 9. Creating generated columns on CLOBs
CREATE TABLE clob_data (
   id     INTEGER  NOT NULL  PRIMARY KEY,
   data   CLOB(1m),
   head   VARCHAR(500) GENERATED ALWAYS AS ( UPPER(VARCHAR(data, 500)) )
)@

...

SELECT *
FROM   clob_data
WHERE  UPPER(VARCHAR(data, 500)) = UPPER(VARCHAR(search_arg, 500)) AND
       ClobCaseCmp(data, search_arg) = 1

In order to take advantage of that column, it is not sufficient to apply the UDF ClobCaseCmp, which does a case-insensitive comparison of two CLOB values, but you also have to use the generation expression (or the generated column itself). Naturally, this is not a very desirable approach because it makes queries more complicated.

Index extensions for CLOBs

Index extensions as described in a section above offer another way to deal with CLOBs. One requirement of index extensions is that they can only be used with distinct types or structured types. You can take advantage of that with structured types and earn some additional benefits along the way. First, the string data type is created in Listing 10 as structured type. It is comprised of two attributes: one attribute to hold the actual CLOB data, and the other attribute contains only the first 500 bytes of the CLOB - folded to upper-case already. In order to ensure consistency between both attributes, a constructor function is also created.

Listing 10. Creating generated columns on CLOBs
CREATE TYPE String
   AS (
      data  CLOB(1M),
      head  VARCHAR(500) )
   WITHOUT COMPARISONS
   MODE DB2SQL@

CREATE FUNCTION String(str CLOB(1M))
   RETURNS String
   LANGUAGE SQL  CONTAINS SQL
   DETERMINISTIC  NO EXTERNAL ACTION   
   RETURN String()..data(str)..head(UPPER(VARCHAR(SUBSTR(str, 1, 500))))@

Compared to a pure CLOB storage, our structured type has another advantage for small documents. Each structured type has a so-called inline-length associated with it. For all values whose size does not exceed this length, DB2 chooses a storage of the value that is similar to VARCHAR FOR BIT DATA, for example, the value will be stored inline and go through the DB2 buffer pool. For larger values, the LOB storage mechanisms is exploited. Thus, wrapping a LOB in a structured type may result in better performance if many LOBs are small and disk access can be avoided. A more detailed explanation on the inline length can be found in the article "DB2 Spatial Extender performance tuning" (see Related topics).

The index extension can now be created on this type pretty much as you did before for the VARCHAR-based data type. The only difference is that you create the index keys based on the value stored in the attribute "head". The actual implementation of the key generator and range producer functions does not have to be changed at all. Only the CREATE INDEX EXTENSION statement is affected.

Like the VARCHAR strings, I implemented two functions to compare two strings and attached the user-defined predicates to those functions. While it was possible to write those functions with pure SQL before, you now have to resort to external code. You deal now with structured types, so it becomes necessary to add a transform function to let DB2 know how values of this structured type are to be passed to the external code. You can find the completed code in the file index-extension.structured.sql (see Download).

Summary

This article gave you an introduction to how case-insensitive string operations can be implemented in DB2. Generated columns, in conjunction with views, deliver a viable way to hide the fact that SQL treats all strings as being case-sensitive, while still supporting indexes for efficient queries. Another approach can be found with index extensions where you don't incur the storage overhead of generated columns. In case you are only interested in sorting your string data case-insensitive, but treating it as case-sensitive otherwise, you may want to consider a user-defined collation sequence. Finally, the article gave a short summary on how to deal with longer strings that cannot be stored as VARCHAR. You can use CLOBs instead, at the expense of a slightly more complex setup.


Downloadable resources


Related topics


Comments

Sign in or register to add and subscribe to comments.

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=277092
ArticleTitle=Case-insensitive string comparisons with DB2 for Linux, UNIX, and Windows
publish-date=12132007