 | Level: Intermediate Knut Stolze (stolze@de.ibm.com), DB2 z/OS Utilities Development, IBM Germany
13 Dec 2007 IBM® DB2® for Linux®, Unix®, and
Windows® treats all strings stored in a database as case-sensitive. Some
applications can -- or even must -- ignore the case of letters in strings and just
want to retrieve data based on some given, mixed-case input. This article gives an
overview of techniques that can be employed to implement case-insensitive string comparisons in SQL, while still allowing index-support for such queries. Generated columns, extended indexes, and user-defined collating sequences are described in detail.
Introduction
The SQL standard requires that all strings stored in a relational database are
treated in a case-sensitive manner when compared to each other. That means the
strings 'Some String' and 'SOME
STRING' are not identical and, thus, an equality comparison yields False for such
strings. While this behavior is intuitively understood, there are situations where
an application requires case-insensitive comparisons. Things become even more complicated if the data to be handled already exists in the DB2 database and you do not want to (or cannot) convert it to either upper-case or lower-case.
Remember, DB2 does not provide a global configuration parameter to switch between
case-sensitivity and case-insensitivity. It is the responsibility of the programmer
to handle this in the application. There are different techniques available. A
previous article by Blair Adamache (see Resources) describes
how you can use views and triggers to provide a consistent (upper-case or lower-case
only) view on mixed-case data to your application. It also covers how generated
columns can be adopted in order to provide index-based access to upper-cased data.
Since the previous article is still very much applicable to the current version of
DB2, this article builds on the fundamental topics discussed in Blair's article and extends
them even further. The focus of the current article is to show different ways to exploit DB2 indexes when querying strings in a case-insensitive manner.
I cover an array of topics related to case-insensitive string
comparisons. The article first explains how to employ generated columns to store a copy of a
string in upper-case. It goes on to discuss how the storage required for an
upper-cased version of the strings can be avoided with DB2's index extensions.
User-defined collating sequences can be used if you only want to have
case-insensitive sort order and these are discussed thoroughly. In case you have to
manage large text data, I cover how to manage CLOBs as well.
Generated
columns
DB2 supports generated columns where the values for such columns are created by an
SQL expression. The expression must adhere to a few rules that essentially guarantee
that the generated values are deterministic and do not rely on other tables or other
rows in the same table. Thus, very similar rules as those for check constraints apply.
It is possible to implement mathematical calculations, string operations like
concatenation, or many other operations in a generation expression. DB2 evaluates
the expression whenever one of the input parameters of the expression changes, for
example a value in another column from which the expression derives a new value for
the generated column. By combining generated columns with the DB2 built-in functions UPPER, UCASE, LOWER, and LCASE, you can add columns storing a single-case version of the strings. Listing 1 shows a simple example.
Listing 1. Example for a table with a generated column
CREATE TABLE t (
id INTEGER NOT NULL PRIMARY KEY,
str VARCHAR(500),
ucase_str VARCHAR(500) GENERATED ALWAYS AS ( UPPER(str) )
)@
INSERT INTO t(id, str)
VALUES ( 1, 'Some String' )@
SELECT * FROM t@
ID STR UCASE_STR
----------- ------------------------------------ ------------------------------------
1 Some String SOME STRING
1 record(s) selected.
|
A very interesting feature of generated columns is that the DB2 optimizer analyzes
if a query contains the generation expression. If that is the case, the optimizer
considers a plan where the expression is routed to the generated column. For
example, consider the query SELECT id FROM t WHERE UPPER(str) =
'abc'. The DB2 optimizer notices that there is a generated column with the generation expression UPPER(str) in the table, and it reroutes to the generated column UCASE_STR instead of the original column STR with the function applied on top of it. Figure 1 shows a screenshot of the access plan created by DB2 for this query. As you can see in the details for the TBSCAN operator, the query accesses the generated column. The relevant area is marked with yellow background.
Figure 1. Access plan for "SELECT id FROM t WHERE UPPER(str) = 'abc'"
This example already demonstrates how to use generated columns to query data in a
case-insensitive manner. It is not even necessary that the application is aware of
the additional column. Furthermore, such generated columns are like any other
column, and you can create indexes on them. Any query using the generation expression not only gets the benefits from rerouting, but also from indexing.
Index extensions
DB2 provides an interface that allows you to define your own indexing mechanisms.
While this feature was originally implemented to support the DB2 Spatial Extender, it
can be used for many other purposes as well. I demonstrate how you can exploit extended indexes to get similar benefits as for indexed generated columns but without the impact that an additional column has: duplicating the data and storing it twice.
These days, the data duplication is not much of an issue if you only consider
storage costs. However, any additional column adds to the maximum length of rows,
and that length is limited by the page size of the tablespace in which the table is
placed. Thus, it may not be possible to add a generated column in the first place,
so index extensions can come to the rescue.
The concepts of index extensions were explained in detail in another article (see Resources). Borrowing from this article, here are the major steps to be taken:
- You have to define a so-called key generator function (UDF) that takes as input a value to be stored in the indexed column and returns one or more index entries.
- Internally, all the generated index entries are stored in a B-Tree. Thus, the
index entries will be sorted, and you should have an appropriate sort order for range queries.
- During query time, a range producer function (UDF) is invoked by DB2. That
function generates one or more search ranges for scans on the internal B-Tree
containing the previously generated index keys.
- An index extension object ties key generator, range producer, data type, and any index-specific (constant) parameters together.
- A user-defined predicate has to be attached to a UDF. The DB2 optimizer will consider scanning an extended index whenever the user-defined predicate is used in the where-clause of a query.
- An extended index has to be created on the columns in question.
Index extensions can only operate on distinct types or structured types. Therefore,
the first task is to define such a type. Since the intention is to work with strings,
you must define a distinct data type based on the built-in data type VARCHAR as in Listing 2. You can adjust the length for the base data type (here
4000) to your needs. Note that all columns on which you later want to build index
must use this data type. If you have larger documents that you want to process, you
may want to consider the CLOB data type. Later in the article, I explain how you can use CLOBs and also index them with index extensions.
Listing 2. Creating distinct type
CREATE DISTINCT TYPE String AS VARCHAR(2000) WITH COMPARISONS
|
The next steps are to implement the key generator and range producer functions as
external UDFs. Since indexing is usually a performance-critical part, we wrote those
functions in C code. Both functions are table UDFs. You will find the complete
source code in the zip file coming with this article (see Download). Here, I only introduce briefly what each function does.
The key generator is tailored for index maintenance tasks and takes a string as
input. As a result, it produces a table with a single row and a single column as result, and the one value in that table contains the first bytes of the input string in all upper-case. This result is the value stored in the internal B-Tree. It's length is limited to 500 bytes. As a result, the index usually does not contain the full string. The advantage of abbreviated index keys is that a single index page can store more keys and, thus, provide a higher fan-out, which may yield a lower index tree. But the consequence is that an index scan only returns candidates and not the exact result, and an exact comparison has to be done on those candidates after the index scan.
The range producer functions are required for index searches and they are very
similar to the key generator. Each range producer also returns a single row only.
But instead of creating a single index key, it creates a search range from its
argument. The search range is defined by the start and stop keys (inclusive) with
which an index scan on the internal B-Tree will be performed. I implemented the
operators for equality comparison and between predicates for demonstration purposes.
For equality comparisons, we require point ranges only. In other words, the start
and stop keys for the index scan are identical. So we derive those start and stop
keys from the search argument by folding (up to) the first 500 bytes of the argument
to upper-case and using that value for the start and stop keys. For between
comparisons, we already have a given start and stop value and fold those two to
upper-case and take the first 500 bytes only. If necessary, you can easily add more
range producer functions for other string comparison operators like less-than,
greater-than and others. This excercise is left to the interested reader.
The key generator function and all range producer functions need to be compiled and
linked into a shared library. You can either use the sample script sqllib/samples/c/bldrtn for that, or you execute the compile and link steps manually. On our Linux system, I used the steps in Listing 3. As a test, you can call the key generator and range producer functions in regular SELECT statements and verify their proper function. After all, those are just like any other table function.
Listing 3. Compiling C UDFs and installing shared library in sqllib/function/ directory
$ gcc -m64 -fpic -I$HOME/sqllib/include -Wall -c index-extension.c
$ gcc -m64 -fpic -shared -o index-extension index-extension.o \
-Wl,-rpath,$HOME/sqllib/lib -L$HOME/sqllib/lib -ldb2
$ mv index-extension $HOME/sqllib/function
|
The next step is to define the index extension object. Listing
4 illustrates how this is done. First, the FROM SOURCE KEY
clause specifies how index keys are to be created for a value of type String. The
second part defines the SEARCH METHODS on those index
keys. User-defined predicates attached to UDFs refer to those search methods and
cause the respective range producer function to be invoked. DB2 uses the ranges
returned by the range producer function, scans the index with those ranges, and returns all identified rows as possible candidates for the user-defined predicate for further evaluation. Note that the set of candidates includes false positives but also all correct results.
Listing 4. Defining index extension for case-insensitive string operations
CREATE INDEX EXTENSION iCaseIndex ( )
-- index maintenance
FROM SOURCE KEY ( str String )
GENERATE KEY USING iCaseKeyGen(str)
-- index search
WITH TARGET KEY(keyStr VARCHAR(500))
SEARCH METHODS
WHEN iCaseEquals(str String)
RANGE THROUGH iCaseRangeProd1(str)
WHEN iCaseBetween(start String, stop String)
RANGE THROUGH iCaseRangeProd2(start, stop)
|
The final piece necessary for the infrastructure are the UDFs with the user-defined
predicates. Listing 5 summarizes how such a UDF is to be
created in your database. The PREDICATES clause defines
what should happen if the function is used in a predicate like strcasebetween(indexedColumn, 'startString', 'stopString') = 1. In
such a case, DB2 shall consider any indexes on the column identified by parameter
str1. If those indexes were created with the index extension iCaseIndex, the search
method iCaseBetween will be applicable. In other words, DB2 will use the range producer function iCaseRangeProd2 to generate the ranges for the index scans on the extended index. If no index scan is used (because there is no matching extended index on the column identified by the first parameter, or if the DB2 optimizer does not consider an index scan as beneficial), then DB2 evaluates the RETURN clause for all matching rows. Since the index scan only returns candidates that may match the predicate, the RETURN clause is also evaluated for those candidates to produce the final and correct results.
Listing 5. Creating UDF with user-defined predicates
CREATE FUNCTION strcasebetween(str1 String, strStart String, strStop String)
RETURNS INTEGER
SPECIFIC strcasebetween
LANGUAGE SQL CONTAINS SQL
DETERMINISTIC NO EXTERNAL ACTION
PREDICATES (
WHEN = 1
SEARCH BY INDEX EXTENSION iCaseIndex
WHEN KEY(str1) USE iCaseBetween(strStart, strStop) )
RETURN CASE
WHEN UPPER(VARCHAR(str1)) BETWEEN UPPER(VARCHAR(strStart)) AND
UPPER(VARCHAR(strStop))
THEN 1
ELSE 0
END
|
Now that the index extension and user-defined predicates are defined, you can use
this new functionality. You can create a new table, use the type String for a column in it, insert some values and finally query the table with the functions strcasecmp and strcasebetween. Listing 6 summarizes how to create a simple table, create an extended index and finally use that index in queries that do case-insensitive string comparisons. The results show the first and fourth row, which contain the strings 'abc' and 'ABC' and match the predicate in the where-clause where we search for string 'aBc' in a case-insensitive fashion.
 |
Influencing use of extended indexes
The DB2 optimizer is a cost-based query optimizer and it chooses the cheapest execution
plan for a given SQL statement. If the SQL statement may benefit from extended indexes,
you can influence the DB2 optimizer in various ways, for example by specifying the SELECTIVITY as we have done in Listing 6.
Please refer to the article "DB2 Index Extensions by example and in detail" (see Resources) for a more thorough explanation.
|
|
Listing 6. Using extended index for case-insensitive string comparisons
CREATE TABLE t (
id INTEGER NOT NULL PRIMARY KEY
GENERATED ALWAYS AS IDENTITY,
s String
)@
INSERT INTO t(s)
VALUES ('abc'), ('def'), ('abc001'), ('ABC')@
CREATE INDEX idx ON t(s) EXTEND USING iCaseIndex@
SELECT id FROM t
WHERE strcasecmp(s, String('aBc')) = 1
SELECTIVITY 0.00000001@
ID
-----------
1
4
2 record(s) selected.
|
Figure 2 shows the access plan so that you can verify that the
newly created index, IDX, has been chosen to answer the query. The operator EISCAN
stands for a scan over an extended index, and it is based on the index named IDX,
which was created before with the SQL statements in Listing 6.
Figure 2. Access plan for "SELECT id FROM t WHERE strcasecmp(s, String('aBc')) = 1 SELECTIVITY 0.00000001"
User-defined
collating sequences
DB2 uses collation sequences to determine the order in which strings are sorted.
Each character (code point) in a code page gets a specific weight associated with it.
The weights are only applicable if string values are used to bring the rows in a
result set in a specific order, for example, if an ORDER BY clause occurs in the SQL statement. Collation sequences are composed of an array of 256 weights, one for each code point.
Each DB2 database has a single collation sequence, which can be defined when
executing the CREATE DATABASE command. Thus, the collation sequence is set once and remains fixed during the life of a database. You can set the collation sequence with the COLLATE USING clause. Using the DB2 command line does not allow a user-defined collation sequence, however. You have to use the sqlcrea API in order to provide the list of weights to DB2 when creating a new database.
The function sqlcrea takes the name and alias of the
database, along with the database path and a so-called database descriptor and territory information. The database descriptor can be used to define various properties for the database like database comment, tablespace definitions and also the collation sequence (amongst others). Taking advantage of this feature is straight forward. Based on the sample C++ program sqllib/samples/cpp/dbcreate.C in the DB2 instance directory, I wrote a small application that creates a new database and uses a collation sequence that treats a letter - either in upper-case or lower-case - with the same weight. Listing 7 shows an excerpt of the code.
Listing 7. Creating databases with user-defined collation sequence
// collation sequence treating letters in upper/lower case with same weight
static const unsigned char caseInsensitiveCollation[] = {
...
// upper-case characters retain their original weight
0x40, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47,
0x48, 0x49, 0x4A, 0x4B, 0x4C, 0x4D, 0x4E, 0x4F,
0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57,
0x58, 0x59, 0x5A, 0x5B, 0x5C, 0x5D, 0x5E, 0x5F,
// lower-case characters (0x61..0x7A) have the same weight as upper-case
// characters (0x41..0x5A)
0x60, 0x41, 0x42, 0x43, 0x44, 0x45, 0x46, 0x47,
0x48, 0x49, 0x4A, 0x4B, 0x4C, 0x4D, 0x4E, 0x4F,
0x50, 0x51, 0x52, 0x53, 0x54, 0x55, 0x56, 0x57,
0x58, 0x59, 0x5A, 0x7B, 0x7C, 0x7D, 0x7E, 0x7F,
...
};
int main(int argc, char *argv[])
{
...
// create database with all default parameters, except by using a
// user-defined collation sequence
struct sqledbdesc dbDescriptor;
SQLEDBTERRITORYINFO territoryInfo;
// initialize DB descriptor
strcpy(dbDescriptor.sqldbdid, SQLE_DBDESC_2);
dbDescriptor.sqldbcss = SQL_CS_USER;
assert(SQL_CS_SZ == sizeof caseInsensitiveCollation);
memcpy(dbDescriptor.sqldbudc, caseInsensitiveCollation,
sizeof caseInsensitiveCollation);
strcpy(dbDescriptor.sqldbcmt, "db with user-defined collation sequence");
...
strcpy(territoryInfo.sqldbcodeset, "ISO8859-1");
strcpy(territoryInfo.sqldblocale, "C");
std::cout << "Creating database with user-defined collation sequence...";
sqlecrea(dbName, dbName, const_cast<char *>(""), &dbDescriptor,
&territoryInfo, '\0', NULL, &sqlca);
...
}
|
The complete source code is included in the zip file attached to this article (see
Download). You can compile this program with any suitable
C++ compiler. I used the GNU compiler gcc on a Linux system to create the program.
Following that, I used it to create a database named COLLDB. Listing 8 demonstrates those steps, along with some SQL statements executed against the newly created database to verify that there is no distinction between upper-case and lower-case letters in the sort order.
Listing 8. Sort order in database with user-defined collation sequence
$ g++ -o createDbCollSeq -I$HOME/sqllib/include createDbCollSeq.cpp \
-L$HOME/sqllib/lib -ldb2
$ ./createDbCollSeq colldb
Attaching to instance 'stolze'...done.
Creating database with user-defined collation sequence...done.
Detaching from instance...done.
$ db2 -td@
db2 => CONNECT TO colldb@
Database Connection Information
Database server = DB2/LINUXX8664 9.1.3
SQL authorization ID = STOLZE
Local database alias = COLLDB
db2 => CREATE TABLE t (
id INTEGER NOT NULL PRIMARY KEY,
str VARCHAR(50)
)@
DB20000I The SQL command completed successfully.
db2 => INSERT INTO t(id, str)
VALUES ( 1, 'Some String' ),
( 2, 'a345' ),
( 3, 'A123' ),
( 4, 'abc' ),
( 5, 'SOME other string' ),
( 6, 'A' ),
( 7, 'a' )@
DB20000I The SQL command completed successfully.
db2 => SELECT * FROM t ORDER BY str@
ID STR
----------- ------------------------------
6 A
7 a
3 A123
2 a345
4 abc
5 SOME other string
1 Some String
7 record(s) selected.
db2 => SELECT * FROM t WHERE SUBSTR(str, 1, 1) = 'a'@
ID STR
----------- --------------------------------------------------
2 a345
4 abc
7 a
3 record(s) selected.
|
Executing the same sequence of SQL statements as in Listing
8 against a database that uses the default collation sequence yields a completely
different result for the first query. All those rows where the string value begins
with lower-case letters (the rows with id 2, 4, and 7) are sorted after all other
rows instead of being intermixed. The second query in Listing
8 demonstrates that the collation sequence only has an impact on sort order but not on
comparison predicates. String comparisons are done regardless of any weight
specifications and only the actual code points are compared. Thus, user-defined
collation sequences are not applicable when querying your data in a case-insensitive
manner and you have to resort to using generated columns or index extensions as explained before.
Handling of CLOB columns
The first two sections described how you deal with strings that are stored as
VARCHARs in your DB2 database. If you want to manage strings that exceed the length
limitations inherent to VARCHARs, you have but one option: store those strings as
CLOBs. However, CLOBs introduce some other obstacles. First, there is no built-in
equality comparison function, so you cannot compare whether the data of two CLOBs is
identical. Another issue is that you cannot create an index extensions on CLOBs or
distinct types derived from CLOBs. LOBs are always directly read from or written to
a disk. Thus, performance of CLOBs tends to be worse than the performance of VARCHAR data, which is cached in the DB2 buffer pools.
Comparison functions on CLOBs can be implemented as external UDFs in a
straight-forward manner. Such a UDF takes two CLOB values, compares the length of
both LOBs, and uses system functions like memcmp to
determine whether the data is also equal. Nevertheless, it still leaves you with the
lack of indexing support, especially if you want to do perform case-insensitive string comparisons.
Generated columns for CLOBs
The first approach to create a generated column that stores an upper-case version of
the string is only possible if you have a UDF that folds the CLOB data to upper-case.
Unfortunately, you cannot create an index on such a column. However, you could extract a portion of the string, store it as VARCHAR, and then index that portion only. This will help you to trim down on query execution times if you first compare this portion for equality, and, if it is equal, compare the whole CLOB. Listing 9 illustrates how such a generated column has to be created.
Listing 9. Creating generated columns on CLOBs
CREATE TABLE clob_data (
id INTEGER NOT NULL PRIMARY KEY,
data CLOB(1m),
head VARCHAR(500) GENERATED ALWAYS AS ( UPPER(VARCHAR(data, 500)) )
)@
...
SELECT *
FROM clob_data
WHERE UPPER(VARCHAR(data, 500)) = UPPER(VARCHAR(search_arg, 500)) AND
ClobCaseCmp(data, search_arg) = 1
|
In order to take advantage of that column, it is not sufficient to apply the UDF ClobCaseCmp, which does a case-insensitive comparison of two CLOB values, but you also have to use the generation expression (or the generated column itself). Naturally, this is not a very desirable approach because it makes queries more complicated.
Index extensions for CLOBs
Index extensions as described in a section above offer another way to deal with
CLOBs. One requirement of index extensions is that they can only be used with
distinct types or structured types. You can take advantage of that with structured
types and earn some additional benefits along the way. First, the string data type
is created in Listing 10 as structured type. It is
comprised of two attributes: one attribute to hold the actual CLOB data, and the
other attribute contains only the first 500 bytes of the CLOB - folded to upper-case
already. In order to ensure consistency between both attributes, a constructor
function is also created.
Listing 10. Creating generated columns on CLOBs
CREATE TYPE String
AS (
data CLOB(1M),
head VARCHAR(500) )
WITHOUT COMPARISONS
MODE DB2SQL@
CREATE FUNCTION String(str CLOB(1M))
RETURNS String
LANGUAGE SQL CONTAINS SQL
DETERMINISTIC NO EXTERNAL ACTION
RETURN String()..data(str)..head(UPPER(VARCHAR(SUBSTR(str, 1, 500))))@
|
Compared to a pure CLOB storage, our structured type has another advantage for small
documents. Each structured type has a so-called inline-length associated with
it. For all values whose size does not exceed this length, DB2 chooses a storage of
the value that is similar to VARCHAR FOR BIT DATA, for example, the value will be stored inline and go through the DB2 buffer pool. For larger values, the LOB storage mechanisms is exploited. Thus, wrapping a LOB in a structured type may result in better performance if many LOBs are small and disk access can be avoided. A more detailed explanation on the inline length can be found in the article "DB2 Spatial Extender performance tuning" (see Resources).
The index extension can now be created on this type pretty much as you did before
for the VARCHAR-based data type. The only difference is that you create the index keys based on the value stored in the attribute "head". The actual implementation of the key generator and range producer functions does not have to be changed at all. Only the CREATE INDEX EXTENSION statement is affected.
Like the VARCHAR strings, I implemented two functions to compare two strings and
attached the user-defined predicates to those functions. While it was possible to
write those functions with pure SQL before, you now have to resort to external code.
You deal now with structured types, so it becomes necessary to add a transform
function to let DB2 know how values of this structured type are to be passed to the external code. You can find the completed code in the file index-extension.structured.sql (see Download).
Summary
This article gave you an introduction to how case-insensitive string operations can
be implemented in DB2. Generated columns, in conjunction with views, deliver a
viable way to hide the fact that SQL treats all strings as being case-sensitive,
while still supporting indexes for efficient queries. Another approach can be found
with index extensions where you don't incur the storage overhead of generated
columns. In case you are only interested in sorting your string data
case-insensitive, but treating it as case-sensitive otherwise, you may want to
consider a user-defined collation sequence. Finally, the article gave a short
summary on how to deal with longer strings that cannot be stored as VARCHAR. You can use CLOBs instead, at the expense of a slightly more complex setup.
Download | Description | Name | Size | Download method |
|---|
| Sample SQL scripts | icase.zip | 8KB | HTTP |
|---|
Resources
About the author  | 
|  | Knut started his work with DB2 when he joined IBM as a
visiting scientist at the Silicon Valley Lab, USA, in 1999, where he worked on
the DB2 Image Extender. He moved on to the DB2 Spatial Extender Development
team and was responsible for several enhancements to improve usability,
performance, and standard-conformance of the Extender. From 2002 through 2006
he was a PhD student and teaching assistant at the University of Jena,
Germany. At the same time, he continued to work for IBM on the
Information Integrator development. Today, Knut is a member of the
development team responsible for the Admin Enablement functionality on DB2 for
z/OS. He can be reached through the newsgroups comp.databases.ibm-db2 and
ibm.software.db2.udb.spatial or through email at stolze@de.ibm.com. |
Rate this page
|  |