The following example should give you a clearer understanding of case-insensitive
search. Listing 1 defines a table with an
INTEGER column and an
XML column and inserts 7 rows into that table. Each row contains a small customer document containing an XML element
The values in this element differ in terms of upper and lower case. Some values are all upper case, some are all lower case, and some are mixed case ("First character upper case"). This can happen if this data has come from a variety of applications that use different rules for upper and lower case data entry.
Listing 1. Sample table and data
CREATE TABLE customer (id INTEGER, xmldoc XML); INSERT INTO customer (id, xmldoc) VALUES (1,'<Customer id="1"><city>PARIS</city></Customer>'), (2,'<Customer id="2"><city>Tokyo</city></Customer>'), (3,'<Customer id="3"><city>tokyo</city></Customer>'), (4,'<Customer id="4"><city>PARIS</city></Customer>'), (5,'<Customer id="5"><city>paris</city></Customer>'), (6,'<Customer id="6"><city>Delhi</city></Customer>'), (7,'<Customer id="7"><city>Paris</city></Customer>');
An application that queries these XML documents to find customers with a specific
city most likely requires case-insensitive search. For example, you may want to
find all customers in Paris and hope to retrieve rows 1, 4, 5, and 7. However, if
you search for the value "Paris," only row 7 is returned. To catch all four rows
that you require, you can use the XQuery function
fn:upper-case() to convert the city element values to upper case and
compare them to "PARIS." This is shown in the query in Listing
2 which returns all four customers in Paris.
Listing 2. Select customers in Paris
SELECT id, XMLCAST( XMLQUERY('$XMLDOC/Customer/city') AS VARCHAR(15)) AS city FROM customer WHERE XMLEXISTS('$XMLDOC/Customer[fn:upper-case(city) = "PARIS"]');
If your query uses a parameter marker for the search value, then this parameter
should also be converted to upper case. This is shown in Listing
3. The parameter
marker ("?") is typed as
VARCHAR(15) and passed into the
XQuery predicate as the variable "c."
Listing 3. Select customers using a parameter marker
SELECT id, XMLCAST( XMLQUERY('$XMLDOC/Customer/city') AS VARCHAR(15)) AS city FROM customer WHERE XMLEXISTS('$XMLDOC/Customer[fn:upper-case(city) = fn:upper-case($c)]' PASSING CAST(? AS VARCHAR(15)) AS "c");
Figure 1 shows the output of the sample queries above.
Figure 1. Result of sample query
All this works fine if you are querying only a small amount of data or if your queries also contain other selective predicates such that the upper-case predicate is applied only to a small intermediate result set. The problem is that the predicate with the
fn:upper-case() function prevents the use of XML indexes in DB2. Hence, this approach is not suitable for large amounts of data.
To avoid the use of the
fn:upper-case() function and exploit the benefits of XML indexes, it is necessary to create a case-insensitive database.
DB2 supports locale-aware Unicode collation since Version 9.5 Fixpack 1. This allows
you to ignore case and/or accents. To create a database that is case-insensitive
for all string comparisons, it is necessary to use the collation
UCA500R1 as shown in Listing 4.
Listing 4. Create a case-insensitive database
CREATE DATABASE testdb USING CODESET UTF-8 TERRITORY US COLLATE USING UCA500R1_LEN_S2;
What does the string
UCA500R1_LEN_S2 mean exactly?
UCA500R1 specifies that the default Unicode Collation Algorithm (UCA)
based on the Unicode standard version 5.0.0 is used in this database. Since the default UCA cannot cover the collating sequence of every language supported by Unicode at the same time, the ordering of characters can be customized using optional attributes. The attributes are separated by an underscore (_). The
UCA500R1 keyword plus any attributes form a UCA collation name.
The collation name used in Listing 4 contains two
LEN is the concatenation of
EN (ISO 639-1 language code for English). The second
S2 specifies the strength level which
determines whether case or accent is taken into account when ordering or comparing
strings. In Listing 4 strength level 2 is used so that
"PARIS" and "paris" are equal. Following are examples of other possible values:
UCA500R1_LEN_S1will collate "cliche" = "Cliche" = "cliché"
UCA500R1_LEN_S2will collate "cliche" = "Cliche" < "cliché"
UCA500R1_LEN_S3will collate "cliche" < "Cliche" < "cliché"
A list of all possible combinations to use as a UCA collation name can be found in the DB2 Information Center (see Resources).
Since the database was created using the collation name UCA500R1 with strength level
2, the previous query can now simply be written as if all data would actually be in
upper case, without the function
fn:upper-case() (Listing 5). It does not even matter whether the search string is "Paris" or "PARIS" or any other combination of upper and lower case.
Listing 5. Select customers in Paris
SELECT id, XMLCAST( XMLQUERY('$XMLDOC/Customer/city') AS VARCHAR(15)) AS city FROM customer WHERE XMLEXISTS('$XMLDOC/Customer[city = "PARIS"]');
Figure 2. Result of sample query
If you add an
ORDER BY clause, to order by the extracted
city value, the result set still remains the same: PARIS, paris, and Paris are treated as the same value.
To query this data efficiently, especially if the number of rows in the table is
large, you should create an XML index on the XPath
/Customer/city, as Listing 6 shows:
Listing 6. Create an XML index
CREATE INDEX customer_lang_idx ON test (xmldoc) GENERATE KEY USING XMLPATTERN '/Customer/city' AS SQL VARCHAR(15);
Now, if you explain the query (with Visual Explain or db2exfmt), you see that the index is used for the case-insensitive search:
Figure 3. Explain Plan for querying all customers in Paris in a case-insensitive database
The potential drawback of the approach described in this section is that all the data in all the columns in all tables in the entire database are treated in a case-insensitive manner. It is not possible to restrict the case insensitivity to specific tables or columns. It's all or nothing.
Note that the case insensitivity only applies to element and attribute values, not to the tag names themselves. XML tags and path expressions are still case sensitive. For example, the two XPath expressions
/Customer/city (lower case "c") and
/Customer/City (upper case "C") are different. The latter would not find any elements in our sample data because the
<city> element in our sample data is spelled in lower case.
Using a customized collation for the database may have an impact on the query performance since the number of possible string matches increases when you choose a looser UCA setting. In other words, string comparisons may be slightly more expensive in a case-insensitive database. To check the performance difference between queries in case-sensitive and case-insensitive databases, we created a regular database (case sensitive) and a case-insensitive database. We inserted 20,000 CustAcc documents from the TPoX benchmark and measured a variety of queries in both databases.
For queries that touch only a small to medium number of rows, the performance difference between the two test databases was negligible. We found a bigger difference for queries that touch a large number of rows, such as a table scan over all 20,000 XML documents with string comparison on every document. Such queries took 5% to 8% longer in the case-insensitive database. A small price to pay for case-insensitive search.
There are several methods to search DB2 data in a case-insensitive manner such as the use of generated columns (see Resources). While these methods can work well for relational data, they are not suitable for querying XML data. Case-insensitive handling of XML data is best achieved by creating a database with a customized Unicode collation. This causes all string value comparisons in the database to be case-insensitive and allows using XML indexes and relational indexes as usual. The overhead of increasing the search patterns (regarding the case or accents) proves to be very low.
- Participate in the discussion forum.
in DB2 Universal Database (
pureXML in DB2 9: Which way to query your XML data?, by Matthias Nicola and Fatma Ozcan, to gain information about how to query your XML data.
- In Making DB2 Case-Insensitive, by Blair Adamache, read tips on how you can easily allow your users to search relational data with no worries about case sensitivity.
- Explore A user-defined function for culturally correct collation in DB2 UDB, by Doug Doole, to learn about the user-defined function SORTKEY that provides culturally correct collation for Unicode databases.
- In Case-insensitive string comparisons with DB2 for Linux, UNIX, and Windows, by Knut Stolze, read about techniques that can be employed to implement case-insensitive string comparisons in SQL. Generated columns, extended indexes, and user-defined collating sequences are described in detail.
- Find all Unicode Collation Algorithm based collation attributes, their values, and typical usage examples in the DB2 Information Center.
- Visit the
DB2 pureXML Wiki
to stay up to speed on DB2's XML technology.
Matthias Nicola is a Senior Software Engineer for DB2 pureXML™ at IBM's Silicon Valley Lab. His work focuses on all aspects of XML in DB2, including XQuery, SQL/XML, storage, indexing and performance. Matthias also works closely with customers and business partners, assisting them in the design, implementation, and optimization of XML solutions. Prior to joining IBM, Matthias worked on data warehousing performance for Informix Software. He received his doctorate in computer science from the Technical University of Aachen, Germany.
Martin Sommerlandt is on an 18-month internship at the IBM Silicon Valley Lab. He studies Computer Science at University of Technology Dresden, Germany, and joined IBM in 2007 as an intern to work on DB2 pureXML™ solutions. His interests include XQuery, SQL/XML, XML performance, and customized XML solutions. Martin is an IBM certified DB2 Application Developer and IBM certified DB2 Database Administrator. He has also worked as a tutor for DB2 administration and programming.