Skip to main content

By clicking Submit, you agree to the developerWorks terms of use.

The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

All information submitted is secure.

  • Close [x]

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerworks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

By clicking Submit, you agree to the developerWorks terms of use.

All information submitted is secure.

  • Close [x]

Efficient case-insensitive search with DB2 pureXML

Search and order XML data in a case-insensitive manner

Matthias Nicola (mnicola@us.ibm.com), Senior Software Engineer, IBM Silicon Valley Laboratory
Author photo: Matthias Nicola
Matthias Nicola is a Senior Software Engineer for DB2 pureXML™ at IBM's Silicon Valley Lab. His work focuses on all aspects of XML in DB2, including XQuery, SQL/XML, storage, indexing and performance. Matthias also works closely with customers and business partners, assisting them in the design, implementation, and optimization of XML solutions. Prior to joining IBM, Matthias worked on data warehousing performance for Informix Software. He received his doctorate in computer science from the Technical University of Aachen, Germany.
Author Photo: Martin Sommerlandt
Martin Sommerlandt is on an 18-month internship at the IBM Silicon Valley Lab. He studies Computer Science at University of Technology Dresden, Germany, and joined IBM in 2007 as an intern to work on DB2 pureXML™ solutions. His interests include XQuery, SQL/XML, XML performance, and customized XML solutions. Martin is an IBM certified DB2 Application Developer and IBM certified DB2 Database Administrator. He has also worked as a tutor for DB2 administration and programming.

Summary:  The values of XML elements and attributes are, by definition, case sensitive. For example, if you search <city> elements for the value "Paris," you will not find "PARIS" or "paris." This can be solved with XQuery functions such as fn:upper-case(), but the performance will not always be optimal since the use of such functions precludes the use of XML indexes. This article explains how to create a case-insensitive database using DB2® pureXML™ and what behavior to expect for XML queries and XML indexes.

Date:  13 Nov 2008
Level:  Intermediate PDF:  A4 and Letter (74KB | 9 pages)Get Adobe® Reader®
Also available in:   Chinese  Japanese  Vietnamese

Activity:  11049 views
Comments:  

Case-insensitive search with upper and lower case functions

The following example should give you a clearer understanding of case-insensitive search. Listing 1 defines a table with an INTEGER column and an XML column and inserts 7 rows into that table. Each row contains a small customer document containing an XML element <city>.

The values in this element differ in terms of upper and lower case. Some values are all upper case, some are all lower case, and some are mixed case ("First character upper case"). This can happen if this data has come from a variety of applications that use different rules for upper and lower case data entry.


Listing 1. Sample table and data
				
CREATE TABLE customer (id INTEGER, xmldoc XML);

INSERT INTO customer (id, xmldoc)
VALUES (1,'<Customer id="1"><city>PARIS</city></Customer>'),
       (2,'<Customer id="2"><city>Tokyo</city></Customer>'),
       (3,'<Customer id="3"><city>tokyo</city></Customer>'),
       (4,'<Customer id="4"><city>PARIS</city></Customer>'),
       (5,'<Customer id="5"><city>paris</city></Customer>'),
       (6,'<Customer id="6"><city>Delhi</city></Customer>'),
       (7,'<Customer id="7"><city>Paris</city></Customer>');

An application that queries these XML documents to find customers with a specific city most likely requires case-insensitive search. For example, you may want to find all customers in Paris and hope to retrieve rows 1, 4, 5, and 7. However, if you search for the value "Paris," only row 7 is returned. To catch all four rows that you require, you can use the XQuery function fn:upper-case() to convert the city element values to upper case and compare them to "PARIS." This is shown in the query in Listing 2 which returns all four customers in Paris.


Listing 2. Select customers in Paris
				
SELECT id, XMLCAST( XMLQUERY('$XMLDOC/Customer/city') AS VARCHAR(15)) AS city
FROM customer
WHERE XMLEXISTS('$XMLDOC/Customer[fn:upper-case(city) = "PARIS"]');

If your query uses a parameter marker for the search value, then this parameter should also be converted to upper case. This is shown in Listing 3. The parameter marker ("?") is typed as VARCHAR(15) and passed into the XQuery predicate as the variable "c."


Listing 3. Select customers using a parameter marker
				
SELECT id, XMLCAST( XMLQUERY('$XMLDOC/Customer/city') AS VARCHAR(15)) AS city
FROM customer
WHERE XMLEXISTS('$XMLDOC/Customer[fn:upper-case(city) = fn:upper-case($c)]'
                 PASSING CAST(? AS VARCHAR(15)) AS "c");

Figure 1 shows the output of the sample queries above.


Figure 1. Result of sample query
Result of sample query

All this works fine if you are querying only a small amount of data or if your queries also contain other selective predicates such that the upper-case predicate is applied only to a small intermediate result set. The problem is that the predicate with the fn:upper-case() function prevents the use of XML indexes in DB2. Hence, this approach is not suitable for large amounts of data.

To avoid the use of the fn:upper-case() function and exploit the benefits of XML indexes, it is necessary to create a case-insensitive database.

Creating a case-insensitive database in DB2

DB2 supports locale-aware Unicode collation since Version 9.5 Fixpack 1. This allows you to ignore case and/or accents. To create a database that is case-insensitive for all string comparisons, it is necessary to use the collation UCA500R1 as shown in Listing 4.


Listing 4. Create a case-insensitive database
				
CREATE DATABASE testdb
USING CODESET UTF-8 TERRITORY US
COLLATE USING UCA500R1_LEN_S2;

What does the string UCA500R1_LEN_S2 mean exactly? UCA500R1 specifies that the default Unicode Collation Algorithm (UCA) based on the Unicode standard version 5.0.0 is used in this database. Since the default UCA cannot cover the collating sequence of every language supported by Unicode at the same time, the ordering of characters can be customized using optional attributes. The attributes are separated by an underscore (_). The UCA500R1 keyword plus any attributes form a UCA collation name.

The collation name used in Listing 4 contains two attributes: LEN and S2. LEN is the concatenation of L (language) and EN (ISO 639-1 language code for English). The second attribute S2 specifies the strength level which determines whether case or accent is taken into account when ordering or comparing strings. In Listing 4 strength level 2 is used so that "PARIS" and "paris" are equal. Following are examples of other possible values:

  • UCA500R1_LEN_S1 will collate "cliche" = "Cliche" = "cliché"
  • UCA500R1_LEN_S2 will collate "cliche" = "Cliche" < "cliché"
  • UCA500R1_LEN_S3 will collate "cliche" < "Cliche" < "cliché"

A list of all possible combinations to use as a UCA collation name can be found in the DB2 Information Center (see Resources).

Query XML data in a case-insensitive database

Since the database was created using the collation name UCA500R1 with strength level 2, the previous query can now simply be written as if all data would actually be in upper case, without the function fn:upper-case() (Listing 5). It does not even matter whether the search string is "Paris" or "PARIS" or any other combination of upper and lower case.


Listing 5. Select customers in Paris
				
SELECT id, XMLCAST( XMLQUERY('$XMLDOC/Customer/city') AS VARCHAR(15)) AS city
FROM customer
WHERE XMLEXISTS('$XMLDOC/Customer[city = "PARIS"]');


Figure 2. Result of sample query
Result of sample query

If you add an ORDER BY clause, to order by the extracted city value, the result set still remains the same: PARIS, paris, and Paris are treated as the same value.

To query this data efficiently, especially if the number of rows in the table is large, you should create an XML index on the XPath /Customer/city, as Listing 6 shows:


Listing 6. Create an XML index
	
CREATE INDEX customer_lang_idx ON test (xmldoc)
       GENERATE KEY USING XMLPATTERN '/Customer/city' AS SQL VARCHAR(15);

Now, if you explain the query (with Visual Explain or db2exfmt), you see that the index is used for the case-insensitive search:


Figure 3. Explain Plan for querying all customers in Paris in a case-insensitive database
Explain Plan for querying all customers in Paris in a case-insensitive database

The potential drawback of the approach described in this section is that all the data in all the columns in all tables in the entire database are treated in a case-insensitive manner. It is not possible to restrict the case insensitivity to specific tables or columns. It's all or nothing.

Note that the case insensitivity only applies to element and attribute values, not to the tag names themselves. XML tags and path expressions are still case sensitive. For example, the two XPath expressions /Customer/city (lower case "c") and /Customer/City (upper case "C") are different. The latter would not find any elements in our sample data because the <city> element in our sample data is spelled in lower case.

Performance

Using a customized collation for the database may have an impact on the query performance since the number of possible string matches increases when you choose a looser UCA setting. In other words, string comparisons may be slightly more expensive in a case-insensitive database. To check the performance difference between queries in case-sensitive and case-insensitive databases, we created a regular database (case sensitive) and a case-insensitive database. We inserted 20,000 CustAcc documents from the TPoX benchmark and measured a variety of queries in both databases.

For queries that touch only a small to medium number of rows, the performance difference between the two test databases was negligible. We found a bigger difference for queries that touch a large number of rows, such as a table scan over all 20,000 XML documents with string comparison on every document. Such queries took 5% to 8% longer in the case-insensitive database. A small price to pay for case-insensitive search.

Summary

There are several methods to search DB2 data in a case-insensitive manner such as the use of generated columns (see Resources). While these methods can work well for relational data, they are not suitable for querying XML data. Case-insensitive handling of XML data is best achieved by creating a database with a customized Unicode collation. This causes all string value comparisons in the database to be case-insensitive and allows using XML indexes and relational indexes as usual. The overhead of increasing the search patterns (regarding the case or accents) proves to be very low.


Resources

About the authors

Author photo: Matthias Nicola

Matthias Nicola is a Senior Software Engineer for DB2 pureXML™ at IBM's Silicon Valley Lab. His work focuses on all aspects of XML in DB2, including XQuery, SQL/XML, storage, indexing and performance. Matthias also works closely with customers and business partners, assisting them in the design, implementation, and optimization of XML solutions. Prior to joining IBM, Matthias worked on data warehousing performance for Informix Software. He received his doctorate in computer science from the Technical University of Aachen, Germany.

Author Photo: Martin Sommerlandt

Martin Sommerlandt is on an 18-month internship at the IBM Silicon Valley Lab. He studies Computer Science at University of Technology Dresden, Germany, and joined IBM in 2007 as an intern to work on DB2 pureXML™ solutions. His interests include XQuery, SQL/XML, XML performance, and customized XML solutions. Martin is an IBM certified DB2 Application Developer and IBM certified DB2 Database Administrator. He has also worked as a tutor for DB2 administration and programming.

Report abuse help

Report abuse

Thank you. This entry has been flagged for moderator attention.


Report abuse help

Report abuse

Report abuse submission failed. Please try again later.


developerWorks: Sign in


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Select information in your profile (name, country/region, and company) is displayed to the public and will accompany any content you post. You may update your IBM account at any time.

Choose your display name

The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


Rate this article

Comments

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, XML
ArticleID=351866
ArticleTitle=Efficient case-insensitive search with DB2 pureXML
publish-date=11132008
author1-email=mnicola@us.ibm.com
author1-email-cc=
author2-email=msommer@us.ibm.com
author2-email-cc=mnicola@us.ibm.com