Contents


Perform case-insensitive queries in Informix

Comments

IBM Informix databases have traditionally been case-sensitive. This means that if you entered upper case "IBM" and lower case "ibm" in separate rows in a CHAR column, and then queried for the lower case version "ibm," only the lower case value would be returned. If you wanted to search CHAR data in a case-insensitive fashion, you had few options. You could create a functional index, use a basic text search index, or use UPPER or LOWER built-in functions; but all of these options would require application changes or possible performance degradation.

With Informix 11.70.xc2, for the first time Informix can now support case-insensitive (CI) database functionality. Why would you want CI searching? Consider if you had a column describing the company name for your clients. If you had multiple values representing the same company, one upper case and one lower case, such as "IBM" and "ibm," both represent accurate data that you would want returned in any such query. With new support for CI searching in Informix 11.70.xc2, you can now achieve such results.

Introducing case insensitivity in Informix

Informix offers case insensitivity through two datatypes: NCHAR and NVARCHAR, used in conjunction with an Informix CI database. All other types within this database remain case-sensitive, so if you wish to use the CI feature you must use these datatypes.

If you want support for the CI feature, you will need to establish it when you create the database. The following command shows how to create a CI database:

CREATE DATABASE  mydb WITH LOG NLSCASE INSENSITIVE;

With this statement, you have created a database called 'mydb' that will allow all NCHAR and NVARCHAR columns created inside this database to behave in a CI fashion.

Sometimes you will have a database already created on your server. Before you enter or query data, it would be useful to know if that database supports CI or not. There is a column in the sysmaster:sysdatabases table that provides the answer. When is_case_insens is set to 1, this indicates that the database is CI. Listing 1 shows how to query this system table to find out whether or not the database supports CI:

Listing 1. How to know if your database is CI or not
SELECT is_case_insens FROM sysmaster:sysdatabases WHERE name LIKE  'mydb';

is_case_insens
            1
1 row(s) retrieved.

Let's create a basic schema inside our CI database 'mydb' and a variation of the stores database to show how we can utilize CI functionality. We will use both NCHAR and NVARCHAR datatypes, because these are the only supported types for CI.

Listing 2. Create a basic stores schema using NCHAR and NVARCHAR datatypes
CREATE TABLE customer
        (
        customer_num               NCHAR(4),
        fname                      NCHAR(15),
        lname                      NCHAR(15),
        company                    NCHAR(20),
        address1                   NVARCHAR(20),
        address2                   NVARCHAR(20),
        city                       NCHAR(15),
        state                      NCHAR(2),
        zipcode                    NCHAR(5),
        phone                      NCHAR(18),
       PRIMARY KEY (customer_num)
        );
        
INSERT INTO customer VALUES('101a','LUDWIG','pauli','All SportS Supplies','213 erstWild 
Court',null,
'SunnyVale','CA','94086','408-789-8075');
INSERT INTO customer VALUES('102b','Carole','Sadler','Sports Spot','785 GearY St',null,
'San Francisco'
,'CA','94117','415-822-1289');
INSERT INTO customer VALUES('103c','Philip','Currie','Phil Sports','654 Poplar,P. o. boX 
3498',null,'Palo AlTo'
,'Ca','94303','415-328-4543');
INSERT INTO customer VALUES('104d','AnthonY','Higgins','PlaY ball!','east Shopping Cntr.
','422 baY Road',
'Redwood CitY','CA','94026','415-368-1100');
INSERT INTO customer VALUES('105e','RaYmond','Vector','Los Altos Sports','1899 La Loma 
Drive',null,
'Los Altos','ca','94022','415-776-3249');

CREATE TABLE cust_calls
       (
       customer_num                NCHAR(4),
       call_dtime                  datetime year to minute,
       user_id                     NCHAR(32) default user,
       call_code                   NCHAR(1),
       call_descr                  CHAR(240),
       res_dtime                   datetime year to minute,
       res_descr                   NVARCHAR(240),
       PRIMARY KEY(customer_num, call_dtime),
       FOREIGN KEY (customer_num) REFERENCES customer (customer_num)
        );

INSERT INTO cust_calls VALUES('101A','2008-06-12 8:20','maryj','D','order was received
' ,'2008-06-12 8:25','
Authorized credit for two cans to customer');
INSERT INTO cust_calls VALUES('102B','2008-07-07 10:24','richc','L','order placed 
one month ago (6/7) not 
received.','2008-07-07 10:30','Checked with shipping (ed Smith)');
INSERT INTO cust_calls VALUES('103C','2008-07-01 15:00','richc','B','Richc' ,
'2008-07-02 8:21','Is sending 
new bill to customer');
INSERT INTO cust_calls VALUES('104D','2008-07-10 14:05','maryj','o','Customer 
likes our merchandise.',
null,'Sent note to marketing');
INSERT INTO cust_calls VALUES('105E','2008-07-31 14:30','maryj','I','Received
 Hero watches',null,'Sent 
memo to shipping');

CREATE TABLE orders
        (
        order_num                   serial(1001),
        order_date                  date,
        customer_num                NCHAR(4),
        ship_instruct               NVARCHAR(40),
        backlog                     NCHAR(1),
        po_num                      NCHAR(10),
        ship_date                   date,
        ship_weight                 decimal(8,2),
        ship_CHARge                 money(6),
        paid_date                   date,
        PRIMARY KEY (order_num),
        FOREIGN KEY (customer_num) REFERENCES customer (customer_num)
        )
FRAGMENT BY EXPRESSION
        partition part0 (backlog = 'y') in dbspace1,
        partition part1 (backlog = 'n') in dbspace2;


INSERT INTO orders VALUES(1001,'05/20/2008','104D','express','y','B77836','06/01/2008',
'20.4','10.0',
'07/22/2008');
INSERT INTO orders VALUES(1002,'05/21/2008','101A','Po on box; deliver to bAck door 
onlY','n','9270',
'05/26/2008','50.6','15.3','06/03/2008');
INSERT INTO orders VALUES(1003,'05/22/2008','104D','express','n','B77890','05/23/2008',
'35.6','10.8',
'06/14/2008');
INSERT INTO orders VALUES(1004,'05/22/2008','104d','ring bell twice','Y','8006',
'05/30/2008','95.8','19.2',
'04/14/2008');
INSERT INTO orders VALUES(1005,'05/24/2008','101a','cAll before deliverY','n','2865',
'06/09/2008','80.8',
'16.2','06/21/2008');

Listing 3 shows a basic query to find all the customers who live in California.

Listing 3. Query customers who live in California
SELECT  fname, lname, state FROM  customer WHERE state LIKE  'CA' 

fname           lname           state

LUDWIG          pauli           CA
Carole          Sadler          CA
Philip          Currie          Ca
AnthonY         Higgins         CA
RaYmond         Vector          ca

You can see that even though we searched for 'CA' for state, the query returned all variations of state to include upper case, mixed case, and lower case ('CA', 'Ca' and 'ca'). A case-sensitive database would only return the three rows that matched the case used in the query in this case.

Using constraints with CI

The case insensitivity of the database will take effect on the primary and foreign keys defined on the NCHAR/NVARCHAR type columns. That is, if any variant form of the character data existed in the primary key column, then any other variant form in terms of case will not be allowed. Let's look at an example if we try to insert a new row into customer with a customer_num of '101A' instead of '101a'. Previously this would have been valid in a case-sensitive database, but not anymore in CI database.

Listing 4. Violation of a primary key constraint
INSERT into customer VALUES('101A','Jones','P','Some Shop','123 Lost Lane',null,
'NullyDale','CA',
'94086','408-789-8075');

  268: Unique constraint (informix.u112_25) violated.

  100: ISAM error:  duplicate value for a record with unique key.

Similarly, if you want to insert into the orders table with a customer_num of '101A', this is fine, even though the corresponding primary key value is '101a':

Listing 5. Foreign key constraint
INSERT  into orders VALUES(1006,'03/14/2007','101A','Call first','n','2765','05/06/2008',
'70.8','15.2',
'05/20/2008');

  268: Unique constraint (informix.u112_25) violated.

  100: ISAM error:  duplicate value for a record with unique key.

Check constraints

Check constraints can also have an impact in CI database. Here we have the customer table with the state column having the check constraints where it only allows the records to be inserted or updated that belong to state ‘ca’ or ‘nv’. The case-sensitive database will check for the records with the exact match of the state field, but a CI database will disregard the letter case difference. The CI database treats case variants among strings having the same letter sequence as duplicates, and will allow these VALUES: ‘CA’ ,‘Ca’ ,‘cA’ ,‘ca’ .

Listing 6. Check constraints
CREATE TABLE customer_check  (customer_num  NCHAR(4), fname  NCHAR(15),  lname  NCHAR(15),
company  NCHAR(20), address1  NVARCHAR(20),  address2  NVARCHAR(20), city  NCHAR(15),  
state
NCHAR(2), check (state in ('ca', 'nv')), zipcode  NCHAR(5), phone  NCHAR(18), 
primary key (customer_num));

Table created.

INSERT INTO customer_check VALUES('101A','Jones','P','Some Shop','123 Lost Lane',null,
'NullyDale','CA',
'94086','408-789-8075');


 1 Row Inserted

This example showed how to create the customer table with the check constraint on the column 'state,' and even though we enter 'CA' in our insert, this is valid because VALUES 'ca' and 'nv' are valid for this check.

Using the DISTINCT keyword

It is also worth noting that when selecting distinct values with NCHAR/NVARCHAR types in a CI database, your rows returned will be now different, because case is ignored. Suppose you wanted to find all the unique customers who had placed orders from your company.

Listing 7. Distinct select
SELECT DISTINCT(customer_num) FROM orders;

customer_num

101A
104D

You can see from the results that '101a' and '101A' are not distinct VALUES with the CI feature.

Using aggregates and GROUP BY

If you are aggregating or grouping with NCHAR or NVARCHAR columns in a CI database, letter case will be ignored so rows returned would be different if, for example, you were taking a count of all orders related to the customer_num '104d'.

Listing 8. Aggregate select
SELECT COUNT(customer_num) FROM orders WHERE customer_num = '104d';

(count)

    3

When using a GROUP BY clause in a CI database, both ‘104d’ and ‘104D’ will be placed in the one group and the output will be either of the values '104d' or '104D', based on the order in which the rows are processed.

Listing 9. GROUP BY Select
SELECT customer_num FROM orders GROUP BY customer_num;

customer_num

101A

104D

These VALUES returned could just as easily be '101a' or '104d'. Values will be ordered without respect or preference for letter case. The letters 'a' and 'A' are considered equivalent, and no preference is given to one or the other during ordering.

Using ORDER BY

In CI database, collating operations on NCHAR and NVARCHAR data disregard letter case differences, so that the database server treats case variants among strings composed of same sequence of letters as duplicates. The collated list orders these case-insensitive duplicates in their order of retrieval.

Listing 10. Order By Select
SELECT order_num , customer_num FROM orders ORDER BY customer_num;
   order_num customer_num

       1002 101A
       1005 101a
       1006 101A
       1004 104d
       1001 104D
       1003 104D
6 row(s) retrieved.

In this example, it's clear that the VALUES ‘101A’ and ‘101a’ are treated as duplicates. Similarly, the VALUES ‘101D’ and ‘101d’ and the ordering among these duplicate values is governed by the order in which they were retrieved.

Fragmenting data in a CI database

If you want to fragment data, you can do so using NCHAR and NVARCHAR types. Here is an example of creating the orders table and fragmenting based on the backlog column. When NCHAR and NVARCHAR columns are used in fragment by expression in a CI database, the database server stores all case insensitive VALUES together in the same fragment, although each original data value remains the same. The server applies a similar approach for fragment elimination for NCHAR and NVARCHAR columns and expressions in a CI database. Fragmentation keys for indexes on CI columns are similar to CI fragmentation keys on tables. Both "Fragment by expression" and "fragment by list" will use case insensitive approach for NCHAR and NVARCHAR columns.

Listing 11. Fragmenting by expression with NCHAR
SELECT order_num, backlog FROM orders;

 order_num backlog

       1004 Y
       1001 n
       1002 n
       1003 n
       1005 n
       1006 n

6 row(s) retrieved.

In this case order_num 1001 and 1004 records are stored in partition part0 in dbspace1, because fragment expression of backlog VALUES 'Y' and 'y' match in a case-insensitive database.

Listing 12. Set Explain example
SET EXPLAIN ON;

Explain set.

SELECT order_num, backlog FROM orders WHERE backlog = 'y';


  order_num backlog 

       1001 y
       1004 Y

2 row(s) retrieved.


SELECT order_num, backlog FROM orders WHERE backlog = 'y'

Estimated Cost: 2
Estimated # of Rows Returned: 1

  1) informix.orders: SEQUENTIAL SCAN  (Serial, fragments: 0)
        Fragments Scanned: (0) part0 in dbspace1

        Filters: informix.orders.backlog = 'y'

In the explain output in Listing 12, the server used fragment elimination on the NCHAR column backlog.

Indexes in a CI Database

You can create indexes on NCHAR and NVARCHAR columns and take advantage of the CI functionality. In this example, we create an index on the NCHAR column 'phone'.

Listing 13. Index creation in a CI Database
CREATE INDEX i ON customer(phone);

index Created

If we create a unique index, duplicate values will return an error, as shown in Listing 14.

Listing 14. Unique Index creation in a CI Database
CREATE UNIQUE INDEX z ON customer(lname);

 Index created.

INSERT INTO customer VALUES('105f','Boris','VECTOR','John Sports','18 Bull 
Drive',null,'Santa fe','nm','94011','315-776-3249');

   239: Could not insert new row - duplicate value in a UNIQUE INDEX column 
   100: ISAM error:  duplicate value for a record with unique key.

The unique index gives us an error when we try to insert 'VECTOR' into lname, which already has a row containing the value 'Vector.'

CI joins

Let's have a look at how to join two tables with CI data. We will join the customer and cust_calls table, using customer_num as the join column.

Listing 15. Join of customer and cust_calls
SELECT c.customer_num, c.lname, c.company,
       c.phone, u.call_dtime, u.call_descr
       FROM customer c, cust_calls u
       WHERE c.customer_num = u.customer_num
       ORDER BY customer_num;

customer_num       101a
lname              pauli
company            All SportS Supplies
phone              408-789-8075
call_dtime         2008-06-12 08:20
call_descr         order was received

customer_num  102b
lname              Sadler
company            Sports Spot
phone              415-822-1289
call_dtime         2008-07-07 10:24
call_descr         order placed one month ago (6/7) not received.

customer_num  103c
lname              Currie
company            Phil Sports
phone              415-328-4543
call_dtime         2008-07-01 15:00
call_descr         Bill does not rflect credit

customer_num  104d
lname              Higgins
company            PlaY ball!
phone              415-368-1100
call_dtime         2008-07-10 14:05
call_descr         Customer likes our merchandise.

customer_num  105e
lname              Vector
company            Los Altos Sports
phone              415-776-3249
call_dtime         2008-07-31 14:30
call_descr         Received Hero watches

Even though the custom_num column data in customer and in cust_calls differs in case, our join completes successfully.

Now let's try a three table, outer join to join the customer, orders and cust_calls tables.

Listing 16. Outer join of customer, cust_calls and orders
SELECT c.customer_num, lname, o.order_num,
       order_date, call_dtime
       FROM customer c, outer orders o, outer cust_calls x
       WHERE c.customer_num = o.customer_num
                and c.customer_num = x.customer_num
       ORDER BY c.customer_num;


customer_num lname       order_num  order_date  call_dtime

101a         pauli            1002  05/21/2008  2008-06-12 08:20
101a         pauli            1005  05/24/2008  2008-06-12 08:20
101a         pauli            1006  03/14/2007  2008-06-12 08:20
102b         Sadler                             2008-07-07 10:24
103c         Currie                             2008-07-01 15:00
104d         Higgins          1001  05/20/2008  2008-07-10 14:05
104d         Higgins          1003  05/22/2008  2008-07-10 14:05
104d         Higgins          1004  05/22/2008  2008-07-10 14:05
105e         Vector                             2008-07-31 14:30

Comparing CI datatypes with non CI datatypes

When a query compares a character value with another data type, the server implicitly casts at least one of the two values. The data type of the result is indicated in the following table:

TYPENCHARNVARCHARCHARVARCHARLVARCHAROther
NCHARNCHARNVARCHAR or NCHARNCHARNVARCHAR or NCHARNVARCHAR or NCHARNVARCHAR or NCHAR
NVARCHARNVARCHAR or NCHARNVARCHAR or NCHARNVARCHAR or NCHARNVARCHAR or NCHARNVARCHAR or NCHARNVARCHAR or NCHAR
CHARNCHARNVARCHAR or NCHARCHARVARCHAR or LVARCHARLVARCHARVARCHAR or LVARCHAR
VARCHARNVARCHAR or NCHARNVARCHAR or NCHARVARCHAR or LVARCHARVARCHAR or LVARCHARLVARCHARVARCHAR or LVARCHAR
LVARCHARNVARCHAR or NCHARNVARCHAR or NCHARLVARCHARLVARCHARLVARCHARLVARCHAR
OtherNVARCHAR or NCHARNVARCHAR or NCHARVARCHAR or LVARCHARVARCHAR or LVARCHARLVARCHARVARCHAR or LVARCHAR
Listing 17. Comparing CHAR to NCHAR VALUES
SELECT user_id, call_descr FROM cust_calls WHERE user_id = call_descr;

user_id     richc
call_descr  Richc

 1 row(s) retrieved.

You can see in this example that even though the column call_descr is a CHAR datatype, which is not CI, the comparison works. The server casts CHAR to NCHAR while processing the predicate of the query.

Data distribution statistics

Case insensitivity of the database while gathering the distribution statistics of a table will affect NCHAR and NVARCHAR columns, where the server will evaluate all the case variant values as one distinct value. The Informix server will use these statistics for generating the best possible plans for processing the queries. The following example shows what the statistics data looks like. In this example, after running 'UPDATE STATISTICS HIGH', the distribution information about the column 'cust_num' of the 'orders' table will look like this.

Listing 18. Exporting a non-CI database
Distribution for informix.orders.customer_num

Constructed on 2011-04-29 16:29:15.00000

High Mode, 0.500000 Resolution

--- DISTRIBUTION ---

    (        101A )


--- OVERFLOW ---

 1: ( 2,     101A )
 2: ( 3,     104d )

The most important factor here is that the VALUES '101A' and '101a' are stored together in one bin, unlike in case-sensitive databases, where they will go into separate bins.

Distributed queries

Any queries across databases, whether within the same server or across server instances, will throw an error if the databases are not the same in terms of case sensitivity. If a participating database of the same instance or of another server is not of the same type in terms of case sensitivity, then the server will report an error. The only exceptions here are system databases.

Migrating a database to a CI database

To migrate data from an existing non-CI database, the user will need to use the Informix tools dbimport and dbexport. Before using these tools, the user can change CHAR and VARCHAR types in the non CI database to NCHAR and NVARCHAR (respectively) to make use of CI functionality. The user who is migrating the data will need to ensure that the data to be migrated to the CI database is consistent with CI conventions. For example, unique indexes on NCHAR columns in a CI database will throw exceptions for data such as 'IBM' and 'ibm'. Referential constraints might be violated when moving data from a non-CI database to a CI database. Consider if you had a primary key on a CHAR column in your non CI database which contained the rows 'IBM' and 'ibm'. When exporting this data and importing it into your CI database with a NCHAR column, you would receive a constraint violation error.

Listing 19. Exporting a non-CI database
Dbexport -c mydb

dbexport completed

To import the data and recreate it as a CI database there is a new parameter available to dbimport. The command line option '-ci' tells dbimport to create the database as CI.

Listing 20. Listing 21. Importing data into a CI database
# First drop existing mydb database
DROP DATABASE mydb;

Database dropped.

# Import old mydb database
Dbimport  -ci mydb -i  ./

dbimport completed

You will now have a newly created CI database.

Conclusion

Informix CI databases offer more options to use and search your data. CI can be a very important issue when storing and retrieving data. Very often when you import data into your database from external sources, case will not always be preserved, and user errors can also mix case frequently. The new CI functionality makes the Informix DBA's life simpler when retrieving and searching data, and case is no longer a factor to consider when CI is implemented correctly.


Downloadable resources


Comments

Sign in or register to add and subscribe to comments.

static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=753995
ArticleTitle=Perform case-insensitive queries in Informix
publish-date=08252011