- Introducing case insensitivity in Informix
- Using constraints with CI
- Check constraints
- Using the DISTINCT keyword
- Using aggregates and GROUP BY
- Using ORDER BY
- Fragmenting data in a CI database
- Indexes in a CI Database
- CI joins
- Comparing CI datatypes with non CI datatypes
- Data distribution statistics
- Distributed queries
- Migrating a database to a CI database
- Downloadable resources
Perform case-insensitive queries in Informix
IBM Informix databases have traditionally been case-sensitive. This means that if you entered upper case "IBM" and lower case "ibm" in separate rows in a CHAR column, and then queried for the lower case version "ibm," only the lower case value would be returned. If you wanted to search CHAR data in a case-insensitive fashion, you had few options. You could create a functional index, use a basic text search index, or use UPPER or LOWER built-in functions; but all of these options would require application changes or possible performance degradation.
With Informix 11.70.xc2, for the first time Informix can now support case-insensitive (CI) database functionality. Why would you want CI searching? Consider if you had a column describing the company name for your clients. If you had multiple values representing the same company, one upper case and one lower case, such as "IBM" and "ibm," both represent accurate data that you would want returned in any such query. With new support for CI searching in Informix 11.70.xc2, you can now achieve such results.
Introducing case insensitivity in Informix
Informix offers case insensitivity through two datatypes: NCHAR and NVARCHAR, used in conjunction with an Informix CI database. All other types within this database remain case-sensitive, so if you wish to use the CI feature you must use these datatypes.
If you want support for the CI feature, you will need to establish it when you create the database. The following command shows how to create a CI database:
CREATE DATABASE mydb WITH LOG NLSCASE INSENSITIVE;
With this statement, you have created a database called 'mydb' that will allow all NCHAR and NVARCHAR columns created inside this database to behave in a CI fashion.
Sometimes you will have a database already created on your server. Before you enter or query data, it would be useful to know if that database supports CI or not. There is a column in the sysmaster:sysdatabases table that provides the answer. When is_case_insens is set to 1, this indicates that the database is CI. Listing 1 shows how to query this system table to find out whether or not the database supports CI:
Listing 1. How to know if your database is CI or not
SELECT is_case_insens FROM sysmaster:sysdatabases WHERE name LIKE 'mydb'; is_case_insens 1 1 row(s) retrieved.
Let's create a basic schema inside our CI database 'mydb' and a variation of the stores database to show how we can utilize CI functionality. We will use both NCHAR and NVARCHAR datatypes, because these are the only supported types for CI.
Listing 2. Create a basic stores schema using NCHAR and NVARCHAR datatypes
CREATE TABLE customer ( customer_num NCHAR(4), fname NCHAR(15), lname NCHAR(15), company NCHAR(20), address1 NVARCHAR(20), address2 NVARCHAR(20), city NCHAR(15), state NCHAR(2), zipcode NCHAR(5), phone NCHAR(18), PRIMARY KEY (customer_num) ); INSERT INTO customer VALUES('101a','LUDWIG','pauli','All SportS Supplies','213 erstWild Court',null, 'SunnyVale','CA','94086','408-789-8075'); INSERT INTO customer VALUES('102b','Carole','Sadler','Sports Spot','785 GearY St',null, 'San Francisco' ,'CA','94117','415-822-1289'); INSERT INTO customer VALUES('103c','Philip','Currie','Phil Sports','654 Poplar,P. o. boX 3498',null,'Palo AlTo' ,'Ca','94303','415-328-4543'); INSERT INTO customer VALUES('104d','AnthonY','Higgins','PlaY ball!','east Shopping Cntr. ','422 baY Road', 'Redwood CitY','CA','94026','415-368-1100'); INSERT INTO customer VALUES('105e','RaYmond','Vector','Los Altos Sports','1899 La Loma Drive',null, 'Los Altos','ca','94022','415-776-3249'); CREATE TABLE cust_calls ( customer_num NCHAR(4), call_dtime datetime year to minute, user_id NCHAR(32) default user, call_code NCHAR(1), call_descr CHAR(240), res_dtime datetime year to minute, res_descr NVARCHAR(240), PRIMARY KEY(customer_num, call_dtime), FOREIGN KEY (customer_num) REFERENCES customer (customer_num) ); INSERT INTO cust_calls VALUES('101A','2008-06-12 8:20','maryj','D','order was received ' ,'2008-06-12 8:25',' Authorized credit for two cans to customer'); INSERT INTO cust_calls VALUES('102B','2008-07-07 10:24','richc','L','order placed one month ago (6/7) not received.','2008-07-07 10:30','Checked with shipping (ed Smith)'); INSERT INTO cust_calls VALUES('103C','2008-07-01 15:00','richc','B','Richc' , '2008-07-02 8:21','Is sending new bill to customer'); INSERT INTO cust_calls VALUES('104D','2008-07-10 14:05','maryj','o','Customer likes our merchandise.', null,'Sent note to marketing'); INSERT INTO cust_calls VALUES('105E','2008-07-31 14:30','maryj','I','Received Hero watches',null,'Sent memo to shipping'); CREATE TABLE orders ( order_num serial(1001), order_date date, customer_num NCHAR(4), ship_instruct NVARCHAR(40), backlog NCHAR(1), po_num NCHAR(10), ship_date date, ship_weight decimal(8,2), ship_CHARge money(6), paid_date date, PRIMARY KEY (order_num), FOREIGN KEY (customer_num) REFERENCES customer (customer_num) ) FRAGMENT BY EXPRESSION partition part0 (backlog = 'y') in dbspace1, partition part1 (backlog = 'n') in dbspace2; INSERT INTO orders VALUES(1001,'05/20/2008','104D','express','y','B77836','06/01/2008', '20.4','10.0', '07/22/2008'); INSERT INTO orders VALUES(1002,'05/21/2008','101A','Po on box; deliver to bAck door onlY','n','9270', '05/26/2008','50.6','15.3','06/03/2008'); INSERT INTO orders VALUES(1003,'05/22/2008','104D','express','n','B77890','05/23/2008', '35.6','10.8', '06/14/2008'); INSERT INTO orders VALUES(1004,'05/22/2008','104d','ring bell twice','Y','8006', '05/30/2008','95.8','19.2', '04/14/2008'); INSERT INTO orders VALUES(1005,'05/24/2008','101a','cAll before deliverY','n','2865', '06/09/2008','80.8', '16.2','06/21/2008');
Listing 3 shows a basic query to find all the customers who live in California.
Listing 3. Query customers who live in California
SELECT fname, lname, state FROM customer WHERE state LIKE 'CA' fname lname state LUDWIG pauli CA Carole Sadler CA Philip Currie Ca AnthonY Higgins CA RaYmond Vector ca
You can see that even though we searched for 'CA' for state, the query returned all variations of state to include upper case, mixed case, and lower case ('CA', 'Ca' and 'ca'). A case-sensitive database would only return the three rows that matched the case used in the query in this case.
Using constraints with CI
The case insensitivity of the database will take effect on the primary and foreign keys defined on the NCHAR/NVARCHAR type columns. That is, if any variant form of the character data existed in the primary key column, then any other variant form in terms of case will not be allowed. Let's look at an example if we try to insert a new row into customer with a customer_num of '101A' instead of '101a'. Previously this would have been valid in a case-sensitive database, but not anymore in CI database.
Listing 4. Violation of a primary key constraint
INSERT into customer VALUES('101A','Jones','P','Some Shop','123 Lost Lane',null, 'NullyDale','CA', '94086','408-789-8075'); 268: Unique constraint (informix.u112_25) violated. 100: ISAM error: duplicate value for a record with unique key.
Similarly, if you want to insert into the orders table with a customer_num of '101A', this is fine, even though the corresponding primary key value is '101a':
Listing 5. Foreign key constraint
INSERT into orders VALUES(1006,'03/14/2007','101A','Call first','n','2765','05/06/2008', '70.8','15.2', '05/20/2008'); 268: Unique constraint (informix.u112_25) violated. 100: ISAM error: duplicate value for a record with unique key.
Check constraints can also have an impact in CI database. Here we have the customer table with the state column having the check constraints where it only allows the records to be inserted or updated that belong to state ‘ca’ or ‘nv’. The case-sensitive database will check for the records with the exact match of the state field, but a CI database will disregard the letter case difference. The CI database treats case variants among strings having the same letter sequence as duplicates, and will allow these VALUES: ‘CA’ ,‘Ca’ ,‘cA’ ,‘ca’ .
Listing 6. Check constraints
CREATE TABLE customer_check (customer_num NCHAR(4), fname NCHAR(15), lname NCHAR(15), company NCHAR(20), address1 NVARCHAR(20), address2 NVARCHAR(20), city NCHAR(15), state NCHAR(2), check (state in ('ca', 'nv')), zipcode NCHAR(5), phone NCHAR(18), primary key (customer_num)); Table created. INSERT INTO customer_check VALUES('101A','Jones','P','Some Shop','123 Lost Lane',null, 'NullyDale','CA', '94086','408-789-8075'); 1 Row Inserted
This example showed how to create the customer table with the check constraint on the column 'state,' and even though we enter 'CA' in our insert, this is valid because VALUES 'ca' and 'nv' are valid for this check.
Using the DISTINCT keyword
It is also worth noting that when selecting distinct values with NCHAR/NVARCHAR types in a CI database, your rows returned will be now different, because case is ignored. Suppose you wanted to find all the unique customers who had placed orders from your company.
Listing 7. Distinct select
SELECT DISTINCT(customer_num) FROM orders; customer_num 101A 104D
You can see from the results that '101a' and '101A' are not distinct VALUES with the CI feature.
Using aggregates and GROUP BY
If you are aggregating or grouping with NCHAR or NVARCHAR columns in a CI database, letter case will be ignored so rows returned would be different if, for example, you were taking a count of all orders related to the customer_num '104d'.
Listing 8. Aggregate select
SELECT COUNT(customer_num) FROM orders WHERE customer_num = '104d'; (count) 3
When using a GROUP BY clause in a CI database, both ‘104d’ and ‘104D’ will be placed in the one group and the output will be either of the values '104d' or '104D', based on the order in which the rows are processed.
Listing 9. GROUP BY Select
SELECT customer_num FROM orders GROUP BY customer_num; customer_num 101A 104D
These VALUES returned could just as easily be '101a' or '104d'. Values will be ordered without respect or preference for letter case. The letters 'a' and 'A' are considered equivalent, and no preference is given to one or the other during ordering.
Using ORDER BY
In CI database, collating operations on NCHAR and NVARCHAR data disregard letter case differences, so that the database server treats case variants among strings composed of same sequence of letters as duplicates. The collated list orders these case-insensitive duplicates in their order of retrieval.
Listing 10. Order By Select
SELECT order_num , customer_num FROM orders ORDER BY customer_num; order_num customer_num 1002 101A 1005 101a 1006 101A 1004 104d 1001 104D 1003 104D 6 row(s) retrieved.
In this example, it's clear that the VALUES ‘101A’ and ‘101a’ are treated as duplicates. Similarly, the VALUES ‘101D’ and ‘101d’ and the ordering among these duplicate values is governed by the order in which they were retrieved.
Fragmenting data in a CI database
If you want to fragment data, you can do so using NCHAR and NVARCHAR types. Here is an example of creating the orders table and fragmenting based on the backlog column. When NCHAR and NVARCHAR columns are used in fragment by expression in a CI database, the database server stores all case insensitive VALUES together in the same fragment, although each original data value remains the same. The server applies a similar approach for fragment elimination for NCHAR and NVARCHAR columns and expressions in a CI database. Fragmentation keys for indexes on CI columns are similar to CI fragmentation keys on tables. Both "Fragment by expression" and "fragment by list" will use case insensitive approach for NCHAR and NVARCHAR columns.
Listing 11. Fragmenting by expression with NCHAR
SELECT order_num, backlog FROM orders; order_num backlog 1004 Y 1001 n 1002 n 1003 n 1005 n 1006 n 6 row(s) retrieved.
In this case order_num 1001 and 1004 records are stored in partition part0 in dbspace1, because fragment expression of backlog VALUES 'Y' and 'y' match in a case-insensitive database.
Listing 12. Set Explain example
SET EXPLAIN ON; Explain set. SELECT order_num, backlog FROM orders WHERE backlog = 'y'; order_num backlog 1001 y 1004 Y 2 row(s) retrieved. SELECT order_num, backlog FROM orders WHERE backlog = 'y' Estimated Cost: 2 Estimated # of Rows Returned: 1 1) informix.orders: SEQUENTIAL SCAN (Serial, fragments: 0) Fragments Scanned: (0) part0 in dbspace1 Filters: informix.orders.backlog = 'y'
In the explain output in Listing 12, the server used fragment elimination on the NCHAR column backlog.
Indexes in a CI Database
You can create indexes on NCHAR and NVARCHAR columns and take advantage of the CI functionality. In this example, we create an index on the NCHAR column 'phone'.
Listing 13. Index creation in a CI Database
CREATE INDEX i ON customer(phone); index Created
If we create a unique index, duplicate values will return an error, as shown in Listing 14.
Listing 14. Unique Index creation in a CI Database
CREATE UNIQUE INDEX z ON customer(lname); Index created. INSERT INTO customer VALUES('105f','Boris','VECTOR','John Sports','18 Bull Drive',null,'Santa fe','nm','94011','315-776-3249'); 239: Could not insert new row - duplicate value in a UNIQUE INDEX column 100: ISAM error: duplicate value for a record with unique key.
The unique index gives us an error when we try to insert 'VECTOR' into lname, which already has a row containing the value 'Vector.'
Let's have a look at how to join two tables with CI data. We will join the customer and cust_calls table, using customer_num as the join column.
Listing 15. Join of customer and cust_calls
SELECT c.customer_num, c.lname, c.company, c.phone, u.call_dtime, u.call_descr FROM customer c, cust_calls u WHERE c.customer_num = u.customer_num ORDER BY customer_num; customer_num 101a lname pauli company All SportS Supplies phone 408-789-8075 call_dtime 2008-06-12 08:20 call_descr order was received customer_num 102b lname Sadler company Sports Spot phone 415-822-1289 call_dtime 2008-07-07 10:24 call_descr order placed one month ago (6/7) not received. customer_num 103c lname Currie company Phil Sports phone 415-328-4543 call_dtime 2008-07-01 15:00 call_descr Bill does not rflect credit customer_num 104d lname Higgins company PlaY ball! phone 415-368-1100 call_dtime 2008-07-10 14:05 call_descr Customer likes our merchandise. customer_num 105e lname Vector company Los Altos Sports phone 415-776-3249 call_dtime 2008-07-31 14:30 call_descr Received Hero watches
Even though the custom_num column data in customer and in cust_calls differs in case, our join completes successfully.
Now let's try a three table, outer join to join the customer, orders and cust_calls tables.
Listing 16. Outer join of customer, cust_calls and orders
SELECT c.customer_num, lname, o.order_num, order_date, call_dtime FROM customer c, outer orders o, outer cust_calls x WHERE c.customer_num = o.customer_num and c.customer_num = x.customer_num ORDER BY c.customer_num; customer_num lname order_num order_date call_dtime 101a pauli 1002 05/21/2008 2008-06-12 08:20 101a pauli 1005 05/24/2008 2008-06-12 08:20 101a pauli 1006 03/14/2007 2008-06-12 08:20 102b Sadler 2008-07-07 10:24 103c Currie 2008-07-01 15:00 104d Higgins 1001 05/20/2008 2008-07-10 14:05 104d Higgins 1003 05/22/2008 2008-07-10 14:05 104d Higgins 1004 05/22/2008 2008-07-10 14:05 105e Vector 2008-07-31 14:30
Comparing CI datatypes with non CI datatypes
When a query compares a character value with another data type, the server implicitly casts at least one of the two values. The data type of the result is indicated in the following table:
|NCHAR||NCHAR||NVARCHAR or NCHAR||NCHAR||NVARCHAR or NCHAR||NVARCHAR or NCHAR||NVARCHAR or NCHAR|
|NVARCHAR||NVARCHAR or NCHAR||NVARCHAR or NCHAR||NVARCHAR or NCHAR||NVARCHAR or NCHAR||NVARCHAR or NCHAR||NVARCHAR or NCHAR|
|CHAR||NCHAR||NVARCHAR or NCHAR||CHAR||VARCHAR or LVARCHAR||LVARCHAR||VARCHAR or LVARCHAR|
|VARCHAR||NVARCHAR or NCHAR||NVARCHAR or NCHAR||VARCHAR or LVARCHAR||VARCHAR or LVARCHAR||LVARCHAR||VARCHAR or LVARCHAR|
|LVARCHAR||NVARCHAR or NCHAR||NVARCHAR or NCHAR||LVARCHAR||LVARCHAR||LVARCHAR||LVARCHAR|
|Other||NVARCHAR or NCHAR||NVARCHAR or NCHAR||VARCHAR or LVARCHAR||VARCHAR or LVARCHAR||LVARCHAR||VARCHAR or LVARCHAR|
Listing 17. Comparing CHAR to NCHAR VALUES
SELECT user_id, call_descr FROM cust_calls WHERE user_id = call_descr; user_id richc call_descr Richc 1 row(s) retrieved.
You can see in this example that even though the column call_descr is a CHAR datatype, which is not CI, the comparison works. The server casts CHAR to NCHAR while processing the predicate of the query.
Data distribution statistics
Case insensitivity of the database while gathering the distribution statistics of a table will affect NCHAR and NVARCHAR columns, where the server will evaluate all the case variant values as one distinct value. The Informix server will use these statistics for generating the best possible plans for processing the queries. The following example shows what the statistics data looks like. In this example, after running 'UPDATE STATISTICS HIGH', the distribution information about the column 'cust_num' of the 'orders' table will look like this.
Listing 18. Exporting a non-CI database
Distribution for informix.orders.customer_num Constructed on 2011-04-29 16:29:15.00000 High Mode, 0.500000 Resolution --- DISTRIBUTION --- ( 101A ) --- OVERFLOW --- 1: ( 2, 101A ) 2: ( 3, 104d )
The most important factor here is that the VALUES '101A' and '101a' are stored together in one bin, unlike in case-sensitive databases, where they will go into separate bins.
Any queries across databases, whether within the same server or across server instances, will throw an error if the databases are not the same in terms of case sensitivity. If a participating database of the same instance or of another server is not of the same type in terms of case sensitivity, then the server will report an error. The only exceptions here are system databases.
Migrating a database to a CI database
To migrate data from an existing non-CI database, the user will need to use the Informix tools dbimport and dbexport. Before using these tools, the user can change CHAR and VARCHAR types in the non CI database to NCHAR and NVARCHAR (respectively) to make use of CI functionality. The user who is migrating the data will need to ensure that the data to be migrated to the CI database is consistent with CI conventions. For example, unique indexes on NCHAR columns in a CI database will throw exceptions for data such as 'IBM' and 'ibm'. Referential constraints might be violated when moving data from a non-CI database to a CI database. Consider if you had a primary key on a CHAR column in your non CI database which contained the rows 'IBM' and 'ibm'. When exporting this data and importing it into your CI database with a NCHAR column, you would receive a constraint violation error.
Listing 19. Exporting a non-CI database
Dbexport -c mydb dbexport completed
To import the data and recreate it as a CI database there is a new parameter available to dbimport. The command line option '-ci' tells dbimport to create the database as CI.
Listing 20. Listing 21. Importing data into a CI database
# First drop existing mydb database DROP DATABASE mydb; Database dropped. # Import old mydb database Dbimport -ci mydb -i ./ dbimport completed
You will now have a newly created CI database.
Informix CI databases offer more options to use and search your data. CI can be a very important issue when storing and retrieving data. Very often when you import data into your database from external sources, case will not always be preserved, and user errors can also mix case frequently. The new CI functionality makes the Informix DBA's life simpler when retrieving and searching data, and case is no longer a factor to consider when CI is implemented correctly.