IBM Support

[Db2] Db2 Text Search : how to handle Japanese multi-byte specific numeric strings

Question & Answer


Question

 Why do we receive unexpected results for Japanese numeric strings, such as '4','4' and '④' on Db2 Text Search?

Answer

There are two cases how Db2 Text Search handles various numeric Japanese characters.
Case #1: Multi-byte Japanese numeric characters
===============================================
Here is a sample shell script.
#!/bin/sh

export LANG=ja_JP.utf8

db2 -v "drop db db1"
db2 -v "create db db1 using codeset utf-8 territory ja"
db2 -v "connect to db1"
db2 -v "create tablespace systoolspace in ibmcatgroup managed by automatic storage using stogroup ibmstogroup extentsize 4"
db2ts "start for text"
db2ts "enable database for text connect to db1"
db2 -v "create table t1 (c1 varchar(6) NOT NULL, doc vargraphic(512), primary key(c1))"
export DB2DBDFT=DB1

db2 -v "insert into t1 values ('1','4 : single byte numeric four')"
db2 -v "insert into t1 values ('2','4 : multi byte numeric four')"
db2 -v "insert into t1 values ('3','④ : multi byte numeric with circle four')"

db2ts "create index idx_text for text on t1(doc) connect to db1"
db2ts "update index idx_text for text connect to db1"

db2 -v "select * from t1 where contains(doc,'"4"')=1"
db2 -v "select * from t1 where contains(doc,'"4"')=1"
db2 -v "select * from t1 where contains(doc,'"④"')=1"
Db2 Text Search creates the index with a following rule:
  * Multi-byte numeric characters are converted into single numeric characters.
    This rule is applied any Japanese numeric characters, such as 1 to 9 and ① to ⑨.
Based on this rule, here is the created Db2 Text Search index data:
 
'1','4 : single byte numeric four'
'2','4 : multi byte numeric four'
'3','4 : multi byte numeric with circle four'
As the result, all three select statements return always 3 rows.  There is no multi-byte numeric in the Db2 Text Search index so no way to select "multi-byte numeric character" nor "multi-byte numeric with circle".  It is an expected Db2 Text Search behavior and it works as designed.

Case #2: Multi-byte Japanese roman numeric characters
=====================================================
Here is a sample shell script.
#!/bin/sh

export LANG=ja_JP.utf8

db2 -v "drop db db1"
db2 -v "create db db1 using codeset utf-8 territory ja"
db2 -v "connect to db1"
db2 -v "create tablespace systoolspace in ibmcatgroup managed by automatic storage using stogroup ibmstogroup extentsize 4"
db2ts "start for text"
db2ts "enable database for text connect to db1"
db2 -v "create table t1 (c1 varchar(6) NOT NULL, doc vargraphic(512), primary key(c1))"
export DB2DBDFT=DB1

db2 -v "insert into t1 values ('1','ⅳ : multi byte small letter roman numeric four')"
db2 -v "insert into t1 values ('2','Ⅳ : multi byte capital letter roman numeric four')"

db2ts "create index idx_text for text on t1(doc) connect to db1"
db2ts "update index idx_text for text connect to db1"

db2 -v "select * from t1 where contains(doc,'"ⅳ"')=1"
db2 -v "select * from t1 where contains(doc,'"Ⅳ"')=1"
Db2 Text Search creates the index with a following rule:
  * Multi-byte capital letter roman characters are converted into small letter one.
    This rule is applied any Japanese multi-byte capital roman letter characters, such asⅠto Ⅸ.
Based on this rule, here is the created Db2 Text Search index data:
 
'1','ⅳ : multi byte small letter roman numeric four'
'2','ⅳ : multi byte capital letter roman numeric four'
As the result, two select statements return always 2 rows.  There is no multi-byte capital letter roman numeric in the Db2 Text Search index so no way to select them.  It is an expected Db2 Text Search behavior and it works as designed.

NOTE:
Refer to technote for Db2 Text Search installation and configuration:
[Db2] Simple test case shell script for Db2 Text Search
https://www.ibm.com/support/pages/node/6187407

[{"Type":"MASTER","Line of Business":{"code":"LOB10","label":"Data and AI"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"ARM Category":[{"code":"a8m500000008PlsAAE","label":"Extenders->Text Search"}],"ARM Case Number":"TS004618261","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"All Versions"}]

Document Information

Modified date:
21 July 2021

UID

ibm16474145