Alternative Unicode conversion table for CCSID 943

There are several IBM® coded character set identifiers (CCSIDs) for Japanese code pages. CCSID 943 is registered as the Microsoft Japanese Windows Shift-JIS code page. You might encounter the following two problems when converting characters between CCSID 943 and Unicode. The problems are the result of differences between the IBM code page conversion tables and the Microsoft code page conversion tables.

Problem 1:

For historical reasons, over 300 characters in the CCSID 943 code page are represented by two or three code points each. Input method editors (IMEs) and code page conversion tables cause only one of these equivalent code points to be entered. For example, the lowercase Roman numeral one (ⅰ, U+2170) has two equivalent code points: X'EEEF' and X'FA40'. Microsoft Windows IMEs always generate X'FA40' when ⅰ is entered. In general, IBM and Microsoft use the same primary code point to represent a character, except for the following 13 characters:

Table 1. CCSID 943 Shift-JIS code point conversion
Character name (Unicode code point)     IBM primary Shift-JIS code point  Microsoft primary Shift-JIS code point
Roman numeral one (U+2160)              X'FA4A'                           X'8754'
Roman numeral two (U+2161)              X'FA4B'                           X'8755'
Roman numeral three (U+2162)            X'FA4C'                           X'8756'
Roman numeral four (U+2163)             X'FA4D'                           X'8757'
Roman numeral five (U+2164)             X'FA4E'                           X'8758'
Roman numeral six (U+2165)              X'FA4F'                           X'8759'
Roman numeral seven (U+2166)            X'FA50'                           X'875A'
Roman numeral eight (U+2167)            X'FA51'                           X'875B'
Roman numeral nine (U+2168)             X'FA52'                           X'875C'
Roman numeral ten (U+2169)              X'FA53'                           X'875D'
Parenthesized ideograph stock (U+3231)  X'FA58'                           X'878A'
Numero sign (U+2116)                    X'FA59'                           X'8782'
Telephone sign (U+2121)                 X'FA5A'                           X'8784'
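
The 13 exceptions in Table 1 can be written down as data. The following Python sketch is illustrative only: the names PRIMARY_CODE_POINTS and encode_primary are hypothetical and are not part of any Db2 interface. It simply returns the primary Shift-JIS bytes that each vendor's table would choose for one of these characters.

```python
# Table 1 as data: Unicode character -> (IBM primary, Microsoft primary),
# each given as the hexadecimal Shift-JIS code point from the table.
# PRIMARY_CODE_POINTS and encode_primary are hypothetical names.
PRIMARY_CODE_POINTS = {
    "\u2160": ("FA4A", "8754"),  # Roman numeral one
    "\u2161": ("FA4B", "8755"),  # Roman numeral two
    "\u2162": ("FA4C", "8756"),  # Roman numeral three
    "\u2163": ("FA4D", "8757"),  # Roman numeral four
    "\u2164": ("FA4E", "8758"),  # Roman numeral five
    "\u2165": ("FA4F", "8759"),  # Roman numeral six
    "\u2166": ("FA50", "875A"),  # Roman numeral seven
    "\u2167": ("FA51", "875B"),  # Roman numeral eight
    "\u2168": ("FA52", "875C"),  # Roman numeral nine
    "\u2169": ("FA53", "875D"),  # Roman numeral ten
    "\u3231": ("FA58", "878A"),  # Parenthesized ideograph stock
    "\u2116": ("FA59", "8782"),  # Numero sign
    "\u2121": ("FA5A", "8784"),  # Telephone sign
}


def encode_primary(char, vendor):
    """Return the primary CCSID 943 bytes for char under the given vendor's table.

    vendor must be "ibm" or "microsoft". Only the 13 characters of Table 1
    are covered; any other character raises KeyError.
    """
    ibm_hex, microsoft_hex = PRIMARY_CODE_POINTS[char]
    if vendor == "ibm":
        return bytes.fromhex(ibm_hex)
    if vendor == "microsoft":
        return bytes.fromhex(microsoft_hex)
    raise ValueError("vendor must be 'ibm' or 'microsoft'")


if __name__ == "__main__":
    for char in PRIMARY_CODE_POINTS:
        ibm = encode_primary(char, "ibm").hex().upper()
        microsoft = encode_primary(char, "microsoft").hex().upper()
        print(f"U+{ord(char):04X}  IBM X'{ibm}'  Microsoft X'{microsoft}'")
```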

IBM products such as Db2® database manager primarily use IBM code points, for example X'FA4A', to represent the uppercase Roman numeral one (Ⅰ), but Microsoft products use X'8754' to represent the same character. A Microsoft ODBC application can insert the Ⅰ character as X'8754' into a Db2 database of CCSID 943, and IBM Data Studio can insert the same character as X'FA4A' into the same CCSID 943 database. However, Microsoft ODBC applications can find only those rows that have Ⅰ encoded as X'8754', and IBM Data Studio can locate only those rows that have Ⅰ encoded as X'FA4A'. To enable IBM Data Studio to select Ⅰ as X'8754', you need to replace the default IBM conversion table from Unicode to CCSID 943 with the alternate Microsoft conversion table provided by the Db2 database manager.
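
The lookup mismatch can be reproduced in miniature. The following Python sketch is a simplified model, not Db2 behavior: ROWS is a hypothetical two-row table holding raw CCSID 943 bytes, and find_by_bytes and find_by_character are hypothetical helpers. A byte-for-byte comparison finds only the row written with the same code point as the search key, while a comparison on the Unicode character finds both.

```python
# Both byte sequences stand for the same character, Roman numeral one (U+2160).
# X'8754' is the Microsoft primary code point, X'FA4A' the IBM primary code point.
EQUIVALENT_BYTES = {
    bytes.fromhex("8754"): "\u2160",
    bytes.fromhex("FA4A"): "\u2160",
}

# Hypothetical stored rows: (row id, raw CCSID 943 bytes).
ROWS = [
    (1, bytes.fromhex("8754")),  # inserted by a client using the Microsoft table
    (2, bytes.fromhex("FA4A")),  # inserted by a client using the IBM table
]


def find_by_bytes(rows, key):
    """Return ids of rows whose stored bytes equal key exactly."""
    return [row_id for row_id, value in rows if value == key]


def find_by_character(rows, char):
    """Return ids of rows whose stored bytes decode to char."""
    return [row_id for row_id, value in rows if EQUIVALENT_BYTES.get(value) == char]


if __name__ == "__main__":
    print(find_by_bytes(ROWS, bytes.fromhex("8754")))  # only the Microsoft-encoded row
    print(find_by_bytes(ROWS, bytes.fromhex("FA4A")))  # only the IBM-encoded row
    print(find_by_character(ROWS, "\u2160"))           # both rows
```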

Problem 2:

The following characters, when converted from CCSID 943 to Unicode, result in different code points depending on whether the IBM conversion table or the Microsoft conversion table is used. For these characters, the IBM conversion table conforms to the character names as specified in the Japanese Industrial Standards JIS X 0208, JIS X 0212, and JIS X 0221.

Table 2. CCSID 943 to Unicode code point conversion
Shift-JIS code point (character name)  IBM primary code point (Unicode name)  Microsoft primary code point (Unicode name)
X'815C' (EM Dash)                      U+2014 (EM Dash)                       U+2015 (Horizontal Bar)
X'8160' (Wave Dash)                    U+301C (Wave Dash)                     U+FF5E (Fullwidth Tilde)
X'8161' (Double vertical line)         U+2016 (Double vertical line)          U+2225 (Parallel To)
X'817C' (Minus sign)                   U+2212 (Minus sign)                    U+FF0D (Fullwidth hyphen-minus)
X'FA55' (Broken bar)                   U+00A6 (Broken bar)                    U+FFE4 (Fullwidth broken bar)

For example, the EM dash character, with the CCSID 943 code point X'815C', is converted to the Unicode code point U+2014 when using the IBM conversion table, but to U+2015 when using the Microsoft conversion table. This can cause problems for Microsoft ODBC applications because they would treat U+2014 as an invalid code point. To avoid these problems, you need to replace the default IBM conversion table from CCSID 943 to Unicode with the alternate Microsoft conversion table provided by the Db2 database manager.
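
The five differences in Table 2 can be sketched as a lookup that returns a different Unicode code point depending on the table in use. The names TABLE_2, to_unicode, and python_cp932 below are hypothetical. As a cross-check, the sketch also decodes the same bytes with Python's built-in cp932 codec, which is assumed here to stand in for the Microsoft mapping; it is not the Db2 conversion table itself.

```python
# Table 2 as data: Shift-JIS code point (hex) -> (IBM result, Microsoft result).
# TABLE_2, to_unicode, and python_cp932 are hypothetical names.
TABLE_2 = {
    "815C": ("\u2014", "\u2015"),  # EM dash / horizontal bar
    "8160": ("\u301C", "\uFF5E"),  # wave dash / fullwidth tilde
    "8161": ("\u2016", "\u2225"),  # double vertical line / parallel to
    "817C": ("\u2212", "\uFF0D"),  # minus sign / fullwidth hyphen-minus
    "FA55": ("\u00A6", "\uFFE4"),  # broken bar / fullwidth broken bar
}


def to_unicode(sjis_hex, vendor):
    """Return the Unicode character for a Table 2 code point under 'ibm' or 'microsoft'."""
    ibm_char, microsoft_char = TABLE_2[sjis_hex]
    if vendor == "ibm":
        return ibm_char
    if vendor == "microsoft":
        return microsoft_char
    raise ValueError("vendor must be 'ibm' or 'microsoft'")


def python_cp932(sjis_hex):
    """Decode the same bytes with Python's cp932 codec (assumed Microsoft-style mapping)."""
    return bytes.fromhex(sjis_hex).decode("cp932")


if __name__ == "__main__":
    for sjis_hex in TABLE_2:
        ibm = ord(to_unicode(sjis_hex, "ibm"))
        microsoft = ord(to_unicode(sjis_hex, "microsoft"))
        codec = ord(python_cp932(sjis_hex))
        print(f"X'{sjis_hex}'  IBM U+{ibm:04X}  Microsoft U+{microsoft:04X}  cp932 U+{codec:04X}")
```

Text converted with one table and consumed by software that expects the other will contain the first column's code point where the second was expected, which is the mismatch described for X'815C'.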

The use of the alternate Microsoft conversion tables between CCSID 943 and Unicode should be restricted to closed environments, where the Db2 clients and the Db2 databases are running CCSID 943 and are all using the same alternate Microsoft conversion tables. If you have one Db2 client using the default IBM conversion tables and another client using the alternate Microsoft conversion tables, and both clients are inserting data into the same Db2 database of CCSID 943, the same character might be stored as different code points in the database.
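
One way to check whether data has already been written with both tables is to scan the raw CCSID 943 bytes for the IBM primary code points listed in Table 1. The following Python sketch assumes that the application can obtain the stored bytes unconverted. The names IBM_PRIMARY, is_lead_byte, and find_ibm_primary are hypothetical, and the lead-byte ranges are the usual Shift-JIS ranges for this code page family.

```python
# IBM primary code points from Table 1, as two-byte sequences.
IBM_PRIMARY = {
    bytes.fromhex(h)
    for h in (
        "FA4A", "FA4B", "FA4C", "FA4D", "FA4E",
        "FA4F", "FA50", "FA51", "FA52", "FA53",
        "FA58", "FA59", "FA5A",
    )
}


def is_lead_byte(value):
    """Return True if value starts a double-byte Shift-JIS character."""
    return 0x81 <= value <= 0x9F or 0xE0 <= value <= 0xFC


def find_ibm_primary(data):
    """Return byte offsets in data where an IBM primary code point from Table 1 occurs.

    The scan walks character by character so that a trail byte is never
    mistaken for the start of a new character.
    """
    offsets = []
    i = 0
    while i < len(data):
        if is_lead_byte(data[i]) and i + 1 < len(data):
            if data[i:i + 2] in IBM_PRIMARY:
                offsets.append(i)
            i += 2  # skip lead byte and trail byte
        else:
            i += 1  # single-byte character (or truncated input)
    return offsets


if __name__ == "__main__":
    mixed = b"A" + bytes.fromhex("FA4A") + bytes.fromhex("8754") + b"B"
    print(find_ibm_primary(mixed))  # offset of the IBM-encoded numeral only
```

If such a scan reports offsets in a database that is otherwise populated by clients using the Microsoft tables, the data was probably written by clients using different conversion tables.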