Supported languages and code pages

When you create a text search index, you can specify the language used to parse the text documents. When you search, you can specify the language used to interpret the query terms. In addition, when you create a text search index on a binary data type column, you can specify a code page.

Language specification

A locale is a combination of language and territory (region or country) information and is represented by a five-character locale code, such as en_US. You define the message locale for a text search administration procedure by passing the locale code to the procedure. Refinements of these locale codes are possible, depending on the locales installed on the Db2® server.
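The five-character form can be checked mechanically. The following Python sketch assumes the layout seen in the locale codes of this topic (two lowercase letters, an underscore, two uppercase letters); the function name is hypothetical, and the check says nothing about whether a locale is installed on a given server.

```python
import re

# Assumed layout of a five-character locale code: language, "_", territory.
_LOCALE_CODE = re.compile(r"([a-z]{2})_([A-Z]{2})")


def split_locale_code(code):
    """Return (language, territory) for a code such as 'en_US'.

    Raises ValueError if the code does not have the five-character form.
    """
    match = _LOCALE_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"not a five-character locale code: {code!r}")
    return match.group(1), match.group(2)
```

For example, split_locale_code("de_CH") returns ("de", "CH"), while a value such as "german" raises ValueError.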

There is an important difference between specifying a language when you create a text search index and specifying a language when you issue a search query:
  • The locale that you specify in your db2ts CREATE INDEX command determines the language used to tokenize or analyze documents for indexing. If you know that all documents in the column to be indexed use a specific language, specify the applicable locale when you create the text search index. If you do not specify a locale, the database territory is used to determine the default setting for LANGUAGE. To have your documents scanned automatically to determine their locale, set the LANGUAGE attribute to AUTO in the SYSIBMTS.TSDEFAULTS view. This view describes the database defaults for text search as attribute-value pairs.
  • The locale that you specify in a search query is used to perform linguistic processing on the query and to help identify the base forms of the query terms. After the base forms have been identified, the locale plays no further part in the search itself. Thus, you could specify the English language for a query and still obtain German documents in the search result, provided that the base form of the search term is present in those documents.
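As a rough illustration of the two places where a locale can be given, the Python sketch below only composes command and predicate text; it does not connect to a database. The LANGUAGE keyword is taken from the text above. The CONNECT TO clause, the QUERYLANGUAGE option of the CONTAINS function, and the clause order reflect one reading of the Db2 Text Search reference and should be checked against it. The function names and the schema, table, column, and index names are hypothetical.

```python
def create_index_command(index_name, table, column, locale=None, database=None):
    """Compose the text of a db2ts CREATE INDEX command (assumed clause order)."""
    parts = [f"CREATE INDEX {index_name} FOR TEXT ON {table} ({column})"]
    if locale is not None:
        # Index-time locale: governs how documents are tokenized and analyzed.
        parts.append(f"LANGUAGE {locale}")
    if database is not None:
        parts.append(f"CONNECT TO {database}")
    return 'db2ts "' + " ".join(parts) + '"'


def contains_predicate(column, term, query_locale=None):
    """Compose a CONTAINS predicate, optionally naming the query locale."""
    escaped = term.replace("'", "''")  # double single quotes for an SQL literal
    if query_locale is None:
        return f"CONTAINS({column}, '{escaped}') = 1"
    # Query-time locale: used only for linguistic processing of the query terms.
    return f"CONTAINS({column}, '{escaped}', 'QUERYLANGUAGE={query_locale}') = 1"
```

Under these assumptions, create_index_command("MYSCHEMA.DOCS_IDX", "MYSCHEMA.DOCS", "BODY", locale="de_DE") yields db2ts "CREATE INDEX MYSCHEMA.DOCS_IDX FOR TEXT ON MYSCHEMA.DOCS (BODY) LANGUAGE de_DE", and contains_predicate("BODY", "haus", "en_US") yields CONTAINS(BODY, 'haus', 'QUERYLANGUAGE=en_US') = 1. The index locale and the query locale are set independently, which is why they can differ.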

The following table lists the locales that Db2 Text Search supports for document processing.
Table 1. Supported locales
Locale code  Language           Territory
ar_AA        Arabic             Arabic countries or regions
cs_CZ        Czech              Czech Republic
da_DK        Danish             Denmark
de_CH        German             Switzerland
de_DE        German             Germany
el_GR        Greek              Greece
en_AU        English            Australia
en_GB        English            United Kingdom
en_US        English            United States
es_ES        Spanish            Spain
fi_FI        Finnish            Finland
fr_CA        French             Canada
fr_FR        French             France
it_IT        Italian            Italy
ja_JP        Japanese           Japan
ko_KR        Korean             Korea, Republic of
nb_NO        Norwegian Bokmål   Norway
nl_NL        Dutch              Netherlands
nn_NO        Norwegian Nynorsk  Norway
pl_PL        Polish             Poland
pt_BR        Portuguese         Brazil
pt_PT        Portuguese         Portugal
ru_RU        Russian            Russia
sv_SE        Swedish            Sweden
zh_CN        Chinese            China
zh_TW        Chinese            Taiwan
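
Table 1 can be kept as a small lookup so that a locale code is checked before it is passed to a command or query. The following Python sketch copies its entries from Table 1; the helper names are hypothetical and are not part of any Db2 interface.

```python
# Locale code -> (language, territory), copied from Table 1.
SUPPORTED_LOCALES = {
    "ar_AA": ("Arabic", "Arabic countries or regions"),
    "cs_CZ": ("Czech", "Czech Republic"),
    "da_DK": ("Danish", "Denmark"),
    "de_CH": ("German", "Switzerland"),
    "de_DE": ("German", "Germany"),
    "el_GR": ("Greek", "Greece"),
    "en_AU": ("English", "Australia"),
    "en_GB": ("English", "United Kingdom"),
    "en_US": ("English", "United States"),
    "es_ES": ("Spanish", "Spain"),
    "fi_FI": ("Finnish", "Finland"),
    "fr_CA": ("French", "Canada"),
    "fr_FR": ("French", "France"),
    "it_IT": ("Italian", "Italy"),
    "ja_JP": ("Japanese", "Japan"),
    "ko_KR": ("Korean", "Korea, Republic of"),
    "nb_NO": ("Norwegian Bokmål", "Norway"),
    "nl_NL": ("Dutch", "Netherlands"),
    "nn_NO": ("Norwegian Nynorsk", "Norway"),
    "pl_PL": ("Polish", "Poland"),
    "pt_BR": ("Portuguese", "Brazil"),
    "pt_PT": ("Portuguese", "Portugal"),
    "ru_RU": ("Russian", "Russia"),
    "sv_SE": ("Swedish", "Sweden"),
    "zh_CN": ("Chinese", "China"),
    "zh_TW": ("Chinese", "Taiwan"),
}


def is_supported(code):
    """Return True if the locale code appears in Table 1."""
    return code in SUPPORTED_LOCALES


def locales_for_language(language):
    """Return the sorted locale codes listed for a language name."""
    return sorted(
        code
        for code, (lang, _territory) in SUPPORTED_LOCALES.items()
        if lang == language
    )
```

For example, locales_for_language("German") returns ["de_CH", "de_DE"], and is_supported("en_CA") returns False because that code is not listed.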

Code page specification

You can index documents that use one of the supported Db2 code pages. Specifying the code page when you create a text search index is optional, but doing so helps to identify the character encoding of binary columns. If you do not specify a code page for binary columns, the code page from the column property is used. For the list of supported values, see the Db2 documentation on supported territory codes and code pages.
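
To see why the code page matters for binary columns, consider that the same bytes yield different text under different encodings. The Python sketch below is an illustration outside Db2. The mapping from code page numbers to Python codec names is an assumption covering only three well-known pages (1208 for UTF-8, 819 for ISO 8859-1, 1252 for Windows Latin-1), and the helper name is hypothetical.

```python
# Illustrative mapping from a few code page numbers to Python codec names.
CODE_PAGE_TO_CODEC = {
    1208: "utf-8",    # Unicode UTF-8
    819: "latin-1",   # ISO 8859-1
    1252: "cp1252",   # Windows Latin-1
}


def decode_with_code_page(raw, code_page):
    """Decode bytes using the codec mapped to the given code page number."""
    return raw.decode(CODE_PAGE_TO_CODEC[code_page])


# Bytes as they might be stored in a binary column by a UTF-8 application.
raw = "Bokmål".encode("utf-8")

as_utf8 = decode_with_code_page(raw, 1208)    # the intended six-character text
as_latin1 = decode_with_code_page(raw, 819)   # seven characters, "å" is garbled
```

Decoding with the wrong code page raises no error here; it silently produces different text, which is the kind of mismatch that naming the code page at index creation helps to avoid.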