Indexing special characters

During tokenization and language processing, Db2® Text Search identifies and indexes special characters as punctuation.

Special characters are token delimiters. For example, "jack_jones" is tokenized as three separate tokens: "jack", "_", and "jones". Email addresses, URLs, and file paths are also broken down into tokens. For example:
  • Jack_jones@ibm.com is tokenized as jack _ jones @ ibm . com
  • http://www.ibm.com is tokenized as http :// www . ibm . com
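The splitting shown above can be approximated with a short Python sketch. This is not the Db2 Text Search tokenizer. The pattern and the tokenize function are hypothetical, and the sketch assumes that a run of ASCII letters or digits forms one word token and that a run of adjacent special characters (such as "://") forms one special-character token.

```python
import re

# Hypothetical pattern (an assumption, not the Db2 Text Search rules):
# a run of ASCII letters/digits is a word token, and a run of other
# non-space characters is a special-character token.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]+")


def tokenize(text):
    """Split text into lowercase word tokens and special-character tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


print(tokenize("Jack_jones@ibm.com"))
# ['jack', '_', 'jones', '@', 'ibm', '.', 'com']
print(tokenize("http://www.ibm.com"))
# ['http', '://', 'www', '.', 'ibm', '.', 'com']
```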

Special characters do not occupy a token position in the file. For example, "jack_jones" is indexed with the underscore in the same token position as "jack". Special characters also do not occupy a token position when spaces are included. For example, "jack_jones" is indexed in the same way as "jack _ jones".
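The position rule can be illustrated with a small Python sketch. It is a simplified model, not the product's indexing code. The token_positions function and its pattern are hypothetical, and placing a leading special character at position 0 is an assumption made only for this illustration.

```python
import re

# Hypothetical pattern: named groups distinguish word tokens from
# special-character tokens (an assumption for illustration only).
TOKEN_RE = re.compile(r"(?P<word>[A-Za-z0-9]+)|(?P<special>[^A-Za-z0-9\s]+)")


def token_positions(text):
    """Return (token, position) pairs; only word tokens advance the position."""
    pairs = []
    position = -1
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup == "word":
            position += 1
            pairs.append((match.group().lower(), position))
        else:
            # A special character shares the position of the preceding word.
            # Assumption: a leading special character is placed at position 0.
            pairs.append((match.group(), max(position, 0)))
    return pairs


print(token_positions("jack_jones"))
# [('jack', 0), ('_', 0), ('jones', 1)]
print(token_positions("jack _ jones") == token_positions("jack_jones"))
# True
```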

The token position is used for exact phrase search and for proximity search. For example, if a document contains the expression jack_jones, searching for the exact phrase "jack jones" finds this document.
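The following Python sketch shows why such a phrase search succeeds. It is a toy model under stated assumptions, not the Db2 Text Search matching logic: the words and contains_phrase functions are hypothetical, and because special characters take no position, the sketch simply uses each word's index in the word list as its token position.

```python
import re

# Hypothetical word pattern: runs of ASCII letters and digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def words(text):
    """Return lowercase word tokens; a word's list index is its position."""
    return [word.lower() for word in WORD_RE.findall(text)]


def contains_phrase(document, phrase):
    """Return True if the phrase words occupy consecutive positions."""
    doc_words = words(document)
    query_words = words(phrase)
    size = len(query_words)
    if size == 0:
        return False
    return any(
        doc_words[start:start + size] == query_words
        for start in range(len(doc_words) - size + 1)
    )


print(contains_phrase("Contact jack_jones today", "jack jones"))  # True
print(contains_phrase("jones, jack", "jack jones"))               # False
```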

When the special characters in a sequence are indexed separately, they are searched in no particular order. For example, searching for "#$" also finds documents that contain "$#".
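One way to picture this order-insensitive behavior is a set comparison, as in the Python sketch below. This is only an illustration: the special_chars and matches_special functions are hypothetical, and the sketch deliberately ignores where in the document the characters occur, which the actual search engine may not.

```python
def special_chars(text):
    """Return the set of characters that are neither alphanumeric nor spaces."""
    return {ch for ch in text if not ch.isalnum() and not ch.isspace()}


def matches_special(document, query):
    """Return True if every special character in the query occurs in the document.

    Assumption: order is ignored because sets are unordered.
    """
    wanted = special_chars(query)
    return bool(wanted) and wanted <= special_chars(document)


print(matches_special("total $# due", "#$"))  # True
print(matches_special("total # due", "#$"))   # False
```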