DB2 Text Search, Part 2: Understanding linguistic processing for full text search in DB2

When is a document included in a search result?

DB2 Text Search enables full text search on structured and unstructured data stored in a DB2® database. It uses linguistic processing to determine the result set of a full text-search query. This article provides an overview of the document analysis process and the treatment of various linguistic properties for text index updates, as well as an introduction to some intricacies of query syntax and semantics and their impact on query execution results.

Marion Behnen (mbehnen@us.ibm.com), DB2 Text Search Development, IBM

Marion Behnen is a Senior Software Engineer within the IBM Software Group and works as a technical lead for full-text search in DB2. She has more than 20 years of experience with database application development, business reporting, data warehousing, and business process integration. Prior to joining IBM, she was involved with many aspects of business process and data integration, in particular for the manufacturing industry.



17 June 2010

Also available in Chinese Vietnamese

Introduction

DB2 Text Search enables full text search on structured and unstructured data stored in a DB2 database. It uses linguistic processing to determine the result set of a full text-search query. DB2 Text Search is available on DB2 for Linux®, UNIX®, and Windows®, Version 9.5 Fixpack 1 and later.

Text indices can be created on a range of data types from small VARCHARs to large BLOBs that contain text objects. In the context of the DB2 Text Search index processing, for each row, the content of each column is treated as a document. Depending on the type of document, which is defined with the text index format, different ingest handlers and parsers are used to extract and filter the text.

For example, an XML document can be indexed in either the format XML or format TEXT. If format XML is used, an XML parser is used to parse and index the document, and both SQL query syntax and an XPath-like syntax can be used for querying. In this case, structural elements can be used to identify the portion of text that should be searched, but structural elements themselves are not part of the searchable body of text. With format TEXT on the other hand, a plain-text parser is used to parse and index the document. In this case, the structural elements are treated as a part of the searchable body of text, and only SQL query syntax is applicable. Figure 1 shows the basic flow.

Figure 1. Processing steps
Diagram showing flow from Ingest Handler to Parser to Tokenizer to Indexer

DB2 Text Search depends on IBM LanguageWare for lexical analysis. DB2 Text Search uses a dictionary-based approach for supported languages. It uses a character-processing and word-processing approach (without morphological analysis) for other languages. The search terms in a full text-search query are analyzed similarly to the analysis done during the indexing process. Whether a document is returned as part of the result set depends not only on the terms and how they are combined in the query, but also on the result of the linguistic processing and implicitly on the configuration of the text server, especially of the parser and tokenizer.

This section describes aspects of the linguistic processing, and it provides examples of how linguistic properties influence search results.


Minimal queries

For minimal queries, valid search terms consist of at least two characters, and they cannot contain punctuation marks only. Listing 1 shows an example.

Listing 1. Examples for valid minimal queries
select content from sampleen where contains(plaincol, 'to')=1
select content from sampleen where contains(plaincol, 't*')=1
select content from sampleen where contains(plaincol, 'c#')=1

Escape characters

Certain characters have semantic meaning as part of the search query, and they need to be escaped if they are intended to be part of a search term. Single and double quotes are escaped by doubling the quote mark, such as using Smith''s to search for Smith's.

The following characters need to be escaped with \ to be treated as part of a search query.

  • Symbols that require the escape character to achieve a syntactically valid query: < [ { ~ = : > ] }
  • Symbols that require the escape character to change semantics: && || ! * ? \ -
  • Symbols that cause syntactic errors for single use and require an escape character to change semantics:( ) " "

Stemming and lemmas

Stemming is the process of finding the constant part of a word, which is the fragment that stays the same no matter which word tense or mode is used. This fragment is not necessarily the morphological root. For example:

  • The stem for both produce and production is produc.
  • The stem for good is good.
  • The stem for better is better.

Lemmatization is the process of identifying the base, non-inflected form of a word. During index update processing, the lexical analysis identifies the lemmas of the words in the document, and the analysis adds this information to the text index. Correspondingly, for a keyword search, the lemmas of the search term are identified, and all documents containing that lemma are returned. For example:

  • The lemma for good is good.
  • The lemmas for better are better, good, and well.
  • The lemma for well is well.

Note the difference for better between the result for stemming and lemmatization.

A search for good based on lemmatization returns documents that match good or better, because good is matched to good itself, and good is matched to the occurrence of the lemma good associated with better. Listing 2 shows example results of a keyword search for good.

Listing 2. Example result for keyword search for good
select pk, substr(plaincol,1,68) from simpleen where contains(plaincol, 'good')=1
PK       PLAINCOL                                                                   
-------- -----------------------------------------------------------
       4 it is better here                                                   
       5 this is a good story                                                

  2 record(s) selected.

A search for better returns documents that contain good, better, or well, because all of them are listed as lemmas for better. Listing 3 shows example results of a keyword search for better.

Listing 3. Example result for keyword search for better
select pk, plaincol from simpleen where contains(plaincol, 'better')=1

PK       PLAINCOL                                                                  
-------- -----------------------------------------------------------
       2 it is well done                                                     
       3 the well is deep                                                    
       4 it is better here                                                  
       5 this is a good story                                                

  4 record(s) selected.

To retrieve only documents that match the term, but not those that match the lemmas, enclose the term in double quotes. Listing 4 shows example results of a phrase search for better.

Listing 4. Example result for phrase search for better
select pk, plaincol from simpleen where contains(plaincol, '"better"')=1

PK       PLAINCOL                                                                   
-------- -----------------------------------------------------------
       4 it is better here                                                   

  1 record(s) selected.

Enclosing the term in double-quotes excludes variations of the term and excludes synonyms and other terms with the same lemma. However, the search is case-insensitive, and some normalization applies (see Diacritics and Punctuation). Documents might be returned that do not exactly match the specified characters, as shown in Table 1.

Table 1. Table showing example for normalization behavior
A search for these words... Also returns documents containing these words
Hamlethamlet
resumerésumé
C++C#

The search expansion on lemmas is applied to the search term directly, not recursively. For example, good is a lemma for good and better, so the keyword search returns documents containing either of these terms. However, the keyword search does not include documents containing only words with other lemmas of better, thus documents containing well are not included.


Conjugation and declension

Through the identification of lemmas, inflections of nouns, pronouns, and adjectives, as well as conjugations of verbs, are recognized. Documents that contain such variations are returned for a keyword search. For example, ran and running are returned for run; and testing and tested are returned for test, as shown in Listing 5.

Listing 5. Example result for keyword search for test includes inflected terms
select pk, plaincol from simpleen where contains(plaincol, 'test')=1

PK       PLAINCOL                                                                   
-------- -----------------------------------------------------------
      16 we are testing this tomorrow                                        
      13 this cannot be tested properly                                     
      14 this is a test program                                              

  3 record(s) selected.

A phrase search is semi-exact, and it excludes documents that contain only inflected versions of the term.

Listing 6. Example result for phrase search for test excludes inflected terms
select pk, plaincol from simpleen where contains(plaincol, '"test"')=1

PK       PLAINCOL                                                                   
-------- -----------------------------------------------------------
      14 this is a test program                                              

  1 record(s) selected.

Depending on the language that is defined for index creation or the language specified for a query, a term might be expanded in the analysis. For a mixed-language document corpus, a document might be included or excluded from the search result, depending on the language used for indexing or querying. For example, if two German-language documents are included in an English-language text index, and one document contains an inflection for Regen (rain), the analysis with the English dictionary does not recognize the genitive mode of Regen, which is des Regens, while the German index shows both results. A document or portion of a document that is not in the index language is treated like a document in an unsupported language, unless a term is encountered that is part of the dictionary, in which case the term is treated according to the rules of the language set for the index. Listing 7 shows the results in the English-language index.

Listing 7. Example result for keyword search for regen in English-language index
select pk, plaincol from simpleen where contains(plaincol, 'regen')=1

PK       PLAINCOL                                                                  
-------- -----------------------------------------------------------
56 der Sturm brachte Regen mit taubeneigroßen Hagelkörnern

  1 record(s) selected.

Listing 8 shows the results in the German-language index.

Listing 8. Example result for keyword search for regen in German-language index
select pk, plaincol from simplede where contains(plaincol, 'regen')=1

PK       PLAINCOL                                                                  
-------- -----------------------------------------------------------
54 wegen des Regens wurde die Freilichtaufführung abgesagt         
56 der Sturm brachte Regen mit taubeneigroßen Hagelkörnern

  2 record(s) selected.

Clitics

Clitics are grammatically independent, but phonologically dependent, morphemes, such as the possessive s in English. The linguistic processing splits the term according to the clitic. Searching for the base form returns all inflections, including terms with possessive s, as shown in Listing 9.

Listing 9. Example result for keyword search on base form
select pk, plaincol from simpleen where contains(plaincol, 'girl')=1

PK       PLAINCOL
-------- -----------------------------------------------------------
     130 We are three siblings, all girls
     131 The girl's sweater is blue
     132 A girl gave me the book.

  3 record(s) selected.

Searching for the mode with possessive s returns only an exact match, as shown in Listing 10.

Listing 10. Example result for keyword search on term with clitics
select pk, plaincol from simpleen where contains(plaincol, 'girl''s')=1

PK       PLAINCOL
-------- ------------------------------------------------------------
     131 The girl's sweater is blue

  1 record(s) selected.

A phrase search on possessive s returns only an exact match, as shown in Listing 11.

Listing 11. Example result for phrase search on term with clitics
select pk, plaincol from simpleen where contains(plaincol, '"girl''s"')=1

PK       PLAINCOL
-------- ------------------------------------------------------------
     131 The girl's sweater is blue

  1 record(s) selected.

Note that because the clitic splits the term into fragments, the document containing "girl"s" is returned with a phrase search on the base form. Listing 12 shows an example of a phrase search on the base term.

Listing 12. Example result for phrase search on base term
select pk, plaincol from simpleen where contains(plaincol, '"girl"')=1

PK       PLAINCOL
-------- ------------------------------------------------------------
     131 The girl's sweater is blue
     132 A girl gave me the book.

  2 record(s) selected.

Filler or stop words

Certain terms are included in the text index, but they might be ignored for matching when combined with other terms. Applying such terms for a match might therefore need an explicit enforcement as a required term or need a combination with a Boolean operator. This applies to both keyword search and phrase search. Such filler-words are mostly a subset of elements of the following types:

  • Some adpositions, such as in and on
  • Coordinating conjunctions, except the sentence connectors yet and so
  • Some auxiliary verbs, such as can, may, has, am, were, and are; does not affect could, might, ought, shall, should, will, and would
  • Determiners, such as articles, demonstratives, or possessive determiners; does not affect distributive, disjunctive, or degree determiners
  • Pronouns (similar to determiners)

For example, a query for to be contains only filler terms (stop words), which are therefore used to select matches in the document, as shown in Listing 13

Listing 13. Example result for keyword search on stop words only
select pk, plaincol from simpleen where contains(plaincol, 'to be')=1 

PK 	PLAINCOL 
----- -----------------------------------------------------------
1 	to be or not to be 

1 record(s) selected.

Due to the handling of default Boolean combinations for filler terms, the matching focuses on non-filler terms if there is a combination of filler and non-filler terms in a search. For example, Listing 14 shows a query to be or not to be with the focus on the non-filler term not.

Listing 14. Example result for keyword search on mixed terms
select pk, plaincol from simpleen where contains(plaincol, 'be or not to be')=1

PK 	PLAINCOL 
----- -----------------------------------------------------------
1 	to be or not to be 
33 	my blog is not here, that's the wrong place 
10 	I can't rest here 
15 	he hasn't rested here 

4 record(s) selected.

In this case, the result set contains elements that only partially fulfill the implicit Boolean AND-ing. Note that a phrase search (enclosing the search terms in double quotes) returns the same results. To get a more selective match for such cases, use either the enforce operator (+) or an explicit Boolean operator (AND) in the keyword search. The difference between + and AND is only relevant to exclude synonyms. For example, Listing 15 enforces a be in the search result.

Listing 15. Example result for keyword search with enforce operator
select pk, plaincol from simpleen where contains(plaincol, 'to +be or not to be')=1  

PK 	PLAINCOL 
----- -----------------------------------------------------------
1 	to be or not to be 

1 record(s) selected.

Listing 16 shows the same search with an explicit AND.

Listing 16. Example result for keyword search with explicit AND operator
select pk, plaincol from simpleen where contains(plaincol, 'to AND be or not to be')=1

PK 	PLAINCOL 
----- -----------------------------------------------------------
1 	to be or not to be 

1 record(s) selected.

Note that to indicate a Boolean connection, the operator must appear in uppercase letters. The lower-case or is not treated as the Boolean OR, nor is the lowercased not treated as NOT. Listing 17 shows an example.

Listing 17. Example result for keyword search with explicit NOT operator
select pk, plaincol from simpleen where contains(plaincol, 'be or NOT to be')=1

PK       PLAINCOL
-------- ------------------------------------------------------------

  0 record(s) selected.

Diacritic characters

Diacritics have different purposes in different languages. They might or might not produce a new letter. Whether a diacritic is ignored in a keyword search depends on the language and whether a character is a new letter or not. For example, in English, a search for resume yields documents containing résumé, as shown in Listing 18.

Listing 18. Example result for keyword search related to non-distinguished diacritic characters (English)
select pk, plaincol from simpleen where contains(plaincol, 'resume')=1

PK       PLAINCOL
-------- -----------------------------------------------------------
      51 let's resume later
      50 show me your résumé

  2 record(s) selected.

Diacritics that produce new letters are distinguished. A search for schon in a German index does not return the same results as a search for schöne, while a search for cafe returns results with café, as shown in Listing 19.

Listing 19. Example result for keyword search related to non-distinguished diacritic characters (German)
select pk, plaincol from simplede where contains(plaincol, 'cafe')=1

PK       PLAINCOL
-------- -----------------------------------------------------------
      55 wir liefen stattdessen ins Café und setzten uns an ein freies Tischchen

  1 record(s) selected.

The diacritics on a, o, or u produce new letters. The search results therefore only contain entries with the matching character, as shown in Listing 20.

Listing 20. Example result for keyword search related to distinguished diacritic characters (German)
select pk, plaincol from simplede where contains(plaincol, 'schon')=1

PK       PLAINCOL
-------- -----------------------------------------------------------
     100 Ist es schon soweit?

  1 record(s) selected.


select pk, plaincol from simplede where contains(plaincol, 'schön')=1

PK          PLAINCOL
----------- -----------------------------------------------------------
        102 Das war eine schöne Feier.

  1 record(s) selected.

Spelling an umlaut as ae, oe, or ue only maps to the corresponding exact match without further expansion. The search is applied as a phrase search, so documents that contain an inflection of the search term are not returned.

Listing 21. Example result for keyword search with expanded umlaut (German)
select pk, plaincol from simplede where contains(plaincol, 'schoene')=1

PK       PLAINCOL
-------- -----------------------------------------------------------
     102 Das war eine schöne Feier.

  1 record(s) selected.

select pk, plaincol from simplede where contains(plaincol, 'schoen')=1

PK       PLAINCOL
-------- ------------------------------------------------------------

  0 record(s) selected.

Listing 22 shows that laufen and läuft return the same results, while the spelling variation laeuft is mapped to the proper spelling läuft, but not to other inflections.

Listing 22. Example result for search with umlaut
select pk, plaincol from simplede where contains(plaincol, 'laufen')=1

PK       PLAINCOL                                                                  
-------- -----------------------------------------------------------
      50 heute war ein wunderschöner Tag voll Sonnenschein, es lief gut         
      52 es läuft gut
      53 nach dem Lauf lass uns Kaffee trinken gehen
      55 wir liefen stattdessen ins Café und setzten uns an ein freies Tischchen 
	
  4 record(s) selected.
  
select pk, plaincol from simplede where contains(plaincol, 'läuft')=1

PK       PLAINCOL                                                                  
-------- -------------------------------------------------------------------
      50 heute war ein wunderschöner Tag voll Sonnenschein, es lief gut     
      52 es läuft gut    
      53 nach dem Lauf lass uns Kaffee trinken gehen
      55 wir liefen stattdessen ins Café und setzten uns an ein freies Tischchen 

  4 record(s) selected.
 
select pk, plaincol from simplede where contains(plaincol, 'laeuft')=1

PK       PLAINCOL                                                                  
-------- -----------------------------------------------------------
      52 es läuft gut
	
  1 record(s) selected.

Compound terms

The text server is currently configured to work without de-composition. You need to add a wildcard to find a document that contains the search term as part of a compound word. For example, Listing 23 shows how to find all documents that contain rain as the first part of a compound word.

Listing 23. Example for search affecting compound words
select pk, plaincol from simpleen where contains(plaincol, 'rain')=1

PK       PLAINCOL
-------- --------------------------------------------------------------------

  0 record(s) selected.


select pk, plaincol from simpleen where contains(plaincol, 'rain*')=1

PK       PLAINCOL
-------- --------------------------------------------------------------------
      41 the production today was interrupted by a rainstorm                 

  1 record(s) selected.

Language variations

The linguistic processing applies spelling normalization. A search for color returns documents with color and colour, both for keyword search and phrase search.

Listing 24. Example result for keyword search showing language variations
select pk, plaincol from simpleen where contains(plaincol, 'color')=1

PK       PLAINCOL
-------- -----------------------------------------------------------
     110 This is a nice color.
     111 I prefer this colour.

  2 record(s) selected.


select pk, plaincol from simpleen where contains(plaincol, 'colour')=1

PK       PLAINCOL
-------- -----------------------------------------------------------
     110 This is a nice color.
     111 I prefer this colour.

  2 record(s) selected.

Normalization applies to spelling variations. However, there is no automatic mapping if, for example, a different term is used in British English or American English to refer to the same object. For example, searching for lift does not return documents containing elevator. However, you can define the terms in a synonym dictionary and apply the synonym dictionary to the text index.


Abbreviations

There is no default mapping of abbreviations such as e.g., EKG, and etc. Therefore, the search for the abbreviated form returns only documents that contain the abbreviation. A search for the full term returns only documents that contain the full term. Abbreviations must be added to a synonym dictionary to be mapped to the full term for index and query processing.


Punctuation and symbols

With the current configuration of the text server, common punctuation and symbol characters are substituted for each other to find matching results. This affects the following characters:

. , ; : @ % ^ & ! + - = ( ) [ ] { } < > ~ # $ €

The following rules describe how the symbols are handled in queries.

  • The set of punctuation and symbols substitute for other members of the set, even if the search term is enclosed in double quotes, or the punctuation is escaped, as shown in Listing 25.
Listing 25. Example for substitution of punctuation and symbols
select pk, plaincol from simpleen where contains(plaincol, 'c++')=1

PK          PLAINCOL                                                                  
----------- -----------------------------------------------------------
         31 we knew C+ and Java                                                
         28 she knows C, and C++ and C#                                        
         29 he knows C++                                                       
         30 they know C++ and C#                                               

  4 record(s) selected.

select pk, plaincol from simpleen where contains(plaincol, '"c#"')=1

PK          PLAINCOL
----------- -----------------------------------------------------------
         31 we knew C+ and Java                                                
         28 she knows C, and C++ and C#                                        
         29 he knows C++                                                       
         30 they know C++ and C#                                               

  4 record(s) selected.
  • Punctuation characters split the search term into separately searchable fragments, as shown in Listing 26.
Listing 26. Example of fragmentation through punctuation
Example Data:
       114 The email-address is toro@lab.sample.com
       115 The email-address is toro.lab@sample.com
       116 The email-address is torolab@sample.com

  select pk, plaincol from simpleen where contains(plaincol, 'lab')=1

PK          PLAINCOL
----------- -----------------------------------------------------------
        114 The email-address is toro@lab.sample.com
        115 The email-address is toro.lab@sample.com

  2 record(s) selected.
  • The substitution applies only to terms where the punctuation, symbol, or whitespace is in the same position, as shown in Listing 27 and Listing 28.
Listing 27. Example of positioned substitution
select pk, plaincol from simpleen where contains(plaincol, 'toro@lab.sample.com')=1

PK          PLAINCOL
----------- -----------------------------------------------------------
        114 The email-address is toro@lab.sample.com
        115 The email-address is toro.lab@sample.com

  2 record(s) selected.
Listing 28. Example for substitution
Sample Data:
50% picked the first answer for question 4 in the multiple-choice test.
This will cost you 3.50$
It was 50-something degrees Fahrenheit.
This will take 50,000.00, at least 
Take a turn left to I-50.
about 3,50 or so
About 5000 people watched the game.

select pk, plaincol from simpleen where contains(plaincol, '50%')=1

PK          PLAINCOL
----------- -----------------------------------------------------------
        117 50% picked the first answer for question 4 in the multiple-choice test.
        118 This will cost you 3.50$
        119 It was 50-something degrees Fahrenheit.
        120 This will take 50,000.00, at least
        121 Take a turn left to I-50.
        122 about 3,50 or so

  6 record(s) selected.

In this case the % substitutes other punctuation characters and whitespace.

  • Wildcard characters cannot be used to substitute punctuation, even when escaped (for keyword and phrase search), as shown in Listing 29.
Listing 29. Example where characters don't substitute
Sample Data
        125 it happens some*times
        124 it happens some.times
        126 it happens sometimes
        127 it happens somextimes


select pk, plaincol from simpleen where contains(plaincol, 'some*times')=1

PK          PLAINCOL
----------- -----------------------------------------------------------
        126 it happens sometimes
        127 it happens somextimes

  2 record(s) selected.


select pk, plaincol from simpleen where contains(plaincol, 'some\*times')=1

PK       PLAINCOL
-------- ------------------------------------------------------------
     126 it happens sometimes
     127 it happens somextimes

  2 record(s) selected.


select pk, plaincol from simpleen where contains(plaincol, 'some.times')=1

PK       PLAINCOL
-------- -----------------------------------------------------------
     125 it happens some*times
     124 it happens some.times

2 record(s) selected.

Conclusion

This article provided a detailed look into how DB2 Text Search currently processes documents for full text search and applies linguistic attributes and query attributes to determine search results. For general information about supported query syntax, refer to the DB2 Information Center (see Resources).

Resources

Learn

Get products and technologies

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management
ArticleID=496703
ArticleTitle=DB2 Text Search, Part 2: Understanding linguistic processing for full text search in DB2
publish-date=06172010