DB2 Version 10.1 for Linux, UNIX, and Windows

Search parameters

This topic describes each of the parameters that can be used in a search argument.

Parameters

RESULT LIMIT number

A keyword specifying the maximum number of results to be returned by the full-text search.

The RESULT LIMIT should be used together with the SCORE function to ensure that the returned results are scored and only the best matching results are processed.
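
The interplay between RESULT LIMIT and scoring can be pictured with a small Python model. This is only a sketch: the function name apply_result_limit and the (document, score) pairs are hypothetical, and the real search engine applies the limit internally.

```python
import heapq

def apply_result_limit(scored_hits, result_limit):
    """Return at most result_limit hits, best score first.

    scored_hits is a hypothetical iterable of (doc_id, score) pairs.
    """
    return heapq.nlargest(result_limit, scored_hits, key=lambda hit: hit[1])

hits = [("doc1", 0.20), ("doc2", 0.90), ("doc3", 0.55), ("doc4", 0.75)]
top_two = apply_result_limit(hits, 2)
```

Without scores there is no basis for deciding which hits are the best ones, which is why the limit is of little use on its own.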

EXPANSION LIMIT number

A keyword specifying the maximum number of terms that a wildcard term can be expanded to for searching. For example, it limits the number of index terms that the search term 'a*' can be expanded to. If your index is very large and you are using many wildcard terms, you must increase the value of this keyword if you want to obtain a larger result set. The expansion order depends on the internal organization of the text index and cannot be predetermined. If your wildcard expression is too general and can be expanded into more search terms than specified by EXPANSION LIMIT, the search returns with an error indicating that the search result has been truncated because this limit was exhausted.
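
As an illustration, the following Python sketch models the behavior described above with a simple prefix wildcard. The names expand_wildcard and ExpansionLimitExceeded, and the list of index terms, are assumptions made for this example; they are not part of the product.

```python
class ExpansionLimitExceeded(Exception):
    """Hypothetical error raised when a wildcard expands to too many terms."""

def expand_wildcard(prefix, index_terms, expansion_limit):
    """Collect the index terms that start with prefix, up to expansion_limit."""
    expanded = []
    for term in index_terms:  # the real expansion order is index-internal
        if term.startswith(prefix):
            if len(expanded) == expansion_limit:
                raise ExpansionLimitExceeded(
                    f"'{prefix}*' expands to more than {expansion_limit} terms"
                )
            expanded.append(term)
    return expanded

terms = ["apple", "ant", "bee", "axe"]
within_limit = expand_wildcard("a", terms, 5)
```

With a limit of 2, the same call would raise the error, because three terms start with "a".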

STOP SEARCH AFTER number DOCUMENT | DOCUMENTS

A keyword specifying the search threshold. The search is stopped when the given number of documents is reached during the search, and an intermediate result is returned. A lower value will increase the search performance, but may lead to fewer results and omit documents with a potentially high rank.

Note that there is no default value and the number value must be a positive integer.
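
The trade-off can be sketched in Python as follows. The model assumes that documents are visited in some index order and that the search stops as soon as the threshold number of matching documents is reached; the function search_with_threshold and the sample data are hypothetical.

```python
def search_with_threshold(documents, term, stop_after):
    """documents: hypothetical list of (doc_id, text, rank) in index order."""
    if stop_after < 1:
        raise ValueError("the threshold must be a positive integer")
    found = []
    for doc_id, text, rank in documents:
        if term in text.split():
            found.append((doc_id, rank))
            if len(found) == stop_after:
                break  # threshold reached: return the intermediate result
    return sorted(found, key=lambda hit: hit[1], reverse=True)

docs = [
    ("d1", "public transport map", 10),
    ("d2", "transport of goods", 30),
    ("d3", "transport transport transport", 95),
]
partial = search_with_threshold(docs, "transport", 2)
```

In this model the highest-ranked document, d3, is never visited, which mirrors the warning that a low threshold can omit documents with a potentially high rank.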

boolean-search-expression

The search-terms and search-factors can be combined using the boolean operators NOT, AND, OR, ACCUM, and MINUS according to the syntax diagrams. The operators have the following precedence order (with the strongest first): NOT > MINUS = ACCUM = AND > OR. This can be seen in the following example, in which & stands for AND and | stands for OR:

"Pilot" MINUS "passenger" & "vehicle" | "transport" & "public"

is evaluated as:

(("Pilot" MINUS "passenger") & ("vehicle")) | ("transport" & "public")

The operator ACCUM evaluates to true if at least one of its boolean arguments evaluates to true, which is comparable to the OR operator. The rank value is computed by accumulating the rank values from both operands. ACCUM has the same binding (precedence) as AND. The operator MINUS evaluates to true if the left operand evaluates to true. Its rank value is computed by taking the rank value of the left operand and subtracting a penalty if the right operand evaluates to true.
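
The precedence rules can be checked with a small Python sketch that fully parenthesizes an expression. It is not the product's parser: it assumes that operators of equal precedence associate to the left (which agrees with the example above), and it does not handle doubled quotation marks inside literals. The names tokenize and parenthesize are hypothetical.

```python
import re

# AND (&), MINUS and ACCUM share one precedence level; OR (|) binds weakest.
MID_OPS = {"&", "AND", "MINUS", "ACCUM"}
OR_OPS = {"|", "OR"}

def tokenize(text):
    # Quoted literals, parentheses, the symbols & and |, or bare keywords.
    return re.findall(r'"[^"]*"|[()&|]|\w+', text)

def parenthesize(text):
    tokens = tokenize(text)
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def parse_or():
        left = parse_mid()
        while peek() in OR_OPS:
            op = advance()
            left = f"({left} {op} {parse_mid()})"
        return left

    def parse_mid():
        left = parse_not()
        while peek() in MID_OPS:
            op = advance()
            left = f"({left} {op} {parse_not()})"
        return left

    def parse_not():
        if peek() == "NOT":
            advance()
            return f"(NOT {parse_not()})"
        return parse_primary()

    def parse_primary():
        tok = advance()
        if tok == "(":
            inner = parse_or()
            advance()  # consume the closing parenthesis
            return inner
        return tok

    return parse_or()

grouped = parenthesize(
    '"Pilot" MINUS "passenger" & "vehicle" | "transport" & "public"'
)
```

The resulting grouping places MINUS and & inside the two operands of |, matching the evaluation order shown in the example.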

search-primary

A search-primary consisting of a text-literal-list evaluates to true if any of the text-literals is found in the document (or in the specified section of the document). A search-primary consisting of a thesaurus-invocation evaluates to true if any of the expanded text-literals is found in the document (or in the specified section of the document).

SECTION | SECTIONS section-name

A keyword specifying one or more sections in a structured document that the search is to be restricted to. The section name must be specified in a model file specified at index creation time or be expressed in XPath notation.

Section names are case sensitive. Ensure that the case of the section name in the model file and query is identical.

The model file describes the structure of documents that contain identifiable sections, so that the content of these sections can be searched individually. Section names cannot be masked using masking characters. A positive-search-factor that uses the SECTION clause evaluates to true if the search-primary is found in one of the specified sections.

Section names are not XPath expressions that are evaluated during query execution; they are fixed names that merely use XPath notation. If no model file is used, the default section names are phrased in XPath notation: the absolute path expression to the element (such as /father/child/grandchild) is used as the name for identifying the section. Full XPath expressions are not supported as section names.
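
To illustrate what absolute path names look like, the following Python sketch lists the element paths of a small XML document. It only mimics the naming scheme described above; the function default_section_names is hypothetical, and how the product derives its default names in detail (for example, for namespaces or attributes) is not covered here.

```python
import xml.etree.ElementTree as ET

def default_section_names(xml_text):
    """List the absolute element paths of an XML document, in document order."""
    root = ET.fromstring(xml_text)
    names = []

    def walk(element, parent_path):
        path = f"{parent_path}/{element.tag}"
        if path not in names:  # repeated elements share one section name
            names.append(path)
        for child in element:
            walk(child, path)

    walk(root, "")
    return names

sample = "<father><child><grandchild>text</grandchild></child></father>"
names = default_section_names(sample)
```

A query would then use one of these paths verbatim, in the same case, as the section name.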

context-argument IN SAME context-unit AS context-argument AND context-argument ...

This condition lets you search for a combination of text-literals occurring in the same paragraph or same sentence. Context arguments are always equivalent to text-literal-lists, and thesaurus expansion may be used to expand a text-literal to such a list.

The condition evaluates to true, if there is a context-unit (paragraph or sentence) in the document, which contains at least one of the text-literals of each expanded context-argument. This can be seen in the following example:

("a","b") IN SAME PARAGRAPH AS ("c","d") 
          AND THESAURUS "t1" EXPAND SYNONYM TERM OF "e".

Assuming e1, e2 are synonyms of e, the following paragraphs would match:

".. a c e .." ,  ".. a c e1..",  "a c e2..",
".. a d e .." ,  ".. a d e1..",  "a d e2..",
".. b c e .." ,  ".. b c e1..",  "b c e2..",
".. b d e .." ,  ".. b d e1..",  "b d e2..".

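The matching rule can be expressed as a short Python sketch. It assumes that a context unit has already been split into whitespace-separated words and that the thesaurus expansion of "e" has already produced the list e, e1, e2; the function name matches_context_unit is hypothetical.

```python
def matches_context_unit(unit_text, context_arguments):
    """True if the unit contains at least one literal of every argument."""
    words = set(unit_text.split())
    return all(
        any(literal in words for literal in argument)
        for argument in context_arguments
    )

arguments = [
    ("a", "b"),         # first context-argument
    ("c", "d"),         # second context-argument
    ("e", "e1", "e2"),  # "e" after synonym expansion (assumed result)
]
hit = matches_context_unit("x a c e1 y", arguments)
miss = matches_context_unit("x a c y", arguments)
```

The second paragraph does not match because it contains no literal from the third context-argument.
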
PRECISE FORM OF

A keyword that causes the word (or each word in the phrase) following PRECISE FORM OF to be searched for exactly as typed. This form of search is case-sensitive; that is, the use of upper- and lowercase letters is significant. For example, if you search for mouse, you do not find "Mouse".

This parameter requires that the index configuration parameter Respect case is set to yes. This configuration setting cannot be changed after the index has been built.

STEMMED FORM OF

A keyword that causes the word (or each word in the phrase) following STEMMED FORM OF to be reduced to its word stem before the search is carried out. This form of search is not case-sensitive. For example, if you search for mouse, you find "Mouse".

The way in which words are reduced to their stem form is language-dependent. Currently, only English stemming is supported and the word must follow regular inflection endings.

FUZZY FORM OF

A keyword for making a "fuzzy" search, which is a search for terms that have a similar spelling to the search term. This is particularly useful when searching in documents that were created by an Optical Character Recognition (OCR) program, because such documents often include misspelled words. For example, the word economy could be recognized by an OCR program as econony. Note that matches are returned only for words in a document whose first three characters match those of the search term. In the previous example, ecanomy is therefore not a match for economy. Fuzzy search cannot be used if a word in the search atom contains a masking character.

match-level

An integer between 1 and 100 specifying the degree of similarity, where 100 is more similar than 1. 100 specifies an "exact match", and 60 is already considered a very "fuzzy value". The fuzzier the match level is, the longer the elapsed search time, since more documents qualify for the search. The default match level is 70.
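
The following Python sketch shows how the three-character rule and the match level interact. The similarity measure used by the product is not documented here, so the sketch substitutes the ratio from Python's difflib as a stand-in; the function fuzzy_match is hypothetical and its scores will not correspond to the product's.

```python
from difflib import SequenceMatcher

def fuzzy_match(search_term, document_word, match_level=70):
    """Model of a fuzzy comparison; match_level is an integer from 1 to 100."""
    if not 1 <= match_level <= 100:
        raise ValueError("match level must be between 1 and 100")
    # Only words whose first three characters agree can match at all.
    if search_term[:3] != document_word[:3]:
        return False
    # Stand-in similarity (assumption): difflib ratio scaled to 0..100.
    similarity = SequenceMatcher(None, search_term, document_word).ratio() * 100
    return similarity >= match_level

ocr_variant_found = fuzzy_match("economy", "econony")
wrong_prefix_found = fuzzy_match("economy", "ecanomy")
```

Raising the match level to 100 in this model accepts only identical words, which corresponds to the "exact match" described above.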

WEIGHT number

Associates a text-literal with a weight value to change the default score. The allowed weight values are integers between 0 (the lowest score weighting) and 1000 (the highest); the default value is 100.

word-or-phrase

A word or phrase to be searched for. The characters that can be used within a word are language-dependent. It is also language-dependent whether words need to be separated by separator characters. For English and most other languages, each word in a phrase must be separated by a blank character.

To search for a character string that contains double quotation marks, type the double quotation marks twice. For example, to search for the text "wildcard" character, use:

"""wildcard"" character"

Note that in the example, it is only possible to search for one set of quotation marks. You cannot search for two quotation marks in a sequence. There is also a maximum length of 128 bytes for each word or phrase.
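
A small Python helper can apply these rules when a search string is assembled in an application. This is a sketch under stated assumptions: the function quote_literal is hypothetical, the 128-byte limit is assumed to apply to the unescaped phrase, and UTF-8 is assumed as the encoding (the actual byte length depends on the database code page).

```python
MAX_LITERAL_BYTES = 128  # maximum length of a word or phrase, from the text above

def quote_literal(phrase, encoding="utf-8"):
    """Wrap a word or phrase in double quotation marks, doubling embedded ones."""
    if '""' in phrase:
        raise ValueError("two quotation marks in a sequence cannot be searched for")
    if len(phrase.encode(encoding)) > MAX_LITERAL_BYTES:
        raise ValueError("word or phrase exceeds 128 bytes")
    return '"' + phrase.replace('"', '""') + '"'

literal = quote_literal('"wildcard" character')
```

The value of literal is the same string as in the example above.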

Masking characters

A word can contain the following masking characters:

_ (underscore): Represents any single character.
% (percent): Represents any number of arbitrary characters. If a word consists of a single %, then it represents an optional word of any length.

The following restrictions apply to masking characters:

A word cannot be composed exclusively of masking characters, except when a single % is used to represent an optional word.
If you use a masking character, you cannot use the THESAURUS keyword. Masking characters cannot be used inside thesaurus query parts. If they are used in combination, search results are unpredictable.
Masking characters cannot follow a non-alphanumeric character.
Masking characters cannot be used inside a fuzzy search, because masking always expands into a single word.

ESCAPE escape-character

A character that identifies the next character as one to be searched for and not as one to be used as a masking character. For example, if an escape-character is $, then $%, $_, and $$ represent %, _, and $. Any % and _ characters not preceded by $ represent masking characters.

During search, you are only allowed to use single-byte escape characters. No double-byte characters are allowed.
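
The meaning of the masking and escape characters can be modeled by translating a masked word into a regular expression. The Python sketch below does this for a single word; the functions mask_to_regex and mask_matches are hypothetical, and the sketch does not enforce the restrictions on masking characters listed above.

```python
import re

def mask_to_regex(word, escape=None):
    """Translate a masked word into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(word):
        ch = word[i]
        if escape is not None and ch == escape:
            if i + 1 >= len(word):
                raise ValueError("escape character at end of word")
            parts.append(re.escape(word[i + 1]))  # next character is literal
            i += 2
            continue
        if ch == "_":
            parts.append(".")    # any single character
        elif ch == "%":
            parts.append(".*")   # any number of arbitrary characters
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.DOTALL)

def mask_matches(word, candidate, escape=None):
    """True if the whole candidate word matches the masked word."""
    return mask_to_regex(word, escape).fullmatch(candidate) is not None

prefix_hit = mask_matches("wild%", "wildcard")
literal_percent = mask_matches("100$%", "100%", escape="$")
```

With $ as the escape character, 100$% matches only the text 100%, whereas the unescaped 100% would also match 1000.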

THESAURUS thesaurus-name

A keyword used to specify the name of the thesaurus to be used to expand a text-literal. The thesaurus name is the file name (without its extension) of a thesaurus that has been compiled using the thesaurus compiler. It must be located in <os-dependent>/sqllib/db2ext/thes. Alternatively, the full path can be specified preceding the file name.

EXPAND relation

Specifies which relation is used to expand the text-literal using the thesaurus. The thesaurus has predefined relations described in the DB2EXTTH command. These are referred to using the following keywords:

SYNONYM, a symmetrical relationship expressing equivalence.
RELATED, a symmetrical relationship expressing association.
BROADER, a directed hierarchical relationship that can be followed by specified depth levels.
NARROWER, a directed hierarchical relationship that can be followed by specified depth levels.

For user-defined relations, use RELATION(number), which corresponds to the relation definition in DB2EXTTH.

TERM OF text-literal

The text-literal, to which other search terms are to be added from the thesaurus.

count LEVELS

A keyword used to specify the number of levels (the depth) of terms in the thesaurus that are to be used to expand the search term for the given relation. If you do not specify this keyword, a count of 1 is assumed. The value of count must be a positive integer.
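
The effect of the level count can be sketched in Python as a breadth-first walk over a relation. The dictionary below is a made-up stand-in for a compiled thesaurus, and the function expand_term is hypothetical; it only illustrates how each additional level adds the terms one step further away.

```python
def expand_term(term, relation, levels=1):
    """Return the term plus all terms reachable within the given depth."""
    if levels < 1:
        raise ValueError("count must be a positive integer")
    expanded = {term}
    frontier = {term}
    for _ in range(levels):
        next_frontier = set()
        for current in frontier:
            for related in relation.get(current, ()):
                if related not in expanded:
                    expanded.add(related)
                    next_frontier.add(related)
        frontier = next_frontier
    return expanded

# Hypothetical NARROWER relation: each key maps to its narrower terms.
narrower = {"vehicle": ["car", "truck"], "car": ["sedan"]}
one_level = expand_term("vehicle", narrower)
two_levels = expand_term("vehicle", narrower, levels=2)
```

With the default of one level, only the direct narrower terms are added; the second level also adds the narrower terms of those terms.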

ATTRIBUTE attribute-name

Searches for documents that have attributes matching the specified condition. The attribute-name refers to the name of an attribute expression in the CREATE INDEX command, or to an attribute definition in the document model file.

The attribute-factor is allowed for attributes of type double only. The precision of the value is guaranteed for 15 digits. Numbers that consist of 16 digits or more are rounded. Masking characters are not allowed in attribute-name, valueFrom, and valueTo. The comparison forms are evaluated as follows:

BETWEEN valueFrom AND valueTo: A BETWEEN attribute factor evaluates to true if the value of the attribute is greater than (not equal to) valueFrom and smaller than (not equal to) valueTo.
>valueFrom: A ">" attribute factor evaluates to true if the value of the attribute is greater than (not equal to) valueFrom.
<valueTo: A "<" attribute factor evaluates to true if the value of the attribute is lower than (not equal to) valueTo.

If the attribute name in the CREATE INDEX command is specified with quotation marks, or is defined in a model file, the attribute name in the query must match it exactly. If no quotation marks are specified in the CREATE INDEX command, the attribute name in the query must be in uppercase.
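
Two points about attribute values are easy to overlook: both bounds of BETWEEN are exclusive, and long numbers lose precision when stored as double. The Python sketch below illustrates both, using a Python float as a stand-in for the double attribute type; the function between is hypothetical.

```python
def between(value, value_from, value_to):
    """BETWEEN as described above: both bounds are excluded."""
    return value_from < value < value_to

inside = between(3.0, 1.0, 5.0)
on_upper_bound = between(5.0, 1.0, 5.0)

# Two different 16-digit integers that map to the same double value.
sixteen_digits_collide = float(9007199254740993) == float(9007199254740992)
```

Because of the rounding, attribute values with 16 or more digits cannot be reliably distinguished in a comparison.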

IS ABOUT language word-or-phrase

An option that lets you specify a free-text search argument. Using IS ABOUT, you can search for any (but not necessarily all) of the words that you specify in word-or-phrase in any order in a document. The closer together the terms used in word-or-phrase are and the more terms that are included in a document, the higher the returned score for the document.

The parameter language is optional and must be set only for Thai (TH_TH) where it is required for tokenization purposes, and for Turkish (TR_TR), where it is required for proper case mapping.

Note that IS ABOUT is useful only if document score values are requested and the search results are ordered by score values.