XML search

You can index and search XML documents. The XML search grammar uses a subset of the W3C XPath language with extensions for text search. The extensions support range searches of numeric, Date, and DateTime values that are associated with an XML attribute or element. Structural elements can be used separately, or combined with free text in queries.

Documents must be indexed to include the XML markup before the index can be searched using the xmlxp query syntax. Document indexing is done by using the “FORMAT XML” option at index creation time.

Indexes created on a previous release can be used to perform searches. However, documents indexed on a previous release do not have the information necessary to use all the XML search capabilities available in a newer release. Documents added or updated in the text search index after the upgrade to the new release include the additional information.

An upgrade might result in documents indexed on the prior release not being included in some search results. The SYSPROC.SYSTS_REPRIMEINDEX stored procedure can be used to rebuild the index and resolve this problem.

To use the OMNIFIND CONTAINS and SCORE built-in functions to search XML data, the query string must start with the @xmlxp: query prefix. The prefix is followed by a valid XML Search query expression. The @xmlxp 'opaque' term prefix indicates that a search is performed using the query path expression.

For example: CONTAINS(columnname, '@xmlxp:''query_expression'' ').

The single quotes (') surrounding the query_expression must be doubled because they are contained within an SQL string; in effect, it is a string within a string.
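
The quote-doubling rule above can be applied mechanically when the search argument is assembled in application code. The following Python sketch is illustrative only: the helper names and the column name MYCOLUMN are hypothetical, and it assumes that the query expression itself contains no single quotes.

```python
def xmlxp_search_argument(query_expression: str) -> str:
    """Return the SQL string literal for an @xmlxp search argument."""
    # The query expression is quoted inside the search string ...
    inner = "@xmlxp:'" + query_expression + "'"
    # ... and every single quote inside an SQL string literal is doubled.
    return "'" + inner.replace("'", "''") + "'"


def contains_call(column: str, query_expression: str) -> str:
    """Return the text of a CONTAINS call for the given column."""
    return f"CONTAINS({column}, {xmlxp_search_argument(query_expression)})"


print(contains_call("MYCOLUMN", "/Book[Sentences]"))
```

For the expression /Book[Sentences], the helper produces CONTAINS(MYCOLUMN, '@xmlxp:''/Book[Sentences]''').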

The @xpath: opaque term prefix that was used in previous releases of OmniFind Text Search Server for DB2® for i is supported for compatibility with earlier versions. However, it has been deprecated and is not recommended.

The following list highlights the key features of XML search:

XML structural search

By including special opaque XML terms in queries, you can search XML documents for structural elements and text that is scoped by those elements. Structural elements are tag names, attribute names, and attribute values. Element and tag names are case sensitive.

XML query tokenization

Tokenization is the process of parsing input into tokens. Free text in XML query terms is tokenized the same way that text in non-XML query terms is tokenized. An exception is that nested opaque terms are not supported. Free text search is not case sensitive.

XML Schema and DTD

Any XML schema associated with the XML document is not downloaded, and default values are not indexed.

Numeric values

Predicates that compare attribute or element values to numbers are supported.

Element values

Predicates that compare element values to numbers or dates are supported. The element containing the date or number must be an XML element that contains only the number or date. Leading and trailing white space are ignored.

String values

Use of the = operator for a string argument in a predicate requires a complete match of all key words in the string with tokens in the identified text span. The order of the tokens is not significant when matching is performed.

DateTime values

Predicates that compare Date or DateTime attributes or elements are supported.

Path expressions:

Table 1. Path expressions
`@xmlxp` Expression	Description
`TagName`	Selects a tag named `TagName`, and all children of that tag.
`@AttributeName`	Selects an attribute named `AttributeName`.
`/`	Selects from root node.
`//`	Selects matching tags and attributes that are descendants of the current position and match the expression.
`.`	Self: the current tag or element node.

Table 2. Path expression examples:
`@xmlxp` Expression	Result
`/Document`	Returns all documents with a top-level tag `Document`.
`//Document`	Returns all documents with a tag `Document` at any level.
`/Document/Child1`	Returns all documents with a top-level tag `Document` that has a direct child tag `Child1`.
`/Document//Child1`	Returns all documents with a top-level tag `Document` that has a descendant tag `Child1` at any level.
`/Root/@attr1`	Returns all documents with a top-level tag `Root` with an attribute `attr1`.
`/Root//@attr1`	Returns all documents with a top-level tag `Root` with an attribute `attr1` on that root tag or any descendant tag.
`//@attr1`	Returns all documents that have an attribute `attr1` at any level.

Note: The XML Search expression must have an actual tag or attribute name in the relative path expression. / and // by themselves are not valid search queries.

Path expressions are only allowed in the forward direction, and only on a single axis.

It is recommended that a path expression start with a leading / or //. This indicates that the expression's initial context is the document's root node. When the leading / or // is omitted, the expression is matched at any level. In other words, 'Sentences' is treated as '//Sentences'. The behavior is defined this way to be compatible with prior releases, and does not follow the W3C or SQL/XML standard.
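
To avoid relying on the compatibility behavior, an application can make the implicit any-level match explicit before it submits the query. The following Python sketch is one possible way to do that; the function name is hypothetical.

```python
def explicit_path(path: str) -> str:
    """Prefix '//' when a path expression has no leading / or //."""
    path = path.strip()
    if path.startswith("/"):
        # Already starts at the root node with / or //.
        return path
    # An unprefixed expression is matched at any level, so say so explicitly.
    return "//" + path


print(explicit_path("Sentences"))
print(explicit_path("/Book/Title"))
```

The first call returns //Sentences, which is how the unprefixed expression is treated; the second call returns its argument unchanged.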

Path expression wildcard support

In the path expression, the special wildcard character * can be used to indicate exactly one tag, with any name.

Trailing path expression wildcards are ignored.

The following uses of path expression wildcards are not supported and result in an error:

An expression that references only wildcards and no specific elements or attributes.
A wildcard attribute at any level: /Tag/@*.
A wildcard that immediately precedes a predicate expression: /Root/*[//anytag].
A wildcard that is used in a predicate comparison: /Root[* > 5].
A wildcard as an XML namespace prefix: //*:tagname.
A wildcard prefixed with an XML namespace prefix: //ns:*.
A wildcard character used as part of a tag name: /start*.
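
The unsupported cases listed above can be caught before a query is sent to the server. The following Python sketch is a heuristic pre-check only, not the parser that the Text Search server uses: the function name and the reason strings are hypothetical, and the regular expressions assume that the expression contains no string literals that include an asterisk.

```python
import re

# (pattern, reason) pairs for the unsupported wildcard uses listed above.
_CHECKS = [
    (r"@\*", "wildcard attribute"),
    (r"\*\s*\[", "wildcard immediately before a predicate"),
    (r"\*:", "wildcard as a namespace prefix"),
    (r":\*", "wildcard after a namespace prefix"),
    (r"[\w.-]\*|\*[\w.-]", "wildcard as part of a tag name"),
    (r"\[\s*\*\s*(=|!=|<|>)", "wildcard in a predicate comparison"),
]


def unsupported_wildcard_use(expression: str):
    """Return a reason if the expression uses a wildcard in an unsupported way, else None."""
    for pattern, reason in _CHECKS:
        if re.search(pattern, expression):
            return reason
    # A path in which every step is * names no specific element or attribute.
    steps = [step for step in re.split(r"/+", expression) if step]
    if steps and all(step == "*" for step in steps):
        return "only wildcards"
    return None


for candidate in ["/Root/*/T1", "/Tag/@*", "/start*", "/*/*"]:
    print(candidate, "->", unsupported_wildcard_use(candidate))
```

A result of None means only that none of these patterns was found; the server remains the authority on whether a query is valid.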

Table 3. Path expression wildcard examples:
`@xmlxp` Expression	Result
`/Root/*/T1`	All documents having a top-level tag `Root` that has a descendant tag `T1` with one intermediate level.
`/Root/*//T1`	All documents having a top-level tag `Root` that has a descendant tag `T1` with one or more intermediate levels.

Predicates

Predicates are used to specify a value or condition that an element or attribute node must satisfy. Predicates are always enclosed in square brackets: [].

Table 4. Predicate examples:
`@xmlxp` Expression	Result
`/Book[Sentences]`	Top-level tag is `Book` and must have a direct child `Sentences`.
`/Book[.//Sentences and .//Author]`	Top-level tag is `Book` and must have both `Sentences` and `Author` descendants. Each descendant can be at any level below `Book`.

Because path expressions are always in the forward direction, and limited to a single axis, path expressions in predicates must be relative to the current node. /Book[/Root] and /Book[//Root] are not valid, because in both cases the predicate path expression begins with the top-level tag 'Root' instead of the current node.

Numeric comparisons

OMNIFIND supports the =, <=, >=, >, <, and != operators for comparisons of elements and attributes to integers and floating point values.

Elements have their numeric values indexed only if they are simple elements. They must not contain additional characters (other than white space) and must not have any descendant elements. Complex elements are indexed as text only.

Table 5. Numeric comparison examples:
`@xmlxp` Expression	Result
`/Book[@id_num = 12345]`	Top-level tag is `Book` and must have an attribute `id_num` with a value of `12345`.
`/Book[Cost <= 100.50]`	Top-level tag is `Book`. `Book` has a direct child element `Cost` with a numeric value less than or equal to `100.50`.
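
The distinction between simple and complex elements can be illustrated outside the server. The following Python sketch uses the standard xml.etree.ElementTree module to decide whether an element would qualify as a simple numeric element under the rule described above. The function name and the sample document are hypothetical, and Python's float() stands in for the server's number recognition, which might accept a different set of formats.

```python
import xml.etree.ElementTree as ET


def numeric_value(element):
    """Return the numeric value of a simple numeric element, else None."""
    if len(element) > 0:
        # The element has child elements, so it is complex.
        return None
    # Leading and trailing white space is ignored.
    text = (element.text or "").strip()
    try:
        return float(text)
    except ValueError:
        # Additional characters are present, so the element is treated as text.
        return None


book = ET.fromstring(
    "<Book><Cost> 100.50 </Cost><Price>100 USD</Price>"
    "<Total><Net>5</Net></Total></Book>"
)
for tag in ("Cost", "Price", "Total"):
    print(tag, numeric_value(book.find(tag)))
```

In this sample, only Cost qualifies: Price contains additional characters, and Total has a descendant element.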

Date and DateTime comparisons

OMNIFIND supports the =, <=, >=, >, <, and != operators for comparisons of elements and attributes to Date and DateTime values.

Elements have their DateTime values indexed only if they are simple elements. These elements must not contain additional characters (other than white space) and must not have any descendant elements. Complex elements are indexed as text only.

During indexing, attribute values and text contained within simple XML tags are examined. If the text is determined to match an ISO Date or DateTime format, it is indexed as a Date or DateTime that can be searched in a predicate.

During a search, the Date or DateTime value must be enclosed within an xs:date() or xs:dateTime() function call in order to be recognized as the correct data type.

An XML DateTime data type in an XML document can specify a timezone value. However, when a DateTime is indexed, the Text Search server truncates timezone values during indexing. Therefore, timezones are not considered during XML searches that involve Date or DateTime data types.

In addition, a DateTime with an hour of 24 is permitted only if the minutes and seconds are zero. It will be treated as a value between the last instant of that day and the first instant of the next day.

When a Date or DateTime value is specified in an XML search predicate, a syntax error occurs if a time zone is specified on the value.

The DateTime data type supports up to 12 digits of fractional seconds.

Table 6. Date and DateTime comparison examples:
`@xmlxp` Expression	Result
`/Book[@publishDate > xs:date("2000-01-01")]`	Top-level tag is `Book`. `Book` has an attribute `publishDate` that is greater than the date of 2000-01-01.
`/Book[purchaseTime > xs:dateTime("2009-05-20T13:00:00")]`	Top-level tag is `Book`. `Book` has a direct child `purchaseTime` that is a DateTime expression greater than 2009-05-20T13:00:00.000000.
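
When predicate values come from application data, the xs:date() or xs:dateTime() wrapper can be generated rather than typed by hand. The following Python sketch shows one way to do that; the function name is hypothetical. It rejects values that carry a time zone, because a time zone on a predicate value results in a syntax error.

```python
from datetime import date, datetime


def date_literal(value) -> str:
    """Format a date or datetime as an xs:date() or xs:dateTime() call."""
    # Check datetime first, because datetime is a subclass of date.
    if isinstance(value, datetime):
        if value.tzinfo is not None:
            raise ValueError("time zones are not allowed in search predicates")
        return f'xs:dateTime("{value.isoformat()}")'
    if isinstance(value, date):
        return f'xs:date("{value.isoformat()}")'
    raise TypeError("expected a date or datetime value")


print(f"/Book[@publishDate > {date_literal(date(2000, 1, 1))}]")
print(f"/Book[purchaseTime > {date_literal(datetime(2009, 5, 20, 13, 0, 0))}]")
```

The two print statements reproduce the expressions shown in Table 6.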

Contains and excludes in XML markup

The contains and excludes functions are used to perform full text searches within the XML markup. Contains returns true if the query is contained within the target node; excludes returns true if the query is NOT contained within the target node.

For example, find all documents with a top-level tag called email, and a direct descendant called body that contains variations of the phrase “Department budget”.

@xmlxp:''/email[body contains("department budget")]''

The free text passed to the contains or excludes function is handled in the same way as any other free text search. The search is not case-sensitive, and linguistic variations are considered. The earlier query matches “departments budgets” and also “budget for the department”.

The search can be restricted to an exact match by using the traditional quotation marks, for example, @xmlxp:''/email[body contains("""department budget""")] ''. The quotes indicating an exact match must be doubled so that they are not interpreted as the end of the contains free text string.
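
The doubling of the exact-match quotation marks can also be generated in code. The following Python sketch builds a contains or excludes clause with or without exact matching; the function name and its parameters are hypothetical.

```python
def text_clause(node: str, text: str, exact: bool = False,
                function: str = "contains") -> str:
    """Build a 'node contains("...")' or 'node excludes("...")' clause."""
    if function not in ("contains", "excludes"):
        raise ValueError("function must be 'contains' or 'excludes'")
    if exact:
        # The exact-match quotes are doubled inside the free text string.
        text = '""' + text + '""'
    return f'{node} {function}("{text}")'


print("/email[" + text_clause("body", "department budget") + "]")
print("/email[" + text_clause("body", "department budget", exact=True) + "]")
```

The second call yields /email[body contains("""department budget""")], which matches the exact-match form shown above. The surrounding single quotes of the @xmlxp search argument still have to be doubled separately when the expression is placed in an SQL string.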

Table 7. Contains and excludes examples:
`@xmlxp` Expression	Result
`/Book[abstract contains("cat AND dog")]`	Top-level tag `Book` that has a child tag `abstract` which contains linguistic variations of the terms `cat` and `dog`.
`/Book/@title[. contains("cat OR dog")]`	Top-level tag `Book` has an attribute `title` that contains linguistic variations of either `cat` or `dog`.
`/Book/Title[. contains("""All good dogs go to heaven""")]`	Top-level tag `Book` with a direct child `Title` that contains `all good dogs go to heaven` in order, and without linguistic variations being considered.
`/Book[abstract excludes("cat AND dog")]`	Top-level tag `Book` that has a child tag `abstract` which does not contain linguistic variations of the terms `cat` and `dog`.

Complete string match operator

The = operator with a string argument in a predicate calls for a complete match of all tokens in the string with all tokens in the identified text span. Linguistic equivalents are not considered. The order of the terms searched for is not significant. It is not required that the element or attribute contain only the text that was searched for.

Table 8. Complete string match operator examples:
`@xmlxp` Expression	Result
`/Book[@author = "Nicholas Lawrence"]`	Top-level tag `Book` that has an attribute `author`. `author` must contain the terms `Nicholas Lawrence`. Linguistic variations on those terms are not considered matches.
`/Book[author = """Nicholas Lawrence"""]`	Top-level tag `Book` that has a direct descendant `author`. `author` must contain the terms `Nicholas Lawrence` in order. Linguistic variations on those terms are not considered matches.
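
The matching rule for the = operator can be modeled as a subset test on tokens. The following Python sketch is a simplified model for illustration only: it assumes that tokens are separated by white space, which is not how the Text Search server tokenizes text, and it does not address case handling or the in-order form that uses doubled quotation marks.

```python
def complete_match(query: str, span: str) -> bool:
    """Return True if every query token occurs in the span, in any order."""
    # Order is not significant, and the span may contain additional tokens.
    return set(query.split()) <= set(span.split())


print(complete_match("Nicholas Lawrence", "Lawrence Nicholas Jr."))
print(complete_match("Nicholas Lawrence", "Nick Lawrence"))
```

The first call succeeds because both terms are present, even though they are reversed and the span holds an extra token. The second call fails because `Nick` is a variation of `Nicholas`, and linguistic equivalents are not considered.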

Logical operators

The logical operators `and` and `or` can be used in predicates.

Table 9. Logical operator examples:
`@xmlxp` Expression	Result
`/Book[@author = """Nicholas Lawrence"""]/Price[. < 1000 and @unit = "dollars"]`	Top-level tag `Book` that has an attribute `author`. `author` must contain the terms `Nicholas Lawrence` in order. Linguistic variations on those terms are not considered matches. `Book` must have a direct child `Price` with a value `< 1000`. The `Price` node must have an attribute `unit` that has a value of `dollars`.

Operator precedence

In XML search predicates, containment operators and comparison operators take precedence over logical operators, and all logical operators have the same precedence.

Containment operators are `contains` and `excludes`.
Comparison operators are =, !=, <, >, <= and >=.
Logical operators are `and` and `or`.

You can use parentheses to ensure the precedence that you want.
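
Because `and` and `or` have the same precedence, a predicate that mixes them is easier to read and less error-prone when the intended grouping is written out. The following Python sketch shows a small helper that parenthesizes a subexpression before it is combined with another condition. The helper name and the `euros` unit value are hypothetical and are used for illustration only.

```python
def group(expression: str) -> str:
    """Wrap a subexpression in parentheses to make its grouping explicit."""
    return f"({expression})"


# Either unit is acceptable, and the price must be below 1000.
units = group('@unit = "dollars" or @unit = "euros"')
predicate = f"{units} and . < 1000"
print(f"/Book/Price[{predicate}]")
```

The resulting expression is /Book/Price[(@unit = "dollars" or @unit = "euros") and . < 1000].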