XML search

You can index and search XML documents. The XML search grammar uses a subset of the W3 XPath language with extensions for text search. The extensions support range searches of numeric, Date, and DateTime values that are associated with an XML attribute or element. Structural elements can be used separately, or combined with free text in queries.

Documents must be indexed to include the XML markup before the index can be searched using the xmlxp query syntax. Document indexing is done by using the “FORMAT XML” option at index creation time.

Indexes created on a previous release can be used to perform searches. However, documents indexed on a previous release do not have the information necessary to use all the XML search capabilities available in a newer release. Documents added or updated in the text search index after the upgrade to the new release include the additional information.

An upgrade might result in documents indexed on the prior release not being included in some search results. The SYSPROC.SYSTS_REPRIMEINDEX stored procedure can be used to rebuild the index and resolve this problem.

To use the OMNIFIND CONTAINS and SCORE built-in functions to search XML data, the query string must start with the @xmlxp: query prefix. The prefix is followed by a valid XML Search query expression. The @xmlxp 'opaque' term prefix indicates that a search is performed using the query path expression.

For example: CONTAINS(columnname, '@xmlxp:''query_expression'' ').

The single quotes (') surrounding query_expression must be doubled because they are contained within an SQL string. In effect, the query expression is a string within a string.
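
The quoting rule can be sketched as a small Python helper that builds the search argument for CONTAINS. This is only an illustration: the helper name xmlxp_search_argument, the table books, and the columns title and doc are hypothetical, and rejecting embedded single quotes is a conservative assumption rather than documented behavior. The trailing space before the closing quote mirrors the example above.

```python
def xmlxp_search_argument(query_expression: str) -> str:
    """Return an SQL string literal that holds an @xmlxp search argument.

    The quotes around the expression are doubled because the expression
    is a string inside an SQL string.
    """
    if "'" in query_expression:
        # Assumption: refuse rather than guess how a nested single quote
        # would have to be escaped.
        raise ValueError("single quotes are not handled by this sketch")
    return "'@xmlxp:''" + query_expression + "'' '"


# Hypothetical table and column names, for illustration only.
sql = (
    "SELECT title FROM books WHERE CONTAINS(doc, "
    + xmlxp_search_argument("/Book[Cost <= 100.50]")
    + ") = 1"
)
print(sql)
```

Building the argument in one place keeps the doubled quotes out of the XML search expression itself, which then reads exactly as it appears in the tables in this topic.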

The @xpath: opaque term prefix that was used in previous releases of OmniFind Text Search Server for DB2® for i is supported for compatibility with earlier versions. However, it has been deprecated and is not recommended.

The following list highlights the key features of XML search:

XML structural search

By including special opaque XML terms in queries, you can search XML documents for structural elements and text that is scoped by those elements. Structural elements are tag names, attribute names, and attribute values. Element and tag names are case sensitive.

XML query tokenization

Tokenization is the process of parsing input into tokens. Free text in XML query terms is tokenized the same way that text in non-XML query terms is tokenized. An exception is that nested opaque terms are not supported. Free text search is not case sensitive.

XML Schema and DTD

Any XML schema associated with the XML document is not downloaded, and default values are not indexed.

Numeric values

Predicates that compare attribute or element values to numbers are supported.

Element values

Predicates that compare element values to numbers or dates are supported. The element must contain only the number or date. Leading and trailing white space is ignored.

String values

Use of the = operator for a string argument in a predicate requires a complete match of all key words in the string with tokens in the identified text span. The order of the tokens is not significant when matching is performed.

DateTime values

Predicates that compare Date or DateTime attributes or elements are supported.

Path expressions:

Table 1. Path expressions
@xmlxp Expression Description
TagName Selects a tag named TagName, and all children of that tag.
@AttributeName Selects an attribute named AttributeName.
/ Selects from root node.
// Selects matching tags and attributes that are descendants of the current position and match the expression.
. Self: the current tag or element node.
Table 2. Path expression examples:
@xmlxp Expression Result
/Document Returns all documents with a top-level tag Document.
//Document Returns all documents with a tag Document at any level.
/Document/Child1 Returns all documents with a top-level tag Document that has a direct child tag Child1.
/Document//Child1 Returns all documents with a top-level tag Document that has a descendant tag Child1 at any level.
/Root/@attr1 Returns all documents with a top-level tag Root with an attribute attr1.
/Root//@attr1 Returns all documents with a top-level tag Root with an attribute attr1 on that root tag or any descendant tag.
//@attr1 Returns all documents that have an attribute @attr1 at any level.
Note: The XML Search expression must have an actual tag or attribute name in the relative path expression. / and // by themselves are not valid search queries.

Path expressions are only allowed in the forward direction, and only on a single axis.

It is recommended that a path expression start with a leading / or //. This indicates that the expression's initial context is the document's root node. When the leading / or // is omitted, the expression is matched at any level. In other words, 'Sentences' is treated as '//Sentences'. The behavior is defined this way to be compatible with prior releases, and does not follow the W3 or SQL/XML standard.
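
The treatment of a missing leading / or // can be pictured with a short Python sketch. The function name effective_path is hypothetical, and the code only restates the rule described above; it is not how the text search server is implemented.

```python
def effective_path(expression: str) -> str:
    """Show how a path expression is interpreted with respect to its start."""
    if expression.startswith("/"):
        # Covers both / and //: the initial context is the root node.
        return expression
    # No leading / or //: the expression is matched at any level.
    return "//" + expression


for example in ("Sentences", "/Book", "//Book"):
    print(example, "is treated as", effective_path(example))
```

Writing the leading / or // explicitly avoids relying on this compatibility behavior.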

Path expression wildcard support

In the path expression, the special wildcard character * can be used to indicate exactly one tag, with any name.

Trailing path expression wildcards are ignored.

The following uses of path expression wildcards are not supported and result in an error:
  • An expression that references only wildcards and no specific elements or attributes.
  • A wildcard attribute at any level: /Tag/@*.
  • A wildcard that immediately precedes a predicate expression: /Root/*[//anytag].
  • A wildcard that is used in a predicate comparison: /Root[* > 5].
  • A wildcard as an XML namespace prefix: //*:tagname.
  • A wildcard prefixed with an XML namespace prefix: //ns:*.
  • A wildcard character used as part of a tag name: /start*.
Table 3. Path expression wildcard examples:
@xmlxp Expression Result
/Root/*/T1 All documents having a top-level tag Root that has a descendant tag T1 with one intermediate level.
/Root/*//T1 All documents having a top-level tag Root that has a descendant tag T1 with one or more intermediate levels.
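
The unsupported wildcard uses listed above can be approximated with a pattern check before a query is submitted. The following Python sketch is an assumption-laden illustration: the function name, the regular expressions, and the reason strings are all hypothetical, and the check is far looser than a real parser. It only shows one way to catch the listed cases early.

```python
import re

# One pattern per unsupported use in the list above (approximate).
_UNSUPPORTED = [
    (re.compile(r"@\*"), "wildcard attribute"),
    (re.compile(r"\*\s*\["), "wildcard immediately before a predicate"),
    (re.compile(r"\[\s*\*\s*(?:=|!=|<|>)"), "wildcard in a predicate comparison"),
    (re.compile(r"\*:"), "wildcard as a namespace prefix"),
    (re.compile(r":\*"), "wildcard after a namespace prefix"),
    (re.compile(r"\w\*|\*\w"), "wildcard as part of a tag name"),
]


def unsupported_wildcard_use(expression: str):
    """Return a reason if a wildcard is used in an unsupported way, else None."""
    # Blank out quoted free text so that only the path syntax is checked.
    path = re.sub(r'"[^"]*"', '""', expression)
    if "*" not in path:
        return None
    if not re.search(r"[A-Za-z_]", path):
        return "only wildcards, no specific element or attribute"
    for pattern, reason in _UNSUPPORTED:
        if pattern.search(path):
            return reason
    return None


for example in ("/Root/*/T1", "/Tag/@*", "/Root[* > 5]", "/start*"):
    print(example, "->", unsupported_wildcard_use(example))
```

A check of this kind cannot replace the error returned by the search itself; it is useful mainly for producing a clearer message in an application.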

Predicates

Predicates are used to specify a value or condition that an element or attribute node must satisfy. Predicates are always enclosed in square brackets: [].

Table 4. Predicate examples:
@xmlxp Expression Result
/Book[Sentences] Top-level tag is Book and must have a direct child Sentences.
/Book[.//Sentences and .//Author] Top-level tag is Book and must have both Sentences and Author descendants. Each descendant can be at any level below Book.

Because path expressions are always in the forward direction, and limited to a single axis, path expressions in predicates must be relative to the current node. /Book[/Root] and /Book[//Root] are not valid, because in both cases the predicate path expression begins with the top-level tag 'Root' instead of the current node.

Numeric comparisons

OMNIFIND supports the =, <=, >=, >, <, and != operators for comparisons of elements and attributes to integers and floating point values.

The numeric value of an element is indexed only if the element is a simple element. A simple element must not contain additional characters (other than white space) and must not have any descendant elements. Complex elements are indexed as text only.

Table 5. Numeric comparison examples:
@xmlxp Expression Result
/Book[@id_num = 12345] Top-level tag is Book and must have an attribute id_num with a value of 12345.
/Book[Cost <= 100.50] Top-level tag is Book. Book has a direct child element Cost with a numeric value less than or equal to 100.50.
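
The distinction between simple and complex elements can be illustrated with Python's standard xml.etree.ElementTree module. This is a sketch of the rule as described above, not the indexer's logic: the function name numeric_value and the sample document are hypothetical, and Python's float() is used as a stand-in for number recognition, so it accepts some forms (for example inf) that the server might not.

```python
import xml.etree.ElementTree as ET


def numeric_value(element: ET.Element):
    """Return the number held by a simple numeric element, else None."""
    if len(element) > 0:
        return None  # has descendant elements, so it is a complex element
    text = (element.text or "").strip()  # white space around the value is ignored
    try:
        return float(text)
    except ValueError:
        return None  # extra characters: treated as text only


book = ET.fromstring(
    "<Book><Cost> 100.50 </Cost><Note>about 20 pages</Note></Book>"
)
cost = numeric_value(book.find("Cost"))
print("Cost is numeric:", cost is not None)
print("Note is numeric:", numeric_value(book.find("Note")) is not None)
# The comparison made by /Book[Cost <= 100.50] for this document:
print("Cost <= 100.50:", cost is not None and cost <= 100.50)
```

In this sample, only Cost could take part in a numeric predicate; Note contains additional characters and would be searchable as text only.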

Date and DateTime comparisons

OMNIFIND supports the =, <=, >=, >, <, and != operators for comparisons of elements and attributes to Date and DateTime values.

The Date or DateTime value of an element is indexed only if the element is a simple element. A simple element must not contain additional characters (other than white space) and must not have any descendant elements. Complex elements are indexed as text only.

During indexing, attribute values and text contained within simple XML tags are examined. If the text is determined to match an ISO Date or DateTime format, it is indexed as a Date or DateTime that can be searched in a predicate.

During a search, the Date or DateTime value must be enclosed within an xs:date() or xs:dateTime() function call in order to be recognized as the correct data type.

An XML DateTime value in an XML document can specify a time zone. However, the Text Search server truncates time zone values during indexing. Therefore, time zones are not considered during XML searches that involve Date or DateTime data types.

In addition, a DateTime with an hour of 24 is permitted only if the minutes and seconds are zero. It will be treated as a value between the last instant of that day and the first instant of the next day.

When a Date or DateTime value is specified in an XML search predicate, a syntax error occurs if a time zone is specified on the value.

The DateTime data type supports up to 12 digits of fractional seconds.

Table 6. Date and DateTime comparison examples:
@xmlxp Expression Result
/Book[@publishDate > xs:date("2000-01-01")] Top-level tag is Book. Book has an attribute publishDate that is greater than the date of 2000-01-01.
/Book[purchaseTime > xs:dateTime("2009-05-20T13:00:00")] Top-level tag is Book. Book has a direct child purchaseTime that is a DateTime expression greater than 2009-05-20T13:00:00.000000.
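
The rules for DateTime values in a search predicate (no time zone, an hour of 24 only with zero minutes and seconds, at most 12 digits of fractional seconds) can be checked before the value is wrapped in xs:dateTime(). The Python sketch below is illustrative only: the function name xs_datetime_literal is hypothetical, the pattern assumes a four-digit year, and month and day ranges are deliberately not validated.

```python
import re

# Date, time, and an optional fraction of 1 to 12 digits. No time zone part.
_DATETIME = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})T(\d{2}):(\d{2}):(\d{2})(\.\d{1,12})?"
)


def xs_datetime_literal(value: str) -> str:
    """Wrap a DateTime string in xs:dateTime() after basic checks."""
    match = _DATETIME.fullmatch(value)
    if match is None:
        # A trailing Z or +hh:mm offset also ends up here.
        raise ValueError("not a DateTime without time zone: " + value)
    hour, minute, second = (int(match.group(i)) for i in (4, 5, 6))
    fraction = match.group(7)
    if hour == 24:
        nonzero_fraction = bool(fraction and fraction.strip(".0"))
        if minute or second or nonzero_fraction:
            raise ValueError("hour 24 requires zero minutes and seconds")
    elif hour > 23:
        raise ValueError("hour out of range")
    if minute > 59 or second > 59:
        raise ValueError("minute or second out of range")
    return 'xs:dateTime("' + value + '")'


print("/Book[purchaseTime > " + xs_datetime_literal("2009-05-20T13:00:00") + "]")
```

Rejecting a time zone on the client side matches the documented syntax error, and reports it before the query reaches the server.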

Contains and excludes in XML markup

The contains and excludes functions are used to perform full text searches within the XML markup. Contains returns true if the query is contained within the target node; excludes returns true if the query is NOT contained within the target node.

For example, find all documents with a top-level tag called email, and a direct descendant called body that contains variations of the phrase “Department budget”:
@xmlxp:''/email[body contains ("department budget")]''

The free text passed to the contains or excludes function is handled in the same way as any other free text search. The search is not case-sensitive, and linguistic variations are considered. The earlier query matches “departments budgets” and also “budget for the department”.

The search can be restricted to an exact match by using the traditional quotation marks, for example, @xmlxp:''/email[body contains("""department budget""")] ''. The quotes indicating an exact match must be doubled so that they are not interpreted as the end of the contains free text string.
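
The difference between the two forms is only in the quoting, which a small Python helper can make explicit. The helper name contains_call and its exact parameter are hypothetical, and refusing free text that already contains a double quote is an assumption made to keep the sketch simple.

```python
def contains_call(free_text: str, exact: bool = False) -> str:
    """Build a contains(...) call for use inside an XML search predicate."""
    if '"' in free_text:
        # Assumption: embedded double quotes are out of scope for this sketch.
        raise ValueError("double quotes in the free text are not handled")
    # Exact match: the delimiting quote plus a doubled quote on each side.
    quote = '"""' if exact else '"'
    return "contains(" + quote + free_text + quote + ")"


print("/email[body " + contains_call("department budget") + "]")
print("/email[body " + contains_call("department budget", exact=True) + "]")
```

The first expression allows linguistic variations of the terms; the second requests the exact phrase.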

Table 7. Contains and excludes examples:
@xmlxp Expression Result
/Book[abstract contains("cat AND dog")] Top-level tag Book that has a child tag abstract which contains linguistic variations of the terms cat and dog.
/Book/@title[. contains("cat OR dog")] Top-level tag Book has an attribute title that contains linguistic variations of either cat or dog.
/Book/Title[. contains("""All good dogs go to heaven""")] Top-level tag Book with a direct child Title that contains all good dogs go to heaven in order, and without linguistic variations being considered.
/Book[abstract excludes("cat AND dog")] Top-level tag Book that has a child tag abstract which does not contain linguistic variations of the terms cat and dog.

Complete string match operator

The = operator with a string argument in a predicate calls for a complete match of all tokens in the string with all tokens in the identified text span. Linguistic equivalents are not considered. The order of the terms searched for is not significant. It is not required that the element or attribute contain only the text that was searched for.

Table 8. Complete string match operator examples:
@xmlxp Expression Result
/Book[@author = "Nicholas Lawrence"] Top-level tag Book that has an attribute author. author must contain the terms Nicholas Lawrence. Linguistic variations on those terms are not considered matches.
/Book[author = """Nicholas Lawrence"""] Top-level tag Book that has a direct descendant author. author must contain the terms Nicholas Lawrence in order. Linguistic variations on those terms are not considered matches.
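
The matching rule for the = operator with a plain string argument can be modeled in a few lines of Python. This is a rough model under stated assumptions, not the server's algorithm: the function name complete_string_match is hypothetical, tokens are taken to be runs of word characters, and the comparison ignores case, which this topic does not state for the = operator.

```python
import re


def complete_string_match(query: str, span: str) -> bool:
    """Model of '=': every query token must appear in the text span."""

    def tokens(text: str):
        # Assumption: tokens are word-character runs, compared without case.
        return re.findall(r"\w+", text.casefold())

    span_tokens = set(tokens(span))
    # Order is not significant, and the span may hold additional tokens.
    return all(token in span_tokens for token in tokens(query))


for span in ("Nicholas Lawrence", "Lawrence, Nicholas J.", "Nicholas Lawrences"):
    print(repr(span), complete_string_match("Nicholas Lawrence", span))
```

The last span fails in this model because no linguistic variation is applied: Lawrences is a different token from Lawrence.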

Logical Operators

The logical operators and and or can be used in predicates.

Table 9. Logical operator examples:
@xmlxp Expression Result
/Book[@author = """Nicholas Lawrence"""]/Price[. < 1000 and @unit = "dollars"] Top-level tag Book that has an attribute author. author must contain the terms Nicholas Lawrence in order. Linguistic variations on those terms are not considered matches.

Book must have a direct child Price with a value < 1000. The Price node must have an attribute @unit that has a value of dollars.

Operator precedence

In XML search predicates, containment operators and comparison operators take precedence over logical operators, and all logical operators have the same precedence.

  • Containment operators are contains and excludes.
  • Comparison operators are =, !=, <, >, <= and >=.
  • Logical operators are and and or.

You can use parentheses to ensure the precedence that you want.
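
Because and and or have the same precedence, a predicate that mixes them is easier to read and safer when the intended grouping is written out. The following Python sketch builds such a predicate; the helper name group and the sample conditions are hypothetical and serve only to show where the parentheses go.

```python
def group(operator: str, *conditions: str) -> str:
    """Join conditions with one logical operator and wrap them in parentheses."""
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    if len(conditions) < 2:
        raise ValueError("at least two conditions are required")
    return "(" + (" " + operator + " ").join(conditions) + ")"


# The price test is grouped explicitly, so the unit test applies to both ranges.
price_range = group("or", ". < 1000", ". > 5000")
expression = '/Book/Price[@unit = "dollars" and ' + price_range + "]"
print(expression)
```

Without the parentheses, the result would depend on how operators of equal precedence are combined, which this topic does not specify.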