XML search for IBM Text Search for Db2 for z/OS

You can index and search XML documents. The XML search grammar uses a subset of the W3C XPath language with extensions for text search. The extensions support range searches of numeric, date, and datetime values that are associated with an XML attribute or element. Structural elements can be used separately or combined with free text in queries.

Documents must be indexed to include the XML markup before the index can be searched using the XPath query syntax. Document indexing is done by using the “FORMAT XML” option at index creation time.

Indexes created on a previous release can be used to perform searches. However, documents indexed on a previous release do not have the information necessary to use all the XML search capabilities available in a newer release. Documents added or updated in the text search index after the upgrade to the new release include the additional information.

An upgrade might result in documents indexed on the prior release not being included in some search results. You can use the SYSPROC.SYSTS_UPDATE stored procedure with the ALLROWS option to rebuild the index and resolve this problem.

To use the CONTAINS and SCORE built-in functions to search XML data, the query string must start with the @xmlxp: query prefix. The prefix is followed by a valid XML Search query expression. The @xmlxp 'opaque' term prefix indicates that a search is performed using the query path expression.

For example: CONTAINS(columnname, '@xmlxp:''query_expression'' ').

The single quotes (') surrounding the query_expression must be doubled because they are contained within an SQL string, in effect, a string within a string.
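The quote doubling is easy to get wrong when a query is assembled in application code. The following Python helper is a minimal sketch of the rule described above, not part of the product: the function names are hypothetical, the column name is not validated, and expressions that themselves contain a single quote are rejected because this sketch does not assume how they should be escaped.

```python
def xmlxp_search_argument(query_expression: str) -> str:
    """Return an SQL string literal usable as the search argument of CONTAINS or SCORE."""
    if "'" in query_expression:
        # Assumption: escaping of single quotes inside the expression itself is
        # out of scope for this sketch, so such input is refused.
        raise ValueError("single quotes inside the query expression are not handled")
    # Value the search engine receives: @xmlxp:'query_expression' followed by a
    # space, mirroring the form shown in the example above.
    inner = "@xmlxp:'" + query_expression + "' "
    # Embed that value in an SQL literal: double every single quote, then wrap
    # the result in single quotes.
    return "'" + inner.replace("'", "''") + "'"


def contains_call(column_name: str, query_expression: str) -> str:
    """Return the text of a CONTAINS invocation (hypothetical helper)."""
    return "CONTAINS(" + column_name + ", " + xmlxp_search_argument(query_expression) + ")"


if __name__ == "__main__":
    print(contains_call("columnname", "/Book[Cost <= 100.50]"))
```

Calling contains_call("columnname", "query_expression") reproduces the example string shown above, including the doubled quotes.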

The following list highlights the key features of XML search:

XML structural search

By including special opaque XML terms in queries, you can search XML documents for structural elements and text that is scoped by those elements. Structural elements are tag names, attribute names, and attribute values. Element and tag names are case sensitive.

XML query tokenization

Tokenization is the process of parsing input into tokens. Free text in XML query terms is tokenized the same way that text in non-XML query terms is tokenized. An exception is that nested opaque terms are not supported. Free text search is not case sensitive.

XML Schema and DTD

XML schemas associated with an XML document are not downloaded, and default values are not indexed.

Numeric values

Predicates that compare attribute or element values to numbers are supported.

Element values

Predicates that compare element values to numbers or dates are supported. The element must contain only the number or date. Leading and trailing white space is ignored.

String values

Use of the = operator for a string argument in a predicate requires a complete match of all key words in the string with tokens in the identified text span. The order of the tokens is not significant when matching is performed.

Datetime values

Predicates that compare date or datetime attributes or elements are supported.

Path expressions

Path expressions are only allowed in the forward direction, and only on a single axis.

You should start path expressions with a leading / or //. These leading characters indicate that the initial context of the expression is the root node of the document. When the leading / or // is omitted, the expression is matched at any level. For example, 'Sentences' is treated as '//Sentences'. This behavior is defined for compatibility with prior releases, and does not follow the W3C or SQL/XML standard.
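The defaulting rule for a missing leading / or // can be stated in a few lines. This Python sketch only restates the rule from the paragraph above; the function name is hypothetical and the code does not validate the expression.

```python
def effective_path(expression: str) -> str:
    """Return the path expression as it is interpreted by the rule above."""
    if expression.startswith("/"):
        # Covers both a leading / and a leading //: the expression is used as written.
        return expression
    # No leading slash: the expression is matched at any level.
    return "//" + expression


if __name__ == "__main__":
    print(effective_path("Sentences"))
```

Writing the leading / or // explicitly in application code avoids relying on this compatibility behavior.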

The following tables show the supported path expressions and some examples.

Table 1. Path expressions
@xmlxp Expression Description
TagName Selects a tag named TagName, and all children of that tag.
@AttributeName Selects an attribute named AttributeName.
/ Selects from root node.
// Selects matching tags and attributes that are descendants of the current position and match the expression.
. Self: the current tag or element node.
Table 2. Path expression examples
@xmlxp Expression Result
/Document Returns all documents with a top-level tag Document.
//Document Returns all documents with a tag Document at any level.
/Document/Child1 Returns all documents with a top-level tag Document that has a direct child tag Child1.
/Document//Child1 Returns all documents with a top-level tag Document that has a descendant tag Child1 at any level.
/Root/@attr1 Returns all documents with a top-level tag Root that has an attribute attr1.
/Root//@attr1 Returns all documents with a top-level tag Root with an attribute attr1 on that root tag or any descendant tag.
//@attr1 Returns all documents that have an attribute attr1 at any level.
Note: The XML search expression must have an actual tag or attribute name in the relative path expression. The characters / and // by themselves are not valid search queries.

Wildcard character support

In the path expression, you can use the special wildcard character * to indicate exactly one tag, with any name.

Trailing path expression wildcard characters are ignored.

The following uses of wildcard characters are not supported:
  • An expression that references only wildcard characters and no specific elements or attributes.
  • A wildcard attribute at any level: /Tag/@*.
  • A wildcard character that immediately precedes a predicate expression: /Root/*[//anytag].
  • A wildcard character that is used in a predicate comparison: /Root[* > 5].
  • A wildcard character that is used as an XML namespace prefix: //*:tagname.
  • A wildcard character that is prefixed with an XML namespace prefix: //ns:*.
  • A wildcard character that is used as part of a tag name: /start*.

The following table shows examples of wildcard characters in path expressions.

Table 3. Wildcard character in path expressions
@xmlxp Expression Result
/Root/*/T1 All documents having a top-level tag Root that has a descendant tag T1 with one intermediate level.
/Root/*//T1 All documents having a top-level tag Root that has a descendant tag T1 with one or more intermediate levels.
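
The difference between the two rows can be tried out with a toy model. The sketch below uses the limited XPath support in Python's xml.etree.ElementTree, which is an assumption made for illustration and is not the Text Search engine; the function name is hypothetical, and only absolute paths that start with a single / are handled.

```python
import xml.etree.ElementTree as ET


def path_matches(xml_text: str, expression: str) -> bool:
    """Toy check of whether a document has a node selected by an absolute path."""
    if not expression.startswith("/") or expression.startswith("//"):
        raise ValueError("this sketch handles only paths that start with a single /")
    root = ET.fromstring(xml_text)
    steps = expression.split("/")  # "/Root/*/T1" -> ["", "Root", "*", "T1"]
    top_level = steps[1]
    if root.tag != top_level:
        return False
    remainder = expression[1 + len(top_level):]  # for example "/*/T1" or "/*//T1"
    if not remainder:
        return True
    # ElementTree evaluates the rest of the path relative to the root element.
    return root.find("." + remainder) is not None


if __name__ == "__main__":
    one_level = "<Root><A><T1/></A></Root>"
    two_levels = "<Root><A><B><T1/></B></A></Root>"
    for doc in (one_level, two_levels):
        print(path_matches(doc, "/Root/*/T1"), path_matches(doc, "/Root/*//T1"))
```

In this model, a document with T1 two levels below the intermediate tag matches only the second expression, and a document where T1 is a direct child of Root matches neither.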

Predicates

Predicates are used to specify a value or condition that an element or attribute node must satisfy. Predicates are always enclosed in square brackets: [].

Table 4. Predicate examples
@xmlxp Expression Result
/Book[Sentences] Top-level tag is Book and must have a direct child Sentences.
/Book[.//Sentences and .//Author] Top-level tag is Book and must have both Sentences and Author descendants. Each descendant can be at any level below Book.

Because path expressions are always in the forward direction, and limited to a single axis, path expressions in predicates must be relative to the current node. /Book[/Root] and /Book[//Root] are not valid, because in both cases the predicate path expression begins with the top-level tag 'Root' instead of the current node.

Numeric comparisons

IBM Text Search for Db2 for z/OS supports the =, <=, >=, >, <, and != operators for comparisons of elements and attributes to integers and floating point values.

Elements have their numeric values indexed only if they are simple elements. A simple element must not contain additional characters (other than white space) and must not have any descendant elements. Complex elements are indexed as text only.
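
The simple-element rule can be illustrated with a small check. This Python sketch is a toy model under stated assumptions, not the indexer: the function name is hypothetical, and Python's float() is used as a stand-in for the numeric formats the product recognizes, which may differ (float() also accepts values such as "nan" or "1e5").

```python
import xml.etree.ElementTree as ET
from typing import Optional


def indexed_numeric_value(element: ET.Element) -> Optional[float]:
    """Return the number a simple element holds, or None if it is not a simple numeric element."""
    if len(element) > 0:
        # The element has child elements, so it is complex and treated as text only.
        return None
    # Leading and trailing white space is ignored.
    text = (element.text or "").strip()
    try:
        return float(text)
    except ValueError:
        # Additional characters are present, so no numeric value is available.
        return None


if __name__ == "__main__":
    for xml_text in ("<Cost> 100.50 </Cost>", "<Cost>USD 100</Cost>",
                     "<Cost><Amount>100</Amount></Cost>"):
        print(xml_text, indexed_numeric_value(ET.fromstring(xml_text)))
```

Only the first element in the usage example would be usable in a numeric predicate such as /Book[Cost <= 100.50] under this model.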

Table 5. Numeric comparison examples
@xmlxp Expression Result
/Book[@id_num = 12345] Top-level tag is Book and must have an attribute id_num with a value of 12345.
/Book[Cost <= 100.50] Top-level tag is Book. Book has a direct child element Cost with a numeric value less than or equal to 100.50.

Date and datetime comparisons

IBM Text Search for Db2 for z/OS supports the following operators for comparisons of elements and attributes to date and datetime values:

= <= >= > < !=

Elements have their date and datetime values indexed only if they are simple elements. A simple element must not contain additional characters (other than white space) and must not have any descendant elements. Complex elements are indexed as text only.

During indexing, attribute values and text contained within simple XML tags are examined. If the text is determined to match an ISO date or datetime format, it is indexed as a date or datetime that can be searched in a predicate.

During a search, the date or datetime value must be enclosed within an xs:date() or xs:dateTime() function call in order to be recognized as the correct data type.

An XML datetime data type in an XML document can specify a time zone value. However, the Text Search server truncates time zone values during indexing. Therefore, time zones are not considered during XML searches that involve date or datetime data types.

In addition, a datetime with an hour of 24 is permitted only if the minutes and seconds are zero. It will be treated as a value between the last instant of that day and the first instant of the next day.

When a date or datetime value is specified in an XML search predicate, a syntax error occurs if a time zone is specified on the value.

The datetime data type supports up to 12 digits of fractional seconds.
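
The time zone rules can be sketched as two small Python helpers. Both function names are hypothetical, the regular expression is an assumption that covers only the common Z and +hh:mm or -hh:mm suffixes, and the choice between xs:date and xs:dateTime is made simply by looking for the T separator.

```python
import re

# Assumption: a time zone is a trailing "Z" or a trailing +hh:mm / -hh:mm offset.
_TIME_ZONE = re.compile(r"(Z|[+-]\d{2}:\d{2})$")


def drop_time_zone(value: str) -> str:
    """Model of indexing: the time zone suffix is truncated, not converted."""
    return _TIME_ZONE.sub("", value)


def xs_literal(value: str) -> str:
    """Wrap a date or datetime search value in xs:date() or xs:dateTime()."""
    if _TIME_ZONE.search(value):
        # A time zone on a search value is a syntax error, so refuse it early.
        raise ValueError("search values must not carry a time zone: " + value)
    function = "xs:dateTime" if "T" in value else "xs:date"
    return f'{function}("{value}")'


if __name__ == "__main__":
    print(drop_time_zone("2009-05-20T13:00:00+02:00"))
    print("/Book[@publishDate > " + xs_literal("2000-01-01") + "]")
```

Because the offset is dropped rather than applied, two datetimes that denote the same instant in different time zones compare as different values in this model.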

Table 6. Date and datetime comparison examples
@xmlxp Expression Result
/Book[@publishDate > xs:date("2000-01-01")] Top-level tag is Book. Book has an attribute publishDate that is greater than the date of 2000-01-01.
/Book[purchaseTime > xs:dateTime("2009-05-20T13:00:00")] Top-level tag is Book. Book has a direct child purchaseTime that is a datetime expression greater than 2009-05-20T13:00:00.000000.

Contains and excludes in XML markup

The contains and excludes functions are used to perform full text searches within the XML markup. Contains returns true if the query is contained within the target node; excludes returns true if the query is NOT contained within the target node.

For example, find all documents with a top-level tag called email, and a direct descendant called body that contains variations of the phrase “Department budget”.
@xmlxp:''/email[body contains("department budget")]''

The free text passed to the contains or excludes function is handled in the same way as any other free text search. The search is not case-sensitive, and linguistic variations are considered. The earlier query matches “departments budgets” and also “budget for the department”.

The search can be restricted to an exact match by using the traditional quotation marks, for example, @xmlxp:''/email[body contains("""department budget""")] ''. The quotes indicating an exact match must be doubled so that they are not interpreted as the end of the contains free text string.
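
The doubling of the exact-match quotes can be sketched as a small Python helper. The function name is hypothetical, and free text that already contains a double quote is rejected because this sketch does not assume how such text should be escaped.

```python
def text_predicate(function: str, free_text: str, exact: bool = False) -> str:
    """Build a contains(...) or excludes(...) call for use inside a predicate."""
    if function not in ("contains", "excludes"):
        raise ValueError("function must be 'contains' or 'excludes'")
    if '"' in free_text:
        # Assumption: embedded double quotes are out of scope for this sketch.
        raise ValueError("double quotes inside the free text are not handled")
    if exact:
        # The exact-match quotes live inside the quoted free text string,
        # so each one is written twice.
        free_text = '""' + free_text + '""'
    return f'{function}("{free_text}")'


if __name__ == "__main__":
    print("/email[body " + text_predicate("contains", "department budget") + "]")
    print("/email[body " + text_predicate("contains", "department budget", exact=True) + "]")
```

The second line printed by the usage example contains three consecutive double quotes on each side of the phrase: one that delimits the free text string and two that stand for the exact-match quote.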

Table 7. Contains and excludes examples
@xmlxp Expression Result
/Book[abstract contains("cat AND dog")] Top-level tag Book that has a child tag abstract which contains linguistic variations of the terms cat and dog.
/Book/@title[. contains("cat OR dog")] Top-level tag Book has an attribute title that contains linguistic variations of either cat or dog.
/Book/Title[. contains("""All good dogs go to heaven""")] Top-level tag Book with a direct child Title that contains all good dogs go to heaven in order, and without linguistic variations being considered.
/Book[abstract excludes("cat AND dog")] Top-level tag Book that has a child tag abstract which does not contain linguistic variations of the terms cat and dog.

Complete string match operator

The = operator with a string argument in a predicate calls for a complete match of all tokens in the string with all tokens in the identified text span. Linguistic equivalents are not considered. The order of the terms searched for is not significant. It is not required that the element or attribute contain only the text that was searched for.

Table 8. Complete string match operator examples
@xmlxp Expression Result
/Book[@author = "Nicholas Lawrence"] Top-level tag Book that has an attribute author. author must contain the terms Nicholas Lawrence. Linguistic variations on those terms are not considered matches.
/Book[author = """Nicholas Lawrence"""] Top-level tag Book that has a direct descendant author. author must contain the terms Nicholas Lawrence in order. Linguistic variations on those terms are not considered matches.

Logical operators

The logical operators AND and OR can be used in predicates.

Table 9. Logical operator examples
@xmlxp Expression Result
/Book[@author = """Nicholas Lawrence"""]/Price[. < 1000 and @unit = "dollars"] Top-level tag Book that has an attribute author. author must contain the terms Nicholas Lawrence in order. Linguistic variations on those terms are not considered matches. Book must have a direct child Price with a value < 1000. The Price node must have an attribute unit that has a value of dollars.

Operator precedence

In XML search predicates, containment operators and comparison operators take precedence over logical operators, and all logical operators have the same precedence.

  • Containment operators are contains and excludes.
  • Comparison operators are:
    = != < > <= >=
  • Logical operators are AND and OR.

You can use parentheses to ensure the precedence that you want.
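
Because all logical operators share one precedence level, one defensive habit is to parenthesize every group when a predicate is assembled programmatically. The Python sketch below illustrates that habit only; the function names are hypothetical, and the lowercase spelling of the operators follows the table examples in this topic (the lowercase or is an assumption).

```python
def all_of(*conditions: str) -> str:
    """Join conditions with 'and' and wrap the group in parentheses."""
    if not conditions:
        raise ValueError("at least one condition is required")
    return "(" + " and ".join(conditions) + ")"


def any_of(*conditions: str) -> str:
    """Join conditions with 'or' and wrap the group in parentheses."""
    if not conditions:
        raise ValueError("at least one condition is required")
    return "(" + " or ".join(conditions) + ")"


if __name__ == "__main__":
    condition = any_of(all_of('@unit = "dollars"', ". < 1000"), "@discounted = 1")
    print("/Book/Price[" + condition + "]")
```

With explicit grouping, the intended evaluation no longer depends on how operators of equal precedence are combined.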