XML search
You can index and search XML documents. The XML search grammar uses a subset of the W3C XPath language with extensions for text search. The extensions support range searches of numeric, Date, and DateTime values that are associated with an XML attribute or element. Structural elements can be used separately, or combined with free text in queries.
Documents must be indexed to include the XML markup before the index can be searched using the xmlxp query syntax. Document indexing is done by using the “FORMAT XML” option at index creation time.
Indexes created on a previous release can be used to perform searches. However, documents indexed on a previous release do not have the information necessary to use all the XML search capabilities available in a newer release. Documents added or updated in the text search index after the upgrade to the new release include the additional information.
An upgrade might result in documents indexed on the prior release not being included in some search results. The SYSPROC.SYSTS_REPRIMEINDEX stored procedure can be used to rebuild the index and resolve this problem.
To use the OMNIFIND CONTAINS and SCORE built-in functions to search XML data, the query string must start with the @xmlxp: query prefix. The prefix is followed by a valid XML Search query expression. The @xmlxp opaque term prefix indicates that a search is performed using the query path expression.
For example: CONTAINS(columnname, '@xmlxp:''query_expression'' ').
The single quotes ' ' surrounding the query_expression must be doubled because they are contained within an SQL string, in effect, a string within a string.
The @xpath: opaque term prefix that was used in previous releases of OmniFind Text Search Server for DB2® for i is supported for compatibility with earlier versions. However, it has been deprecated and is not recommended.
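The quoting rule can be sketched as a small helper that builds the SQL text. This is only an illustration: the function names and the column name xml_doc are hypothetical, and the sketch simply rejects expressions that contain a single quote rather than guessing how they should be escaped.

```python
def xmlxp_sql_literal(expression: str) -> str:
    """Wrap an XML search expression in an SQL string literal (hypothetical helper)."""
    if "'" in expression:
        # Assumption: escaping single quotes inside the expression is out of scope here.
        raise ValueError("single quotes inside the expression are not handled")
    # The quotes around the expression are doubled ('') because they sit
    # inside the outer SQL string literal. The layout follows the
    # CONTAINS example in the text, including the space before the final quote.
    return "'@xmlxp:''" + expression + "'' '"


def contains_call(column: str, expression: str) -> str:
    """Build the text of a CONTAINS call; `column` is a hypothetical column name."""
    return f"CONTAINS({column}, {xmlxp_sql_literal(expression)})"


print(contains_call("xml_doc", "/Book[Cost <= 100.50]"))
# CONTAINS(xml_doc, '@xmlxp:''/Book[Cost <= 100.50]'' ')
```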
The following list highlights the key features of XML search:
XML structural search
By including special opaque XML terms in queries, you can search XML documents for structural elements and text that is scoped by those elements. Structural elements are tag names, attribute names, and attribute values. Element and tag names are case sensitive.
XML query tokenization
Tokenization is the process of parsing input into tokens. Free text in XML query terms is tokenized the same way that text in non-XML query terms is tokenized. An exception is that nested opaque terms are not supported. Free text search is not case sensitive.
XML Schema and DTD
Any XML schema associated with the XML document is not downloaded, and default values are not indexed.
Numeric values
Predicates that compare attribute or element values to numbers are supported.
Element values
Predicates that compare element values to numbers or dates are supported. The element containing the date or number must be an XML element that contains only the number or date. Leading and trailing white space are ignored.
String values
Use of the = operator for a string argument in a predicate requires a complete match of all key words in the string with tokens in the identified text span. The order of the tokens is not significant when matching is performed.
DateTime values
Predicates that compare Date or DateTime attributes or elements are supported.
Path expressions:
@xmlxp Expression | Description |
---|---|
TagName | Selects a tag named TagName, and all children of that tag. |
@AttributeName | Selects an attribute named @AttributeName. |
/ | Selects from the root node. |
// | Selects matching tags and attributes that are descendants of the current position and match the expression. |
. | Self: the current tag or element node. |
@xmlxp Expression | Result |
---|---|
/Document | Returns all documents with a top-level tag Document. |
//Document | Returns all documents with a tag Document at any level. |
/Document/Child1 | Returns all documents with a top-level tag Document that has a direct child tag Child1. |
/Document//Child1 | Returns all documents with a top-level tag Document that has a descendant tag Child1 at any level. |
/Root/@attr1 | Returns all documents with a top-level tag Root with an attribute attr1. |
/Root//@attr1 | Returns all documents with a top-level tag Root with an attribute attr1 on that root tag or any descendant tag. |
//@attr1 | Returns all documents that have an attribute @attr1 at any level. |
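The difference between a direct child (/) and a descendant at any level (//) can be illustrated with Python's standard xml.etree.ElementTree module. This is only a sketch of the matching semantics described in the table; it does not use the text search server, and the sample document is invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document: Child1 is nested one level below Section.
doc = ET.fromstring(
    "<Document><Section><Child1 attr1='x'/></Section></Document>"
)

# /Document : the top-level tag is Document.
is_document = doc.tag == "Document"

# /Document/Child1 : Child1 must be a direct child of the top-level tag.
has_direct_child = is_document and doc.find("Child1") is not None

# /Document//Child1 : Child1 may appear at any level below the top-level tag.
has_descendant = is_document and doc.find(".//Child1") is not None

# //@attr1 : some element at any level carries the attribute attr1.
has_attr_anywhere = any("attr1" in el.attrib for el in doc.iter())

print(has_direct_child, has_descendant, has_attr_anywhere)  # False True True
```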
/ and // by themselves are not valid search queries. Path expressions are only allowed in the forward direction, and only on a single axis.
It is recommended that a path expression start with a leading / or //. This indicates that the expression's initial context is the document's root node. When the leading / or // is omitted, the expression is matched at any level. In other words, 'Sentences' is treated as '//Sentences'. The behavior is defined this way to be compatible with prior releases, and does not follow the W3C or SQL/XML standard.
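The rule for an omitted leading slash can be written down as a one-line normalization. The function name effective_path is hypothetical; the sketch only restates the behavior described above and is not part of the product.

```python
def effective_path(expression: str) -> str:
    """Return the path as it is interpreted when the leading / or // is omitted."""
    if expression.startswith("/"):
        # Already anchored with / or //; used as written.
        return expression
    # No leading slash: the expression is matched at any level.
    return "//" + expression


print(effective_path("Sentences"))        # //Sentences
print(effective_path("/Book/Sentences"))  # /Book/Sentences
```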
Path expression wildcard support
In the path expression, the special wildcard character * can be used to indicate exactly one tag, with any name.
Trailing path expression wildcards are ignored.
The following uses of wildcards are not supported:
- An expression that references only wildcards and no specific elements or attributes.
- A wildcard attribute at any level: /Tag/@*.
- A wildcard that immediately precedes a predicate expression: /Root/*[//anytag].
- A wildcard that is used in a predicate comparison: /Root[* > 5].
- A wildcard as an XML namespace prefix: //*:tagname.
- A wildcard prefixed with an XML namespace prefix: //ns:*.
- A wildcard character used as part of a tag name: /start*.
@xmlxp Expression | Result |
---|---|
/Root/*/T1 | All documents having a top-level tag Root that has a descendant tag T1 with one intermediate level. |
/Root/*//T1 | All documents having a top-level tag Root that has a descendant tag T1 with one or more intermediate levels. |
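The "exactly one tag" meaning of * can be checked against small sample documents with Python's xml.etree.ElementTree, whose limited XPath support has the same notion of * and //. This is an illustrative sketch with invented documents, not a call into the text search server.

```python
import xml.etree.ElementTree as ET

# Hypothetical documents with zero, one, and two levels between Root and T1.
no_level = ET.fromstring("<Root><T1/></Root>")
one_level = ET.fromstring("<Root><A><T1/></A></Root>")
two_levels = ET.fromstring("<Root><A><B><T1/></B></A></Root>")


def matches(root, relative_path):
    """True if the top-level tag is Root and the relative path selects a node."""
    return root.tag == "Root" and root.find(relative_path) is not None


# /Root/*/T1 : exactly one intermediate level.
print([matches(d, "./*/T1") for d in (no_level, one_level, two_levels)])
# [False, True, False]

# /Root/*//T1 : one or more intermediate levels.
print([matches(d, "./*//T1") for d in (no_level, one_level, two_levels)])
# [False, True, True]
```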
Predicates
Predicates are used to specify a value or condition that an element or attribute node must satisfy. Predicates are always enclosed in square brackets: [].
@xmlxp Expression | Result |
---|---|
/Book[Sentences] | Top-level tag is Book and must have a direct child Sentences. |
/Book[.//Sentences and .//Author] | Top-level tag is Book and must have both Sentences and Author descendants. Each descendant can be at any level below Book. |
Because path expressions are always in the forward direction, and limited to a single axis, path expressions in predicates must be relative to the current node. /Book[/Root] and /Book[//Root] are not valid, because in both cases the predicate path expression begins with the top-level tag 'Root' instead of the current node.
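The two predicate forms in the table can be mimicked with relative lookups from the Book node, which is the sense in which predicate paths are relative to the current node. The sketch below uses Python's xml.etree.ElementTree on an invented document and only illustrates the described semantics.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: Sentences is nested, Author is a direct child.
book = ET.fromstring(
    "<Book><Chapter><Sentences>text</Sentences></Chapter>"
    "<Author>A. Writer</Author></Book>"
)

# /Book[Sentences] : Sentences must be a direct child of Book.
direct_child = book.tag == "Book" and book.find("Sentences") is not None

# /Book[.//Sentences and .//Author] : both may be at any level below Book.
both_descendants = (
    book.tag == "Book"
    and book.find(".//Sentences") is not None
    and book.find(".//Author") is not None
)

print(direct_child, both_descendants)  # False True
```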
Numeric comparisons
OMNIFIND supports the =, <=, >=, >, <, and != operators for comparisons of elements and attributes to integers and floating point values.
Elements have only their numeric values indexed if they are simple elements. They must not contain additional characters (other than white space) and must not have any descendant elements. Complex elements are indexed as text only.
@xmlxp Expression | Result |
---|---|
/Book[@id_num = 12345] | Top-level tag is Book and must have an attribute id_num with a value of 12345. |
/Book[Cost <= 100.50] | Top-level tag is Book. Book has a direct child element Cost with a numeric value less than or equal to 100.50. |
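The simple-element rule can be sketched as a check that an element has no child elements and that its text, after trimming white space, parses as a number. The helper name numeric_value is hypothetical, and Python's float() is only a stand-in for the indexer's number parser, which may accept a different set of formats.

```python
import xml.etree.ElementTree as ET


def numeric_value(element):
    """Return the numeric value of a simple element, or None otherwise."""
    if len(element) > 0:
        # Descendant elements make this a complex element (text only).
        return None
    text = (element.text or "").strip()  # surrounding white space is ignored
    try:
        # Assumption: float() approximates the indexer's numeric parsing.
        return float(text)
    except ValueError:
        return None


# Hypothetical document: Cost is purely numeric, Price has extra characters.
book = ET.fromstring(
    "<Book id_num='12345'><Cost> 99.95 </Cost><Price>99.95 USD</Price></Book>"
)

cost = numeric_value(book.find("Cost"))
price = numeric_value(book.find("Price"))

# /Book[Cost <= 100.50]
print(cost is not None and cost <= 100.50)  # True
print(price)                                # None
```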
Date and DateTime comparisons
OMNIFIND supports the =, <=, >=, >, <, and != operators for comparisons of elements and attributes to Date and DateTime values.
Simple elements have only their DateTime values indexed. These elements must not contain additional characters (other than white space) and must not have any descendant elements. Complex elements are indexed as text only.
During indexing, attribute values and text contained within simple XML tags are examined. If the text is determined to match an ISO Date or DateTime format, it is indexed as a Date or DateTime that can be searched in a predicate.
During a search, the Date or DateTime value must be enclosed within an xs:date() or xs:dateTime() function call in order to be recognized as the correct data type.
An XML DateTime data type in an XML document can specify a timezone value. However, when a DateTime is indexed, the Text Search server truncates timezone values during indexing. Therefore, timezones are not considered during XML searches that involve Date or DateTime data types.
In addition, a DateTime with an hour of 24 is permitted only if the minutes and seconds are zero. It will be treated as a value between the last instant of that day and the first instant of the next day.
When a Date or DateTime value is specified in an XML search predicate, a syntax error occurs if a time zone is specified on the value.
The DateTime data type supports up to 12 digits of fractional seconds.
@xmlxp Expression | Result |
---|---|
/Book[@publishDate > xs:date("2000-01-01")] | Top-level tag is Book. Book has an attribute publishDate that is greater than the date of 2000-01-01. |
/Book[purchaseTime > xs:dateTime("2009-05-20T13:00:00")] | Top-level tag is Book. Book has a direct child purchaseTime that is a DateTime expression greater than 2009-05-20T13:00:00.000000. |
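The time zone handling can be sketched in two small helpers: one that drops a time zone suffix the way the indexing step is described, and one that builds an xs:dateTime() literal and rejects a time zone in the search value. Both function names are hypothetical. Python's datetime is used only for the comparison; it keeps at most microseconds and does not accept an hour of 24, so those cases are outside this sketch.

```python
import re
from datetime import datetime

# Trailing "Z" or a "+hh:mm" / "-hh:mm" offset on an ISO DateTime.
_TIME_ZONE = re.compile(r"(Z|[+-]\d{2}:\d{2})$")


def indexed_datetime(text: str) -> datetime:
    """Parse a document value, dropping any time zone instead of converting it."""
    return datetime.fromisoformat(_TIME_ZONE.sub("", text.strip()))


def xs_datetime(value: str) -> str:
    """Build an xs:dateTime() literal for a predicate; time zones are rejected."""
    if _TIME_ZONE.search(value):
        raise ValueError("a time zone is not allowed in a search predicate")
    return f'xs:dateTime("{value}")'


# A document value with an offset is compared as if the offset were absent.
stored = indexed_datetime("2009-05-20T14:30:00+02:00")
print(stored > datetime(2009, 5, 20, 13, 0, 0))        # True
print("/Book[purchaseTime > " + xs_datetime("2009-05-20T13:00:00") + "]")
# /Book[purchaseTime > xs:dateTime("2009-05-20T13:00:00")]
```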
Contains and excludes in XML markup
The contains and excludes functions are used to perform full text searches within the XML markup. Contains returns true if the query is contained within the target node; excludes returns true if the query is NOT contained within the target node.
For example, the following query matches documents with a top-level tag called email, and a direct descendant called body that contains variations of the phrase "Department budget": @xmlxp:''/email[body contains ("department budget")]''
The free text passed to the contains or excludes function is handled in the same way as any other free text search. The search is not case-sensitive, and linguistic variations are considered. The earlier query matches “departments budgets” and also “budget for the department”.
The search can be restricted to an exact match by using the traditional quotation marks, for example, @xmlxp:''/email[body contains("""department budget""")] ''. The quotes indicating an exact match must be doubled so that they are not interpreted as the end of the contains free text string.
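The doubling of the exact-match quotes can be sketched as a helper that builds the predicate text. The function name text_predicate and its parameters are hypothetical, and the sketch refuses free text that already contains a double quote rather than guessing how it should be escaped.

```python
def text_predicate(node: str, text: str, exact: bool = False,
                   negate: bool = False) -> str:
    """Build a contains/excludes predicate body (hypothetical helper)."""
    if '"' in text:
        # Assumption: embedded double quotes are out of scope for this sketch.
        raise ValueError("double quotes inside the free text are not handled")
    if exact:
        # Exact-match quotes are doubled so they do not end the string:
        # one delimiter plus two doubled quotes gives three in a row.
        text = '""' + text + '""'
    function = "excludes" if negate else "contains"
    return f'{node} {function}("{text}")'


print("/email[" + text_predicate("body", "department budget") + "]")
# /email[body contains("department budget")]
print("/email[" + text_predicate("body", "department budget", exact=True) + "]")
# /email[body contains("""department budget""")]
```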
@xmlxp Expression | Result |
---|---|
/Book[abstract contains("cat AND dog")] | Top-level tag Book that has a child tag abstract which contains linguistic variations of the terms cat and dog. |
/Book/@title[. contains("cat OR dog")] | Top-level tag Book has an attribute title that contains linguistic variations of either cat or dog. |
/Book/Title[. contains("""All good dogs go to heaven""")] | Top-level tag Book with a direct child Title that contains all good dogs go to heaven in order, and without linguistic variations being considered. |
/Book[abstract excludes("cat AND dog")] | Top-level tag Book that has a child tag abstract which does not contain linguistic variations of the terms cat and dog. |
Complete string match operator
The = operator with a string argument in a predicate calls for a complete match of all tokens in the string with all tokens in the identified text span. Linguistic equivalents are not considered. The order of the terms searched for is not significant. It is not required that the element or attribute contain only the text that was searched for.
@xmlxp Expression | Result |
---|---|
/Book[@author = "Nicholas Lawrence"] | Top-level tag Book that has an attribute author. author must contain the terms Nicholas Lawrence. Linguistic variations on those terms are not considered matches. |
/Book[author = """Nicholas Lawrence"""] | Top-level tag Book that has a direct descendant author. author must contain the terms Nicholas Lawrence in order. Linguistic variations on those terms are not considered matches. |
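The matching rule for the = operator can be approximated as a token comparison: every query token must appear in the text span, extra text is allowed, and order matters only for the triple-quoted form. The function names are hypothetical. Splitting on word characters and folding case are assumptions made for this sketch; the actual tokenizer may behave differently.

```python
import re


def tokens(text: str) -> list:
    """Assumption: tokens are runs of word characters, compared without case."""
    return re.findall(r"\w+", text.lower())


def complete_match(query: str, span: str, in_order: bool = False) -> bool:
    """True if every query token occurs in the span; no linguistic variants."""
    wanted, present = tokens(query), tokens(span)
    if in_order:
        # Triple-quoted form: the tokens must appear as one consecutive run.
        n = len(wanted)
        return any(present[i:i + n] == wanted
                   for i in range(len(present) - n + 1))
    return all(token in present for token in wanted)


print(complete_match("Nicholas Lawrence", "Lawrence, Nicholas J."))        # True
print(complete_match("Nicholas Lawrence", "Lawrence, Nicholas J.", True))  # False
print(complete_match("Nicholas Lawrence", "Nick Lawrence"))                # False
```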
Logical Operators
The logical operators and and or can be used in predicates.
@xmlxp Expression | Result |
---|---|
/Book[@author = """Nicholas Lawrence"""]/Price[. < 1000 and @unit = "dollars"] | Top-level tag Book that has an attribute author. author must contain the terms Nicholas Lawrence in order. Linguistic variations on those terms are not considered matches. Book must also have a direct child Price with a numeric value less than 1000 and an attribute unit that contains the term dollars. |
Operator precedence
In XML search predicates, containment operators and comparison operators take precedence over logical operators, and all logical operators have the same precedence.
- Containment operators are contains and excludes.
- Comparison operators are =, !=, <, >, <=, and >=.
- Logical operators are and and or.
You can use parentheses to ensure the precedence that you want.
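The practical effect of equal precedence can be seen by comparing it with a language where and binds tighter than or. The sketch below assumes that operators of equal precedence are applied left to right; the text above does not state the evaluation order, so treat that as an assumption and use parentheses in real queries. The function name is hypothetical.

```python
def evaluate_equal_precedence(values, operators):
    """Combine booleans with 'and'/'or' at equal precedence.

    Assumption: evaluation proceeds left to right.
    """
    result = values[0]
    for operator, value in zip(operators, values[1:]):
        if operator == "and":
            result = result and value
        else:
            result = result or value
    return result


# A or B and C, with A = True, B = False, C = False.
equal = evaluate_equal_precedence([True, False, False], ["or", "and"])
python_rule = True or False and False  # Python: 'and' binds tighter than 'or'

print(equal)        # False, read as (A or B) and C
print(python_rule)  # True, read as A or (B and C)
```

Writing the predicate as A or (B and C), or as (A or B) and C, removes the ambiguity either way.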