Structural full-text search in XML documents

Db2® Text Search provides XML search, which lets you search XML documents by their structure as well as by their text.

By using a subset of the XPath language with extensions for text search, XML search indexes and searches XML documents. You can use structural elements (tag names, attribute names, and attribute values) separately or combine them with free text in queries.
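
As a rough illustration of how such a query might be handed to Db2 from application code, the sketch below builds a CONTAINS predicate around an XML search query. It assumes the `@xmlxp:'...'` search-argument form of the Db2 Text Search CONTAINS function, which this section does not itself show, so check it against the search-argument reference before relying on it. The column name `ABSTRACT` and both helper functions are hypothetical.

```python
def xml_search_argument(xpath_query: str) -> str:
    """Wrap an XML search query in the assumed @xmlxp:'...' form."""
    if "'" in xpath_query:
        # Keep the sketch simple: the sample queries only use double quotes.
        raise ValueError("single quotes in the query are not handled here")
    return f"@xmlxp:'{xpath_query}'"


def contains_predicate(column: str, xpath_query: str) -> str:
    """Build an SQL predicate string (for illustration only)."""
    # Double each single quote so the argument fits inside an SQL string literal.
    argument = xml_search_argument(xpath_query).replace("'", "''")
    return f"CONTAINS({column}, '{argument}') = 1"


# ABSTRACT is a hypothetical text-indexed XML column.
predicate = contains_predicate("ABSTRACT", '/book[@author contains("barnes")]')
print(predicate)
```

The string concatenation is only there to make the quoting visible. It is not a recommendation for building SQL from untrusted input.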

The following search features are supported by XML search:
  • Boolean operators (basic search)
  • Exact match
  • Fuzzy search
  • Proximity search
  • Stop words
  • Synonyms
  • Wildcard characters
In addition to the search features previously listed, XML search also includes the following key features:
XML structural search
By using XML search syntax in text search queries, you can search XML documents for structural elements (tag names, attribute names, and attribute values) and text that is scoped by those elements. Note that plain searches do not search the attribute field in an XML document.
XML query tokenization
Text that is used as XML query terms in an XML search predicate expression is tokenized in the same way as text in non-XML query terms, except that spelling corrections, fielded terms, and nested XML search terms are unsupported. Synonyms, wildcard characters, phrases, and lemmatization are supported.
Disregarding of XML namespaces
Namespace prefixes are not retained when XML tag and attribute names are indexed. You can index and search XML documents that declare and use namespaces, but namespace prefixes are discarded during indexing and removed from XML search queries.
Numeric values
Predicates comparing attribute values to numbers are supported.
Complete match
The operator = (equal sign) with a string argument in a predicate means that a complete match of all tokens in the string with all tokens in the identified text span is required, with the order being significant.
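
Two of the behaviors described above can be mimicked in a few lines of Python: namespace prefixes being dropped from names, and the difference between a contains predicate and a complete match with =. This is a minimal sketch, not the Db2 implementation. The tokenizer is a simplified stand-in (lowercased runs of letters and digits, with no stop words, synonyms, or lemmatization), all function names are hypothetical, and treating contains as "every query token occurs in the span" is an assumption based on the table description of the `/program` query.

```python
import re


def local_name(name: str) -> str:
    """Drop a namespace prefix such as 'bk:' from a tag or attribute name."""
    return name.split(":", 1)[-1]


def tokenize(text: str) -> list[str]:
    """Simplified stand-in tokenizer: lowercased runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_match(span: str, query: str) -> bool:
    """True if every query token occurs somewhere in the span (assumed semantics)."""
    span_tokens = set(tokenize(span))
    return all(token in span_tokens for token in tokenize(query))


def complete_match(span: str, query: str) -> bool:
    """True only if span and query yield the same tokens in the same order."""
    return tokenize(span) == tokenize(query)


title = "The Lemon Table"
print(local_name("bk:title"))                      # name without its prefix
print(contains_match(title, "lemon"))              # one token of the span is enough
print(complete_match(title, "lemon"))              # span has extra tokens
print(complete_match(title, "the lemon table"))    # same tokens, same order
print(complete_match(title, "table lemon the"))    # same tokens, wrong order
```

Under these assumptions, only the fourth comparison is a complete match, because both the token set and the token order agree.
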
The subset of XPath that is implemented in XML search differs from standard XPath in the following ways:
  • It does not support iteration and ranges in path expressions.
  • It eliminates filter expressions: that is, it allows filtering only in the predicate expression, not in the path expression.
  • It disallows absolute path names in predicate expressions.
  • It implements only one axis (tag) and allows propagation only in the forward direction.

The following table lists some valid XML search queries.

Table 1. Valid XML search queries
Query | Description
/ | The root node; any document
/sentences | Any document with a top-level tag of sentences
//sentences | Any document with a sentences tag at any level
sentences | Any document with a sentences tag at any level
/sentence/paragraph | Any document with a top-level tag of sentence having a direct child tag of paragraph
/sentence/paragraph/ | Any document with a top-level tag of sentence having a direct child tag of paragraph
/book/@author | Any document with a top-level book tag having an attribute author
/book//@author | Any document with a top-level book tag having a descendant tag at any level with attribute author
/book[@author contains("barnes") and @title contains("lemon")] | Any document with a top-level book tag with the attributes author and title with values that contain the normalized strings shown
/book[@author contains("barnes") and (@title contains("lemon") or @title contains("flaubert"))] | Any document with a top-level book tag with the specified author attribute and either of the two specified title attributes
/program[. contains("""hello, world.""")] | Any document with a top-level program tag containing at least the tokens hello and world
/book[paragraph contains("flaubert")]//sentence | Any document with a top-level book tag that has a direct child tag paragraph containing "flaubert" and that also has, relative to the book tag, a descendant tag sentence at any level
/auto[@price <30000] | Any document with a top-level auto tag having an attribute price with a numeric value that is less than 30000
//microbe[@size <3.0e-06] | Any document containing a microbe tag at any level with a size attribute with a value that is less than 3.0e-06
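
To make the last two rows of the table concrete, the following sketch evaluates the same conditions against small sample documents with Python's standard `xml.etree.ElementTree` module. It only imitates the matching rules as the table describes them and is not how XML search evaluates queries. The sample documents and function names are hypothetical, and rejecting a non-numeric attribute value is an assumption.

```python
import xml.etree.ElementTree as ET


def attr_below(element: ET.Element, attr: str, limit: float) -> bool:
    """True if the element has the attribute and its numeric value is below limit."""
    value = element.get(attr)
    if value is None:
        return False
    try:
        return float(value) < limit
    except ValueError:
        return False  # assumption: a non-numeric value never satisfies the predicate


def top_level_match(xml_text: str, tag: str, attr: str, limit: float) -> bool:
    """Imitates /tag[@attr <limit]: the condition must hold on the root element."""
    root = ET.fromstring(xml_text)
    return root.tag == tag and attr_below(root, attr, limit)


def any_level_match(xml_text: str, tag: str, attr: str, limit: float) -> bool:
    """Imitates //tag[@attr <limit]: the condition may hold at any level."""
    root = ET.fromstring(xml_text)
    # iter(tag) visits the root and all of its descendants that carry this tag.
    return any(attr_below(element, attr, limit) for element in root.iter(tag))


cheap_auto = '<auto price="25000"><model>compact</model></auto>'
nested_auto = '<garage><auto price="25000"/></garage>'
small_microbe = '<sample><microbe size="2.5e-06"/></sample>'

print(top_level_match(cheap_auto, "auto", "price", 30000))
print(top_level_match(nested_auto, "auto", "price", 30000))
print(any_level_match(small_microbe, "microbe", "size", 3.0e-06))
```

The nested document fails the top-level check because its root tag is `garage`, which mirrors the difference between a query that starts with `/` and one that starts with `//`.
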
Note: The following wildcard path expressions are unsupported in the XML search syntax:
  • /*
  • //*
  • /@*
  • //@*
A plain search does not search the attribute field in the XML document.
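
An application that assembles XML search queries could screen them for the unsupported wildcard forms listed in the note before submitting them. The check below is a minimal sketch. The function name is hypothetical, and the pattern is deliberately naive: it would also flag a `/*` sequence that appears inside a quoted search term.

```python
import re

# Matches '/*' and '/@*'; '//*' and '//@*' are caught because they contain those forms.
_UNSUPPORTED_WILDCARD = re.compile(r"/@?\*")


def uses_unsupported_wildcard(query: str) -> bool:
    """True if the query contains a wildcard tag or attribute step."""
    return _UNSUPPORTED_WILDCARD.search(query) is not None


for query in ("/book/*", "//@*", "/book/@author", '/book[@title contains("lem*")]'):
    print(query, uses_unsupported_wildcard(query))
```

A wildcard character inside a search term, as in the last sample query, is not flagged, because wildcard characters in terms are among the supported search features.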