With the support of natively storing XML documents, IBM® DB2® 9 takes an evolutionary step towards a hybrid database system. XQuery, a new primary language in DB2, allows for optimally working with XML. However, XML also often contains large portions of text that are not easily searchable with XQuery. To efficiently search these unstructured parts of the documents, DB2 provides a full-text search solution. This article focuses on the basics of DB2 text search, including text index creation and administration. This article also explores specific features of DB2 full-text search, such as searching for keywords or phrases in the XML document structure with fuzzy search, proximity search, and stemming.

Share:

Holger Seubert (hseubert@de.ibm.com), DB2 Text Search Development, IBM, Software Group

Holger SeubertHolger Seubert is an IBM Software Engineer who focuses on XML technology in DB2. Being located in IBM's development laboratory in Boeblingen, Germany, he started working on mapping technologies of XML documents to and from IBM's hierarchical database system (IMS). Holger is now working on the design and implementation of full-text search solutions in DB2 on XML.



Sabine Perathoner-Tschaffler (spera@de.ibm.com), DB2 Text Search Development, IBM, Software Group

SabineSabine Perathoner-Tschaffler is a Software Engineer at IBM's Boeblingen Laboratory. She works in the area of full-text search in DB2 with focus on XML enablement.



29 June 2006

Also available in Chinese Vietnamese

XQuery, the W3C standard for accessing XML data, offers many different expressions for navigating and querying XML documents. Can you also use the power of this language to find XML documents based on full-text search criteria? The answer is "not yet." There is no support for full-text search features in XQuery today. The reason for this lies in the XQuery Data Model, which serves as the basis for every XQuery expression. In order to be able to efficiently search full-text data, information about the individual words making up the textual content of the data and their location is needed. However, in the XQuery Data Model, text is simply a "black box." The data model has no notions of words or tokens inside a text node; it just represents the textual content of an element or attribute as one contiguous node.

Therefore, XQuery can offer nothing better than sub-string matching. This functionality is implemented with the contains() built-in function, which takes a context and a search term as parameters.

As DB2 9 supports XQuery as primary language, the following XQuery expression can be executed in a DB2 command window:

db2 => XQuery contains("The power of XML", "owe")

As previously mentioned, the contains() function implements sub-string matching and therefore returns the value true for the above query. Although the user searches for the term "owe," results for "power" and so on are also returned by contains(), which may lead to search results you are actually not looking for. There is no standard yet to include text search functionally into the XQuery language. Clearly, this is a post XQuery, Version 1 task that has already started in a separate working group: XQuery 1.0 and XPath 2.0 full text.

While waiting for this new standard, what can you do today to find relevant XML documents based on keywords and phrases? The following section gives a brief overview of DB2's full-text search functionality that makes structure-aware text search on XML possible.

DB2 Net Search Extender in a nutshell

When it comes to text searching XML documents, both the actual content and the document structure is considered. A typical scenario is to limit text search to a specific element or sub-tree of the overall XML document structure, like:

"Find all documents with the term 'XML' and 'database' in the abstract element, the phrase 'text search' and 'Viper Release' in the title element, and both terms should be in the same sentence."

Such a structure-aware, full-text search on XML documents in DB2 is implemented with DB2 Net Search Extender (NSE), a separately installed feature that ships with DB2.

Text search is integrated into SQL and optimized by the DB2 optimizer for run time. Administrative tasks, such as index creation and maintenance, can be performed using the DB2 Control Center.

Using SQL, Net Search Extender searches textual data that is stored in the column of a database table. Rather than sequentially searching through the text data using string matching, as done with the XQuery contains() function, Net Search Extender uses a text index. Using a text index makes document search much more efficient, as the text data does not need to be examined at query time like it is done with sub-string matching. A text index typically consists of information about relevant terms that are extracted from the text documents. In XML cases, not only the significant terms, but also their location in the XML document structure is maintained in the text index.

Now that you are familiar with the basic principle of Net Search Extender, you are ready to learn how to use it to perform full-text searches on XML data. Let's explore how to create a text index in the next section.

Getting ready for text search

Tasks that need to be performed to search within XML documents in DB2 can be split into two parts:

  1. Index administration tasks
  2. full-text search

This section steps through both types of tasks shortly using examples that refer to the "product" table as defined in Listing 1. Note that the "description" and "comments" column of the "product" table are of type XML (in other words, the "description" and the "comments" XML documents are stored natively in the database).

Listing 1. Table definition
create table product (
    pid varchar(10) primary key not null,
    description xml,
    comments xml
    );

Sample XML data stored in the "product.description" and "product.comment" columns are shown in Figure 1 and Figure 2 respectively. Subsequent query examples search in specific elements of these XML documents. How to insert the XML documents into DB2 is covered in the article "Get off to a fast start with DB2 Viper" (developerWorks, March 2006)..

Figure 1. Sample data of "product.description" column
Product sample XML data
Figure 2. Sample data of "product.comments" column
Comments sample XML data

Let's first consider what needs to be done to perform text search operations on the "product" table, like searching for the name of a product with the term "satisfactory" in the "message" element of its comments or a product with the phrase "sugar and creamer" in the "details" element of its description.

Before you can use Net Search Extender, you have to start its instance services, as shown in Listing 2.

Listing 2. Command for starting the Net Search Extender instance services
db2text start

Note that this command invokes a separate executable called db2text. As you will see, any text index administrative task needs to be done with the command db2text.

After starting Net Search Extender, enable the DB2 database for text search operations. This step creates necessary administration tables and various User Defined Functions (UDFs) and Stored Procedures (STPs) that are needed for full-text search on DB2 data and is executed only once per database.

The command for enabling a database "test" for text search operations looks as follows:

Listing 3. Command for enabling the database "test" for text search operation
db2text enable database for text connect to test

In case you have not set the DB2DBDFT environment variable, all Net Search Extender administrative tasks require the connect to <dbname> clause, where <dbname> is the name of your database. The following commands assume that DB2DBDFT is set.

After enabling a database for text, you can create text indexes on columns storing textual data in various formats, such as XML. The create index command establishes the text index infrastructure by defining the text index attributes, such as update frequency, a document format, an index directory, and so on. For more information on the various index attributes, refer to the Net Search Extender documentation. The example used in this article uses a basic text index creation, as shown in Listing 4:

Listing 4. Creation of a text index ix1 on "product.description"
 db2text create index ix1 for text on product(description)

The command in Listing 4 creates a full-text index named "ix1" on the XML documents that are stored natively in column "description" of table "product."

Note that after you create a text index, text search does not find any results, as the text index contains no data at this point. Your data and the text index need to be explicitly synchronized if no update frequency is specified. The task of synchronizing the text index is called index update and typically follows right after you have created a text index.

Listing 5. Synchronize the text index with the data in the base table
db2text update index ix1 for text

The update index process can either be done manually using the command shown in Listing 5 or automatically scheduled using a background daemon. An automatic index update can be specified as an index attribute during the index creation or specified later using an alter index command.

Summarized below are the steps that need to be completed to prepare full-text search operations in DB2:

Table 1. Steps to prepare full-text search in DB2
StepsDescriptionExample
Step 1Start Net Search Extender instance servicesdb2text start
Step 2Enable your database for full-text searchdb2text enable database for text
Step 3Create a full-text indexdb2text create index ixName for text on table(column)
Step 4Update the previously created full-text indexdb2text update index ixName for text

After completing these four administrative steps, you are ready to go on and full-text search your XML data. Now, what are the actual search features that can be used to find relevant information?


Find what you are looking for

The most common way of performing full-text search with Net Search Extender is to use its SQL scalar search function CONTAINS(). Use this where standard SQL would be used and where you can easily combine it with other conditions in an SQL WHERE clause. Additionally, a tight integration into the DB2 optimizer provides for strong run time performance. (Two other search methods are also available for more specialized situations, but this article does not discuss them.) The following sample shows how to use the scalar CONTAINS() function to issue a full-text search:

Listing 6. The scalar CONTAINS function
SELECT column
FROM table
WHERE CONTAINS(column-name, 'search-criteria')=1;

The CONTAINS() function takes two parameters: the column name, establishing the context in which full-text search should occur, and the actual search criteria that should be applied.

Among the basic search criteria of Net Search Extender are the following features:

  • Boolean operations for conjunction (AND), disjunction (OR), and exclusion (NOT) of search terms
  • Individual search terms may be single words or phrases, for example "New" York" or "New York"
  • Proximity search for words in the same sentence or the same paragraph
  • Fuzzy search for words with a similar spelling as the search term
  • Wildcard search, using front, middle, and end masking

More advanced search features offered by Net Search Extender are:

  • Thesaurus support for broader queries, by searching not only for a specific search term, but also for terms that are related to it with user-defined relations
  • Search on numeric ranges, which could be either in structured documents or within additional columns
  • Stemming to reduce the search term to its word stem before the search is carried out (English only)

For realizing a targeted search within the document structure of XML documents, Net Search Extender offers a section based search by:

  • limiting search to XML elements
  • limiting search to XML attributes
  • supporting mixed content type of XML elements

Now that you have seen the basic ideas of the search features offered by Net Search Extender, let's explore their actual use by example.

Text Search using SQL and XQuery

The following queries access the "product" table that was introduced in Listing 1. Sample XML data stored in the "product.description" column is shown in Figure 1. Subsequent query examples search in specific elements of this XML column.

Consider the following simple text search on the "product" table that returns all product identifiers ("pid" data) with the term "creamer" somewhere in the document structure of the product information (in column "product.description"):

Listing 7. Basic syntax for text search
SELECT pid
FROM product
WHERE CONTAINS(description,' "creamer" ')=1;

Using the SQL/XML function XMLQUERY(), individual XML elements of the product information can be projected into the result. The following query returns the "name" element of the product that has the term "creamer" somewhere in the document structure:

Listing 8. Use of text search functionality in combination with XMLQUERY()
SELECT XMLQUERY('$prod//name' passing description as "prod")
FROM product
WHERE CONTAINS(description, ' "creamer" ')=1;

The same SQL full-text search query, as shown in Listing 8 above, can be expressed in XQuery as follows:

Listing 9. Text search in the XQuery context
XQUERY for $prod in db2-fn:sqlquery("SELECT description
  FROM product
  WHERE CONTAINS (description, ' "creamer" ')=1")
return $prod//name

The full-text search expression is expressed in an SQL context, and the relevant XML documents are processed in the XQuery context using the db2-fn:sqlquery() input function.

In the previous examples, text search is performed on the complete XML document structure (in other words, in all the text nodes and attribute values regardless of where the search criteria are met in the document). What about limiting text search to specific XML elements or XML attributes?

Limit text search to XML elements or attributes

Limiting text search to specific parts within the document structure is one of the most frequently used features for XML full-text search. In DB2 Net Search Extender, limiting text search to specific elements or sub-trees of the XML documents is expressed by a fully qualified XPath. This XPath identifies the part of the XML document structure that should be searched.

Net Search Extender supports the abbreviated XPath location-step syntax and allows for the use of the child axis (/) and the attribute axis (@). Other XPath expressions or functions are not supported.

For example, text search for the terms "creamer" and "pattern" can be limited to the "name" element of the product XML document in "product.description" with the following query:

Listing 10. Limit search to XML elements or attributes with the SECTION clause
SELECT XMLQUERY('$prod//name' passing description as "prod")
FROM product
WHERE CONTAINS(description,'
             SECTION("/product/description/name") ("creamer","pattern") ')=1;

The SECTION clause as part of the text search criteria identifies the part within the XML document structure where text search shall occur. With location steps, as defined in the XPath standard, the XML "name" element within the product XML structure is identified in the above example. The SELECT statement returns all name elements of the products with the term "creamer" and the term "pattern" in any sequence in the respective "name" element.

With XQuery as primary language, the same query looks as follows:

Listing 11. XQuery as primary language
XQUERY for $prod in db2-fn:sqlquery("SELECT description
     FROM product
     WHERE CONTAINS(description,
                    ' SECTION("/product/description/name") ("creamer","pattern" ')=1)
  return $prod//name

Additionally, text search can also be limited to specific XML attributes, using the attribute axis, as shown in the following example:

Listing 12. Limit text search to XML attributes
SELECT XMLQUERY('$prod//name' passing description as "prod")
FROM product
WHERE CONTAINS(description,'
                SECTION("/product/description/category/@catx")"sterling" ')=1;

The SECTION clause specifies to limit text search to the "catx" XML attribute using the corresponding XPath. The "name" element of all products with the term "sterling" as value of the "catx" XML attribute are returned.

Listing 12 shows how search for different terms can be done in the same section. Search within different sections can also be combined. Using the Boolean AND operator (&), the following query in Listing 13 returns all "name" elements of those products that have the term "sterling" in the "catx" attribute of element "category" and the term "creamer" in the "name" element:

Listing 13. Searching in different sections
SELECT XMLQUERY('$prod//name' passing description as "prod")
FROM product
WHERE CONTAINS(description,' SECTION("/product/description/category/@catx")"sterling"
                    & SECTION("/product/description/name")"creamer" ')=1;

Searching across element boundaries

Sometimes you encounter XML elements with a mixed content type that contains textual data and also other XML elements. For example, the following XML element "sentence" contains some text and another element "entity":

Figure 3. XML element of a mixed content type
Sample of mixed content type

In this case, you want to search across element boundaries (in other words, searching for the phrase "IBM Software Group" in the element "sentence" should find results, although parts of the search term occur in different elements). Net Search Extender is capable of handling mixed content -- searching across element boundaries.

Considering the above XML fragment is stored in a column "description" in table "product," the following query returns a result:

Listing 14. Query results
SELECT description
FROM product
WHERE CONTAINS(description,'
               SECTION("/sentence") "IBM Software Group" ')=1;

Although the token "IBM" is the direct Text Node of element "entity," and the scope for the text search is XML element "sentence," the query returns a hit.

So far, the power of full-text indexes is not fully exploited yet. Having seen examples of how to limit search to specific parts in the document structure, Net Search Extender offers more ways of expressing search criteria.

Exploiting the power of text search

In addition to limiting text search to specific elements or attributes, Net Search Extender offers several other features for defining text search criteria. For example, you can limit text search to terms that match only if they occur in the same sentence (proximity search).

The following sections provide examples of the different search features that were previously introduced.

The query in Listing 15 finds all product identifiers of those products with the terms "pattern" and "sugar and creamer" not only in the "details" element of the XML document, but also in the same sentence:

Listing 15. Proximity search
SELECT pid
FROM product
WHERE CONTAINS(description, 'SECTION("/product/description/details") "pattern"
               IN SAME SENTENCE AS "sugar and creamer"')=1;

Two structural constraints are combined in this example: search is limited to a specific element ("details"), and within the textual content of that element (its text node), search is also limited to terms occurring in the same sentence. Note that one search term, "sugar and creamer," comprises a phrase in which the single words "sugar," "and," and "creamer" are searched for in exactly this sequence. On the other hand, searching for the tokens "creamer," "and," and "sugar" in any sequence in the same context can be expressed as follows:

Listing 16. The difference between phrase search and token search
SELECT pid
FROM product
WHERE CONTAINS(description, 'SECTION("/product/description/details") "pattern"
                IN SAME SENTENCE AS ("creamer","and","sugar")')=1;

Boolean operators

Different search terms can be combined with other search terms using the Boolean operators "&" (AND), "|" (OR), and NOT. The following example combines several terms by using the Boolean operators AND and OR:

Listing 17. Using Boolean operators AND and OR
SELECT pid
FROM product
WHERE CONTAINS(description, 'SECTION("/product/description") "pattern" & "creamer" | "sugar" ')=1;

This query returns all "pid"s of those products that have the term "pattern" and "creamer" or "sugar" in the description element.

The Boolean operator NOT can be used to exclude particular terms from search results. For example, the following query in Listing 18 searches for documents having the term "pattern" and "creamer" and not "sugar" in the "description" element:

Listing 18. Boolean operator NOT
SELECT pid
FROM product
WHERE CONTAINS(description,
         'SECTION("/product/description") "pattern" & "creamer" & NOT "sugar" ')=1;

Fuzzy search

Another popular feature is fuzzy search. Fuzzy search finds documents that contain the search term spelled in a similar way to the submitted search term. Thereby, the first three characters need to match. It is often used when misspellings are possible in the document.

Listing 19. Fuzzy search to find similar spelled terms
SELECT pid
FROM product
WHERE CONTAINS(description,
      'SECTION("/product/description") FUZZY FORM OF 80 "patern" & "creamer"')=1;

Notice that the term "patern" is misspelled, but with a fuzzy search, documents containing the correctly spelled term "pattern" are found. The match level, in this example 80, specifies the degree of accuracy. The degree of accuracy can vary between 1 and 100, whereby 100 is an exact match.

Stemming

Searching for the stemmed form of a term causes the term to be reduced to its word stem prior to the search. The following example searches for the fuzzy form of "patern" or the stemmed form of "creamy." The stemmed search returns documents having terms like "creamer," "creamed," or "cream" in their textual content.

Listing 20. Stemmed search for "creamy"
SELECT pid
FROM product
WHERE CONTAINS(description, 'SECTION("/product/description")
           FUZZY FORM OF 42 "patern" | STEMMED FORM OF "creamy"')=1;

Wildcard

Net Search Extender supports two kinds of character masking for wildcard search:

  • mask any single character (_)
  • mask any number of arbitrary characters (%)

A sample query using wildcard search is shown in Listing 21.

Listing 21. Search with wildcards
SELECT pid
FROM product
WHERE CONTAINS(description, 'SECTION("/product/description") "pat_ern" & "cream%" ')=1;

The query will match documents that contain the terms "pattern" and "creamer," for example, in the "description" element.

One thing you always need to consider...

Maybe you are wondering why we have not searched within the "comments" column of the "product" table so far. The reason for this is to put emphasis on one important aspect of searching within XML documents: Always the complete XML document is returned as a search result, not the specific document part or element where the hit occurs.

This aspect is addressed on the basis of the "comments" XML document shown in Figure 2. These documents have, compared to the "product" documents, a different structural characteristic. Within a "comments" element, there are multiple, repeating "comment" elements.

For example, searching for the name of a product with the stemmed form of the term "disappointing" in the "message" element of its comment can be expressed as follows:

Listing 22. Search for product name with stemmed form of "disappointing" in "message" element of its comment
SELECT XMLQUERY('$prod//name' passing description as "prod")
FROM product
WHERE CONTAINS(comments, 'SECTION("/comments/comment/message")
                          STEMMED FORM OF "disappointing"')=1

The query returns the name of the product in the following manner:

<name>Sterling Sugar & Creamer by Gorham</name>
   1 record(s) selected.

Search on XML documents with repeating document parts works without any further considerations, as long as you are not directly accessing the repeating parts, like the "comment" elements on the "comments" column in our example.

However, if you would like to retrieve the "customerID" of the customer who made a comment with the term "disappointment" in the message element, a first approach to find this information could look as follows:

Listing 23. Retrieve "customerID" of customer who made comment with term "disappointment" in message element
XQuery for $co in db2-fn:sqlquery("SELECT comments
                    FROM product
                    WHERE CONTAINS(comments,
                         'SECTION("/comments/comment/message")
                          STEMMED FORM OF "disappointed"')=1")
       return $co//customerID

The returned sequence of XML elements is:

Listing 24. Returned sequence of XML elements
<customerID>1187594</customerID>
<customerID>238461</customerID>

  2 record(s) selected

As you can see, also the "customerID" of the customer who has not made a comment, including the stemmed form of "disappointed," is returned.

Similarly, the same query can be expressed in an SQL context as follows:

Listing 25. Same query expressed in an SQL context
SELECT XMLQUERY('$co//customerID' passing comments as "co")
FROM product
WHERE CONTAINS(comments, 'SECTION("/comments/comment/message")
                          STEMMED FORM OF "dissapointed"')=1

The returned sequence looks like:

<customerID>1187594</customerID><customerID>238461</customerID>

  1 record(s) selected.

Again, both "customerID" elements are returned by the search query, as search always returns the complete XML document. This aspect is important to consider when working with XML search results. However, what can you do to retrieve the single "customerID" you are looking for?

In cases where no advanced search functionality, like stemming, is used, it is possible to use a workaround using the contains() XQuery built-in function that was previously introduced.

The following query uses full-text search information to filter those documents having the term "disappointing" in the "message" XML element. The search for those documents returns the complete "comments" documents that are satisfying the search condition. In contrast to the previous queries, the following query adds a predicate on the message element using the contains() XQuery function. That is, those documents satisfying the full-text search criteria are additionally filtered based on the XPath predicate.

Listing 26. Full-text search criteria additionally filtered based on the XPath predicate
SELECT XMLQUERY('$co/comments/comment[contains(message,"disappointing")]
                    /customerID' passing comments as "co")
FROM product
WHERE CONTAINS(comments, 'SECTION("/comments/comment/message")
                          "disappointing"')=1;

The same query as above can also be expressed in an XQuery context:

Listing 27. Same query expressed in XQuery context
XQuery for $co in db2-fn:sqlquery('SELECT comments FROM product
         WHERE CONTAINS(comments,
        ''SECTION("/comments/comment/message") "disappointing" '')=1')
       return $co/comments/comment[contains(message,"disappointing")]/customerID

Both queries return the desired "customerID" element of the customer who has made the particular comment.

<customerID>238461</customerID>

  1 record(s) selected

XML namespace considerations

In cases where the XML data belongs to a specific non-default namespace, the fully qualified name (name space prefix + element name) has to be used when searching in specific elements or sub-trees in the document structure.

Assume that the elements of the XML product data belong to a namespace with prefix "ns" as follows:

Figure 4. XML product data with namespace
XML product data with namespace

Referring to a previous search example, Net Search Extender does not find any results in this case:

Listing 28. Net Search Extender finds no results
SELECT pid
FROM product
WHERE CONTAINS(description,
           'SECTION("/product/description/details") "pattern"')=1;

The XPath expression used to define the XML part to search in does not contain the namespace prefix -- the fully qualified element name (QName), in other words.

You must use the fully qualified XML element names in the SECTION parameter to find our sample document:

Listing 29. Use the fully qualified XML element names in SECTION parameter
SELECT pid
FROM product
WHERE CONTAINS(description,
           'SECTION("/ns:product/ns:description/ns:details")
           "pattern"  IN SAME SENTENCE AS "sugar and creamer"')=1;

Net Search Extender matches element/attribute names as they appear in the document, which includes any namespace prefix. However, if you are using a default namespace, you must not prefix the elements!

A method to configure how to address individual sections within an XML document structure is realized with a Net Search Extender document model. Its use is explained in the following section.

The NSE document model demystified

You can use a Net Search Extender document model to configure the scope of search within structured documents, such as XML. A document model primarily controls what parts of a document's structure are indexed, how these parts are indexed, and by which name you can refer to these document parts during search.

There are two types of document models:

  • The default document
  • A user defined custom document model

To use a custom document model, you have to specify one at text index creation time. If no custom document model is supplied as index parameter, Net Search Extender uses its default document model.

So far, the examples in this article worked with the default document model supplied by Net Search Extender. Characteristics of the default document model are:

  • All parts of the document are indexed (for XML documents, this includes all elements and attributes)
  • Search can be refined to specific document parts using an XPath-like syntax
  • No numeric values are indexed, thus, numeric comparisons or range searches aren't supported

Using a custom document model allows for:

  • Defining which parts of the XML document shall be indexed, and which parts should be excluded
  • Defining custom names for certain XML sub-trees, XML elements, or attributes
  • Defining XML element or attribute content to be numeric, allowing for numeric comparisons or range searches on this content

The document model itself is an XML document following a pre-defined grammar (for the grammar definition, see the Net Search Extender documentation). The document model, defined in a file, specifies a model for parsing and indexing structured documents of format XML, for example. The model file is passed as a parameter during index creation and is only read during index creation; later changes do not affect existing indexes.

Figure 5 defines a custom document model on the basis of the product data (refer to Listing 1). The document model is stored in a file nsemodel.xml.

Figure 5. A custom document model
NSE custom document model

Every XML document model starts with the "XMLModel" element. Two XML child elements are allowed for XMLModel:

Table 2. NSE XML modelfile elements
Modelfile elementDescription
XMLFieldDefinitionThis element defines a custom name for a specific part, element, or attribute of the XML document that is identified using the "locator" attribute of the "XMLFieldDefinition"
XMLAttributeDefinitionThis element defines an NSE attribute based on an XML element or XML attribute that can be used for numeric (parametric) searches

The locator attribute

Both model file elements, "XMLFieldDefinition" and "XMLAttributeDefinition," identify the part of the source XML document with the "locator" attribute. The locator attribute supports the child axis (/), descendant-and-self-axis (//), and attribute axis (@).

Moreover, a model file supports also the use of wildcards (*) in the specification of locations. Example (1) in Listing 30 defines the locator attribute that identifies all child elements of the "product" element. In example (2), the location attribute refers to all child elements of "product" belonging to the specific namespace "ns."

Also, specific node types can be identified by a locator attribute. Net Search Extender supports comment nodes (example 3 in Listing 30) and processing-instruction (PI) modes. Additionally, a union of elements can be defined as shown by example 4.

Listing 30. Examples for locator attribute values
(1) locator="/product/*"
(2) locator="/product/ns:*"
(3) locator="/product/comment()
(4) locator="name | details"

To summarize, the locator attribute defines the document part that should be indexed. You can use a sub-set of XPath expressions to specify locator attributes:

  • Child axis ('/' or 'child::')
  • Descendent-and-self axis ('//')
  • Attribute axis ('@' or 'attribute::')
  • Comment type node test ('comment()') and PI type node test ('processing-instruction()')
  • Union of nodes ('|')
  • Wildcard ('*')

The name attribute

The "name" attribute of an "XMLFieldDefinition" or "XMLAttributeDefinition" element defines the name used to refer to the part of the XML document structure identified by the value of the "locator" attribute.

For the definition of names, three special variables are available that support an automatic generation of a name value:

Table 3. Name attribute special values
Attribute valueDescription
name="$(NAME)"Represents the qualified name (QName) of the XML element or attribute that is identified by the XPath expression of the locator attribute
name="$(LOCALNAME")Represents the local name (without a namespace prefix) of the XML element that is identified by the XPath expression of the locator attribute
name="$(PATH)"Represents the absolute path to the XML element or attribute identified by the XPath expression of the locator attribute

Bearing the characteristics of the default document model in mind, it is defined as:

Figure 6. Definition of the default document model
Definition of the default document model

The type attribute

The "type" attribute can only be used in combination with an "XMLAttributeDefinition." The only value allowed is NUMBER, specifying that the element or attribute identified by the locator attribute can be used for parametric search.

Index definition with a custom document model

Once you have defined a custom model file, it can be used for index creation. The definition of a text index with a custom document model looks as follows:

Listing 31. Defining a text index with a custom document model
db2text create index ix2 for text on product(description)
                   format xml documentmodel XMLModel in nsemodel.xml

The document model parameter specified at index creation time specifies the root element in the model file. In case of XML, the root element is <XMLModel>. Additionally, you need to specify the location (filename of the model file).

After you create the new index "ix2" based on a custom document model and update the newly created index, you can start searching the document structure based on the rules outlined in the document model.

Search in the structure with a custom document model

According to the characteristics of the custom document model, the part of the document structure where search shall occur is identified by the name, as defined in the document model. Any other document parts that are not identified in the document model by corresponding locator attributes are not indexed and therefore cannot be searched.

For example, the following query returns a result by limiting search to the document part identified by name "description" in the document model of Figure 5:

Listing 32. Query results
SELECT pid
FROM product
WHERE CONTAINS(description,'SECTION("description") "pattern"')=1;

However, issuing the following query, based on an index defined with the custom document model in Figure 5, will return no results, although the same part of the document is referenced:

Listing 33. Query results
SELECT pid
FROM product
WHERE CONTAINS(description, 'SECTION("/product/description/details") "pattern"')=1;

Parametric search

Based on the "XMLAttributeDefinition" in the model file in Figure 5, parametric search on the values of element "price" is possible. The following query finds products that have a price ranging between 50 and 120:

Listing 34. Parametric search
SELECT pid
FROM product
WHERE CONTAINS(description, 'ATTRIBUTE "price" BETWEEN 50 AND 120')=1;

What if you would like to find products with a price larger than 120 US dollars, taking also the currency into account? Looking at the sample XML document in Figure 1, the currency is available as an attribute of XML element "price." In addition to range searches using the keyword "BETWEEN," parametric search is also possible with the operators GREATER THAN ('>') and LESS THAN ('<'). Taking the currency into account requires the use of the XMLEXISTS() SQL/XML function:

Listing 35. Using the XMLEXISTS() SQL/XML function
SELECT pid
FROM product
WHERE CONTAINS(description, 'ATTRIBUTE "price" > 120')=1 AND
      XMLEXISTS('$prod/product/description/price[@currency="us"]'
      passing description as "prod");

Summary

DB2 Net Search Extender provides a flexible query language to find relevant information based on complex retrieval operations involving Boolean combination, wildcard, and fuzzy search, for example. Thereby, search can be limited to specific XML elements or XML attributes in the document structure. Exploiting the power of a Net Search Extender custom document model allows for additional configuration of structure-aware search in DB2.


Acknowledgements

Thanks to Cindy Saracco, Bert van der Linden, Stefan Momma, and Michael Haide for their reviews of this article.

Resources

Learn

Get products and technologies

  • Download the DB2 9 test drive and Net Search Extender to try out the query techniques described in this article.
  • Build your next development project with IBM trial software, available for download directly from developerWorks.

Discuss

Comments

developerWorks: Sign in

Required fields are indicated with an asterisk (*).


Need an IBM ID?
Forgot your IBM ID?


Forgot your password?
Change your password

By clicking Submit, you agree to the developerWorks terms of use.

 


The first time you sign into developerWorks, a profile is created for you. Information in your profile (your name, country/region, and company name) is displayed to the public and will accompany any content you post, unless you opt to hide your company name. You may update your IBM account at any time.

All information submitted is secure.

Choose your display name



The first time you sign in to developerWorks, a profile is created for you, so you need to choose a display name. Your display name accompanies the content you post on developerWorks.

Please choose a display name between 3-31 characters. Your display name must be unique in the developerWorks community and should not be your email address for privacy reasons.

Required fields are indicated with an asterisk (*).

(Must be between 3 – 31 characters.)

By clicking Submit, you agree to the developerWorks terms of use.

 


All information submitted is secure.

Dig deeper into Information management on developerWorks


static.content.url=http://www.ibm.com/developerworks/js/artrating/
SITE_ID=1
Zone=Information Management, XML
ArticleID=144053
ArticleTitle=XML full-text search in DB2
publish-date=06292006