Faceted navigation for document discovery

Using metadata for better search

Level: Introductory

Alan Marwick (marwick@us.ibm.com), Technical Competency Lead, Knowledge Management, IBM US Federal CTO office, IBM 

14 Feb 2008

While there are several different ways for a user to specify metadata conditions, this article discusses one that has special advantages: faceted navigation. Follow the faceted navigation system described in this article, a technology demonstrator based on IBM® Omnifind™ Discovery Edition that exploits the XML capabilities of IBM DB2®, to explore the advantages of faceted navigation, and see how to get the maximum benefit from metadata creation.

Better search with faceted navigation

Text search is one of the most important ways that users of enterprise content can find the documents they need. Unfortunately, there are a number of reasons why enterprise text search systems often work less well than search of the public Internet (Enterprise Search: Tough Stuff, Rajat Mukherjee and Jianchang Mao. ACM Queue vol. 2, no. 2, April 2004). The main reason is that most enterprise content isn't cross-linked, so the search system doesn't have the page rank information that identifies "good" pages on a topic. On the other hand, Internet search doesn't make much, or any, use of metadata because most Internet content has minimal or no metadata. In contrast, enterprise content often has metadata associated with it. In fact, many organizations are significantly investing in the creation of standardized metadata, such as the Defense Discovery Metadata Standard (DDMS), defined by the US Department of Defense, and the Cross-Enterprise Document Sharing (XDS) framework, created by the IHE, a health industry consortium. (See Resources for more information on each of these standardized metadata methods.)

Thus there is an opportunity to improve search within enterprises by using metadata. Metadata is helpful because it allows users to specify conditions that any retrieved document must meet. For example, in addition to a keyword search query such as "diesel pollution" or "culvert bomb Anbar Province," users might use metadata conditions to specify that they are only interested in documents created in the last three months by a specific author. Typical Internet search systems can't do that.

While there are several different ways for a user to specify metadata conditions, this article is about one that has special advantages: faceted navigation. The faceted navigation system described here, nicknamed Croton, is a technology demonstrator based on IBM® Omnifind™ Discovery Edition that exploits the XML capabilities of DB2. The whole Croton system runs on a single laptop.



Example of faceted navigation

The faceted navigation search interface of Croton is shown in Figure 1. At the top is the familiar search box, into which the user has entered her query: "cafe". On the left, under the heading "Refine by," are the facets "Incident Date" and "Geography". Each of these is the head of a different hierarchy of values, which can be expanded as a tree in the user interface. The user has selected "Iraq" then "Kirkuk". The final effect is to limit the documents in the results list to those that have "cafe" in their text and whose Geography facet has the value "Geography > Middle East and Persian Gulf > Iraq > Kirkuk".

This example illustrates three key features of faceted navigation:

  1. The user selects metadata conditions by clicking on values presented by the application.
  2. Only metadata values that lead to documents are presented. A user will never get an empty results list by clicking on a metadata condition.
  3. The metadata is organized in several independent categories, called facets, each of which can take a number of values. The values can be organized hierarchically in a taxonomy or in some other way, such as grouping dates into ranges, as in the example.
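
The combined effect of a keyword query and a selected facet value can be sketched in a few lines of Java. This is only an illustration of the idea, not how Omnifind Discovery Edition is implemented; the FacetFilter and Doc classes and their fields are hypothetical names introduced for this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: keep documents whose text contains the keyword and
// whose Geography path lies under the path the user selected.
final class FacetFilter {

    // Hypothetical document: free text plus one hierarchical facet value.
    static final class Doc {
        final String text;
        final List<String> geography; // path from the top of the taxonomy

        Doc(String text, String... geography) {
            this.text = text;
            this.geography = Arrays.asList(geography);
        }
    }

    // True if 'path' begins with every element of 'selected', in order.
    static boolean underSelection(List<String> path, List<String> selected) {
        return path.size() >= selected.size()
            && path.subList(0, selected.size()).equals(selected);
    }

    // Case-insensitive substring match stands in for real text search.
    static List<Doc> search(List<Doc> docs, String keyword, List<String> selected) {
        List<Doc> hits = new ArrayList<>();
        String needle = keyword.toLowerCase();
        for (Doc d : docs) {
            if (d.text.toLowerCase().contains(needle)
                    && underSelection(d.geography, selected)) {
                hits.add(d);
            }
        }
        return hits;
    }
}
```

Selecting a value higher in the hierarchy, such as "Iraq", simply shortens the selected path, so every document below that node still matches.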

Figure 1. Faceted navigation application


Faceted navigation should already be familiar to anyone who has browsed catalogs on the Internet. Catalog items are essentially described only by metadata; in fact, if there's a search box in such an application, it often only searches metadata. In the present case, the items being searched are documents that are described both by metadata and content, but the basic user interface is the same as in the e-commerce case.

The familiarity of faceted navigation, as well as its ease of use by untrained searchers, are significant advantages. Other techniques for incorporating metadata conditions in searches are widely used but have some disadvantages. A common technique is to expand the query language to allow the metadata conditions to be specified as part of the query string. This was the approach used in the first generation of bibliographic search engines, such as IBM's STAIRS. Because of the complexity of the query language and the need to understand the data schema in order to compose the queries, training is required to use these applications. Typically, such applications are used by librarians and other specialists.

Another way to incorporate metadata conditions is to use a query form, whose fields correspond to the elements of the metadata schema. Experience shows that the form is rarely used if there is a simpler interface available. This is consistent with studies of searches by untrained users, such as Web searches, which show that fewer than 10% of queries use advanced features (A Spink, D Wolfram, M. Jansen, and T. Saracevic "Searching the Web: The public and their queries". J. American Society for Information Science and Technology, vol 52, 2001, pp. 226-234).

Both of these approaches suffer from the further disadvantage that it is easy to compose a query that returns no results. This is confusing for users who don't know which conditions of their query to relax in order to get some results.

A faceted navigation approach overcomes these disadvantages. The query language is simple and compatible with Internet search engines. The structure of the metadata schema is explicit, as in a form, but the user doesn't see the detail of any of the facets that he is not interested in. Lastly, it is not possible to select metadata conditions that return zero results because they are not presented to the user. Instead, the user can see how many documents will be left in the results list if that condition is selected (see Figure 1).
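
The reason empty result lists cannot occur is easy to see in code: the values offered to the user are tallied from the documents already in the results list. The following is a minimal sketch under that assumption; FacetCounts and tally are hypothetical names, and the Discovery engine's own tallying is not shown in this article.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: derive the selectable facet values, with counts,
// from the facet values of the documents in the current results list.
final class FacetCounts {

    // One input entry per document in the current results list.
    static Map<String, Integer> tally(List<String> valuesOfCurrentResults) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String value : valuesOfCurrentResults) {
            counts.merge(value, 1, Integer::sum);
        }
        return counts;
    }
}
```

Because a value appears in the map only if at least one current result carries it, every value shown has a count of one or more, and that count is the size of the results list after the value is selected.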

Faceted navigation has a further advantage. From studies of user behavior, we know that people prefer to use an approach to searching that can be described as successive refinement (A Spink, T.D. Wilson, N. Ford, A Foster and D. Ellis "Information seeking and mediated searching study: Part 3. Successive searching". J. American Society for Information Science and Technology, vol 53, 2002, pp. 716-727). That is, they issue an initial broad query that reassures them that the search system gives access to at least some documents in the area they are interested in. They then add more precise conditions until they get what they are looking for. Faceted navigation supports this style of querying very well, which is another reason why it is widely used in catalog searching on e-commerce sites.

The rest of this article describes a proof-of-concept demonstration of a faceted navigation system that searches a collection of documents and its metadata, using IBM Omnifind Discovery Edition. The content used in the demo is a collection of terrorism reports available in XML, which we will store as native XML in DB2 V9. Building this system illustrates the main concepts of faceted navigation in a concrete way and creates a demo that can run on a laptop for evaluations and demonstrations.



The WITS document set

As an example of documents that have extensive metadata associated with them, let's use the WITS document collection. WITS stands for "Worldwide Incident Tracking System," a database of terrorist incidents maintained by the US National Counterterrorism Center (NCTC). (See Resources for more information on WITS.) The NCTC makes an XML file with 28,752 incident reports available for download. An excerpt from one incident report is shown in Listing 1:


Listing 1. An excerpt from a WITS XML document
                      
<IncidentList>
 <Incident>
    <ICN>200458431</ICN>
    <Subject>
      10 civilians killed, at least 45 wounded by suspected GAM in
      Peureulak, Indonesia
    </Subject>
    <Summary>
      On 1 January 2004, in Peureulak, Aceh Province, Indonesia, a
      bomb exploded at a concert, killing ten civilians, wounding 45
      others, and causing major damage to the stage area. Many of the
      victims were Indonesian teenagers. Police blamed the Free Aceh
      Movement (GAM), although the GAM denied responsibility. No 
      other group claimed responsibility.
    </Summary>
    <IncidentDate>01/01/2004</IncidentDate>
    <Location>
      <Region>East Asia-Pacific</Region>
      <Country>Indonesia</Country>
      <CityStateProvinceList>
        <CityStateProvince>
          <City>Peureulak</City>
        </CityStateProvince>
      </CityStateProvinceList>
    </Location>
 ...
 </Incident>
...
</IncidentList>
      

As Listing 1 shows, a WITS incident report contains both text content (the Subject and Summary elements) and structured metadata, such as a unique incident number (ICN), date, location information, and other elements not shown. You can use this structured metadata for faceted navigation.



Searching the WITS collection with IBM Omnifind Discovery Edition

IBM Omnifind Discovery Edition (see Resources) includes both a text search engine and a faceted navigation engine. It also provides crawlers that can ingest content from databases and XML files, as well as Web pages. To demonstrate faceted navigation, we built a proof-of-concept system using Version 8.4 of Discovery Edition, which comes with the Apache Tomcat 5.0 Web server. Both were installed on a laptop.

Let's first take a look at how Discovery Edition was configured to search and navigate the WITS collection using the default user interface. Then in a later section, see how the user interface can be improved for document search by using a tree control for displaying and selecting the metadata conditions.

Defining a collection

As with any search project, our faceted navigation demo requires that we define content sources and configure a crawler or other device to ingest and index them. As part of this process, you must tell the search engine which parts of a document correspond to features such as the title of the document, the body text, and anything else you want the search engine to index or display. Furthermore, since faceted navigation deals with metadata as well as content, you also need to tell the engine where to find the metadata values for each document, and you must specify the data model for the metadata. With Omnifind Discovery Edition, you do this by defining features to hold metadata values, such as Incident Date and the name of the city in which the incident occurred, and by specifying a data type for each feature. This is done using the Management Console tool that comes with the product. If a feature's value is actually defined by a hierarchy or tree of values (in other words, if it is a taxonomy feature), you specify that too. Other features can be defined as required, but the minimal set shown in Table 1 is enough to demonstrate faceted navigation:


Table 1. Features used in the Croton demo

Feature name    Type       Description
ICN             Text       Identifier from the <ICN> element
Subject         Text       Short description of the incident
Description     Text       Full description of the incident, from the <Summary> element
IncidentDate    DateTime   Date of the incident
Region          Text       Geographical region
Country         Text       Country within the region
StateProvince   Text       State or province within the country
City            Text       City within the state or province
Geography       Taxonomy   Created from Region>Country>City

After defining the features of the collection, you can specify the XML file containing the WITS incident data as a content source for the collection, and define how the features of each document are to be extracted from the XML. This is done by writing an XPath expression for each feature. For example, the XPath expression to define the Subject feature in the WITS XML file is IncidentList/Incident/Subject/text().
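
To see what such an XPath expression selects, you can evaluate it yourself with the XPath API that ships with the JDK. This is a minimal sketch for experimenting with expressions outside the product; the FeatureExtractor class is a hypothetical helper and is not part of Omnifind Discovery Edition, which evaluates the expressions internally.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Hypothetical helper: evaluate a feature's XPath expression against XML text.
final class FeatureExtractor {

    // Returns the string value of the first node the expression selects,
    // with surrounding whitespace trimmed, or "" if nothing is selected.
    static String extract(String xml, String xpathExpression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            InputSource source = new InputSource(new StringReader(xml));
            return xpath.evaluate(xpathExpression, source).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or XML: " + xpathExpression, e);
        }
    }
}
```

Note that this helper returns only the first match. A file with many Incident elements needs one evaluation per incident, which is the per-document extraction that the crawler performs.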

Defining a taxonomic feature

If you were to leave Region, Country, StateProvince, and City as independent features, each would show up in the user interface as a separate facet of the metadata. But they are not independent; they are closely linked because regions contain countries, cities lie within states or provinces, and so forth. A much better idea is to link these geographic features into one hierarchy to help the user see how they are related. We defined the Geography feature with the Taxonomy data type for this purpose. A taxonomy display is a particularly powerful way to display complex data models in a user interface. The Geography feature's value for a given document needs to be defined from the values of several other features that are extracted in turn from the Location element in the WITS data model (see Listing 1). To create the value of the Geography feature for each document, we use the ability of Omnifind Discovery Edition to define a metadata rule. The value of the Geography feature becomes ${Region}: ${Country}: ${City}, where the colons separate the levels of the taxonomy. (We have left out the StateProvince level of the taxonomy because many incident reports don't specify it.) As an example of the effect of this metadata rule, the value of the Geography feature for the incident shown in Listing 1 becomes "East Asia-Pacific: Indonesia: Peureulak".
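
The effect of the metadata rule can be mimicked with simple placeholder substitution. The sketch below is an illustration only: MetadataRule is a hypothetical class, and replacing a missing feature with an empty string is an assumption, since the article does not say how the product treats missing values.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: expand ${Feature} placeholders in a rule string.
final class MetadataRule {

    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    static String apply(String rule, Map<String, String> features) {
        Matcher m = PLACEHOLDER.matcher(rule);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Assumption: a feature with no value contributes an empty string.
            String value = features.getOrDefault(m.group(1), "");
            // quoteReplacement keeps '$' and '\' in values from being
            // interpreted as group references.
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Applied to the features of the incident in Listing 1, the rule "${Region}: ${Country}: ${City}" expands to the Geography value quoted above.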

Continuous variables can also be cast into the appearance of taxonomic features by organizing their values into ranges, which can then be further subdivided to create a hierarchy. Omnifind Discovery Edition can be configured to do this automatically. An example for dates is shown in Figure 2. This shows values of the Incident Date facet that all fall within the year 2006. The dates have been automatically grouped into six two-month ranges, which then form the next level of the hierarchy. If the user were to select one of these ranges, then the next display would show months and ranges of dates within a month. This approach can be used to create a hierarchy for any continuous data.
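
As a rough illustration of grouping dates into ranges, the following sketch assigns a date to one of six fixed two-month ranges within its year. The product chooses its ranges automatically and the article does not describe how, so this fixed scheme is an assumption, as is reading the WITS date 01/01/2004 as month/day/year. DateRanges is a hypothetical class name.

```java
import java.time.LocalDate;
import java.time.Month;
import java.time.format.DateTimeFormatter;

// Hypothetical sketch: map a date to a "year > two-month range" path.
final class DateRanges {

    // Assumption: WITS dates are written month/day/year.
    private static final DateTimeFormatter WITS_DATE =
        DateTimeFormatter.ofPattern("MM/dd/uuuu");

    static String twoMonthBucket(String date) {
        LocalDate d = LocalDate.parse(date, WITS_DATE);
        // First month of the range: 1, 3, 5, 7, 9, or 11.
        int firstMonth = ((d.getMonthValue() - 1) / 2) * 2 + 1;
        return d.getYear() + " > " + abbrev(firstMonth) + "-" + abbrev(firstMonth + 1);
    }

    // Three-letter month abbreviation, e.g. 1 -> "Jan".
    private static String abbrev(int month) {
        String name = Month.of(month).name(); // e.g. "JANUARY"
        return name.charAt(0) + name.substring(1, 3).toLowerCase();
    }
}
```

The returned string has the same shape as a taxonomy path, which is why a continuous value treated this way can be displayed with the same tree of values as a true taxonomic feature.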


Figure 2. Date facet portrayed as a hierarchy, similar to a taxonomic feature


At this point, having defined a collection and its features, and having ingested the WITS XML file and populated the features for each Incident document with data, we can use the default user interface within the Management Console, as shown in Figure 3. While this still needs work, it already shows the main features of faceted navigation. In Figure 3 the user has searched for "truck", and Omnifind Discovery has returned 103 documents. The metadata values for those documents are displayed and the user has the opportunity to use them to refine the search. There is a tabular display of the returned documents.

While we could tune this interface using the Management Console, for this demo system, we want to make some significant changes. The tabular display of documents is different from the norm in document search. More significantly, since only the most selective metadata values are displayed to conserve real estate on the user interface, the user has no way to explore the available values without actually selecting them and thus adding them to the query. There is no available action that simply explores the hierarchy of values. To permit a user to do these things, you need a new user interface.


Figure 3. The default user interface




An improved user interface

To give a user more flexibility in exploring the metadata facets, as well as to get better control over the details of the results list, we will replace the user interface (UI) of Figure 3 with another, illustrated in Figure 4. This UI is implemented with JavaServer Pages (JSPs) by modifying the Tabbed Navigation interface that is supplied with Omnifind Discovery Edition 8.4.

We will make two main modifications to the Tabbed Navigation interface. The first is to replace the display of "Refine By" (metadata) options with a tree control. This will enable easier exploration of the available values. The second is to extend the search results list so that a user can click on the title of an incident report and see the full report. That will make the results page resemble the de facto standard for document search.


Figure 4. An alternative user interface based on tree navigation, implemented with JSPs


A tree control, as shown on the left side of Figure 4, is a compact, yet dynamic, way to display the available metadata values in the different facets. Each branch of the tree corresponds to a facet of the metadata. This allows several facets to be displayed in a compact way. Also, the metadata conditions can be easily explored by opening and closing the subtrees without selecting them. Facets that are not of interest to the user can be left un-expanded, and don't take up room on the screen. This approach, therefore, meets our goals of an intuitive interface, while allowing complex metadata models with many facets to be easily explored.

To program our user interface, we don't have to start from scratch; instead we can modify one of the JSP-based UIs that are supplied with Omnifind Discovery Edition 8.4. We'll use the TabbedNav interface. It is supplied as a WAR file, which can be installed into a development environment, such as Rational Application Developer (RAD), and modified to suit our purpose. Figure 5 shows the different parts of the completed application, all of which can run on one laptop. A Tomcat Web server hosts the JSPs, which rely on Java libraries that are part of the Discovery product. These libraries, in turn, use Web services provided by the Omnifind Discovery server. The XML file of terrorist incident reports is both ingested by the Discovery server, as already described, and loaded into an XML database, DB2 9, from which individual reports can be retrieved by the JSP that does document display. That JSP, in turn, transmits each report as XML to the browser, together with a reference to an XSL file that formats the XML into HTML for display. All these components will be briefly described.


Figure 5. Architecture of the faceted navigation application


A tree control

We begin with the tree control for displaying the metadata. You want one that responds quickly when the user opens and closes sub-trees. The Treeview control, available online (see Resources), is suitable: it executes in the browser in JavaScript, which means it is responsive, and the JavaScript can be generated on the server side from within a JSP. To create the tree, replace the file refineByList.jsp that is part of the TabbedNav code with a new file, refineByTree.jsp, that is closely modeled on refineByList.jsp. In it, build the tree hierarchy by using calls to the Discovery Edition Java libraries, which in turn communicate with the Discovery server using Web services. For each node in the tree, define a hyperlink that, if clicked on by the user, signals the Discovery server to apply a metadata condition.

Building the tree hierarchy

The JavaScript to create a top-level folder, such as "Geography", using the Treeview control is:

folder0 = 
   insFld( foldersTree, gFld( "Geography", "javascript:undefined"));

Here, folder0 is a unique name given to the top-level folder, insFld and gFld are Treeview functions, foldersTree is the name of the tree, and “Geography” is the label on the folder. The last argument of the function gFld is the JavaScript function called when the user clicks on the folder; since "Geography" is a top-level folder, you don't want the click to do anything, and the function is a no-op. There's additional code (not shown) to consistently assign a different name to each node in the tree. Declaring this name as an "external ID" to the Treeview control enables the control to maintain its state, so that the appearance of the tree remains the same when the results page is redrawn. From a user's perspective, this provides a more natural look and feel.

The Java code to loop through the top-level features is shown in Listing 2, below. Note the calls to methods, like DrillDown.getTallyFeatures() and TallyFeature.getLabel() from the Omnifind Discovery Java library.


Listing 2. Java code to populate the tree with top-level features
                
DrillDown drillDown = resultSet.getDrillDownPlus();
StringBuffer buf = new StringBuffer(); // Holds HTML

// Loop through top-level features
for (int j = 0; j < drillDown.getTallyFeatures().length; j++){
  TallyFeature toplevelFeature = drillDown.getTallyFeatures()[j];
  String nodeName = "folder" + j;

  String label = toplevelFeature.getLabel(); // e.g. 'Geography'

  // Code to emit Treeview JavaScript goes here 

  // Add the sub-tree for the top level feature.
  addSubTree( buf, pageContext, toplevelFeature, nodeName);

} // end for j
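
The placeholder comment in Listing 2 could be filled in along the following lines. This is a sketch only: TreeviewEmitter and its methods are hypothetical names, and the escaping shown covers just backslashes and double quotes in a label. The emitted text follows the Treeview call shown earlier for the "Geography" folder.

```java
// Hypothetical sketch: produce the Treeview JavaScript for a top-level folder.
final class TreeviewEmitter {

    // Escape a label for use inside a double-quoted JavaScript string literal.
    static String jsEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Returns one line of JavaScript that creates a folder with a no-op link.
    static String topLevelFolder(String nodeName, String label) {
        return nodeName + " = insFld( foldersTree, gFld( \""
             + jsEscape(label) + "\", \"javascript:undefined\"));\n";
    }
}
```

Inside the loop, a call such as buf.append( TreeviewEmitter.topLevelFolder( nodeName, label)) would add the line to the buffer.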
      

For each top-level feature, or facet, there is a sub-tree of possible values. The Java function addSubTree in refineByTree.jsp creates the JavaScript for the sub-tree and writes it into the buffer. One complication is that the Discovery Java APIs only allow you to get a list of the nodes in the sub-tree and their recursion levels, so you have to reconstruct the sub-tree from this information. For a given entry in the list, you can find out if it has children by looking ahead to see if the following entries have a higher recursion level. If so, then the current entry must be a folder, and you push its name and recursion level onto an auxiliary stack as well as adding it to the tree. Then, when you later find a list entry with a lower recursion level, you can determine which folder that entry belongs in by popping folders off the stack until you find one whose recursion level is less than that of the list item.
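
The reconstruction just described can be sketched independently of the Discovery APIs. In the sketch below, the flat list is represented by two parallel arrays of labels and recursion levels; this representation, and the SubTreeBuilder and Node names, are assumptions made for illustration and are not the product's data structures.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch: rebuild a tree from a depth-first list of
// (label, recursion level) entries, using a stack of open folders.
final class SubTreeBuilder {

    static final class Node {
        final String label;
        final int level;
        final List<Node> children = new ArrayList<>();

        Node(String label, int level) {
            this.label = label;
            this.level = level;
        }
    }

    static Node build(String rootLabel, String[] labels, int[] levels) {
        // The root gets the lowest possible level so it is never popped.
        Node root = new Node(rootLabel, Integer.MIN_VALUE);
        Deque<Node> folders = new ArrayDeque<>();
        folders.push(root);

        for (int i = 0; i < labels.length; i++) {
            // Pop folders until the top one is shallower than this entry;
            // that folder is the entry's parent.
            while (folders.peek().level >= levels[i]) {
                folders.pop();
            }
            Node node = new Node(labels[i], levels[i]);
            folders.peek().children.add(node);

            // Look ahead: a deeper next entry means this entry is a folder.
            boolean hasChildren = i + 1 < labels.length && levels[i + 1] > levels[i];
            if (hasChildren) {
                folders.push(node);
            }
        }
        return root;
    }
}
```

In refineByTree.jsp the same walk would emit Treeview JavaScript for each entry instead of building Node objects, but the stack handling is the same.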

Proceeding in this way, the function addSubTree emits Treeview JavaScript to create either a folder node or a child node in much the same way as for a top-level folder. The main difference is that, now, each node in the tree can invoke a JavaScript function if the user clicks on it, so that the corresponding metadata condition can be added to the user's current query. The function javascript:drillerDownMenus is supplied by a JavaScript library that is part of the Discovery package. The code for refineByList.jsp provides examples of its use. One further complication is that the Treeview APIs require different delimiters for the arguments of drillerDownMenus, depending on whether it is supplied as an argument for insFld, which creates a folder (the delimiters must be double quotes), or as an argument for insDoc, which creates a leaf node (the delimiters must be single quotes).

Finally, adding some hover text to the tree nodes and specifying the CSS style to be used in rendering them results in a tree control with a usable, consistent look and feel, as shown in Figure 4. To satisfy the conditions of use of the Treeview control for a demonstration application, a title with a link to the Treeview Web page also has to be included in the output and can be seen in Figure 4.



Viewing a document

To complete our improved user interface, we want the title of an incident report in the search-results list to link to a copy of the report. This would be easy if the report were an existing Web page, but in this case, it is an XML fragment buried in the WITS XML file. We need a tool that can extract the content of the Incident element with a given ICN number from the file.

Using DB2 to store and query XML data

The tool you use is DB2, which we installed on the same laptop that hosts the other components of the demo. Since Version 9, DB2 has been able to store XML documents and return them, or parts of them, in response to a query. We store the WITS XML file in a table, CROTON_DATA, whose schema is shown in Figure 6:


Figure 6. The schema of the CROTON_DATA table


This table has only one row. The whole WITS XML file is stored in the XML column. If you were dealing with more than one XML file, for example, with report sets from different periods, you would have additional rows. You will see in a moment how the individual Incident elements can be selected out of the XML content in the table.

To populate the table, simply import the file using the DB2 IMPORT command, as shown in Listing 3 (it might be necessary to increase the size of the DB2 log file first):


Listing 3. Import file using SQL command
                
IMPORT FROM "[path]\Croton\Datasets\incidents.del" 
OF DEL XML FROM "[path]\Croton\Datasets" 
METHOD P (1, 2) MESSAGES "c:\msg.txt" 
INSERT INTO MARWICK.CROTON_DATA (NAME, "DATA");

where the file paths have been simplified. The content of the file incidents.del that the command references is just:

"WITS","<XDS FIL='wits.xml'/>" 

Of course, wits.xml is the name of the file containing all the incident reports, of which a snippet was shown in Listing 1.

When DB2 9 imports the XML file, it parses and indexes the XML to allow XML queries to be executed against it. DB2 9 contains an XML database engine that works alongside the SQL database engine, and that makes XML queries very efficient.

To retrieve a single incident report from the CROTON_DATA table requires just an XML query. Listing 4 illustrates an example query:


Listing 4. Retrieve a single incident report
                
xquery
for $Incident in 
db2-fn:xmlcolumn( "CROTON_DATA.DATA")/IncidentList/Incident
where $Incident/ICN="200458437"
return $Incident; 

This query returns the content of the requested Incident as an XML fragment. Listing 5 shows how the XML query is issued from Java, using JDBC and an existing database connection instance, just as if it were a normal SQL query:


Listing 5. Java to return XML data by issuing an XML query with JDBC
                
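// conn is an open java.sql.Connection; query holds the text of Listing 4
String result = null;
Statement stmt = conn.createStatement();
ResultSet rs = stmt.executeQuery( query);
if( rs.next()) {
      result = rs.getString( 1); // the XML fragment for the Incident
}
rs.close();
stmt.close();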
        

Returning XML data to Croton

The database access code is packaged as a session bean, XMLData, within the Croton application. The bean is used by a new JSP, showIncident, that is invoked when the user clicks on an incident title in the search results list, using a link of the following form:

http://localhost:8080/Croton/showIncident.jsp?icn=200458437

To construct this URL when the results list page is built, the value of the ICN feature for each result list item is obtained from the Omnifind Discovery Edition server using Java library functions. When the user clicks on the link, the showIncident JSP is invoked, and it retrieves the value of the ICN feature from its request object. It then uses an XMLData bean to retrieve the XML incident data into a Java String by using the XML query already described. Finally, the incident data XML is returned to the browser by showIncident.
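
Because the ICN arrives as a request parameter, it is prudent to check it before placing it in the XML query. The following sketch builds the query of Listing 4 for a given ICN. IncidentQuery is a hypothetical helper, and the rule that an ICN consists of 1 to 12 digits is an assumption based on the 9-digit ICN in Listing 1; the article does not show how the XMLData bean builds its query.

```java
// Hypothetical sketch: build the XQuery of Listing 4 for a validated ICN.
final class IncidentQuery {

    static String forIcn(String icn) {
        // Assumption: an ICN is purely numeric. Rejecting anything else keeps
        // quotes and XQuery syntax out of the query text.
        if (icn == null || !icn.matches("\\d{1,12}")) {
            throw new IllegalArgumentException("ICN must be numeric: " + icn);
        }
        return "xquery for $Incident in "
             + "db2-fn:xmlcolumn(\"CROTON_DATA.DATA\")/IncidentList/Incident "
             + "where $Incident/ICN=\"" + icn + "\" "
             + "return $Incident";
    }
}
```

The resulting string can be passed to executeQuery as in Listing 5.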

But there is one last step. The browser may not do a good job of rendering the raw XML. As a further courtesy to the user, you can use a style sheet to convert the raw XML into a well-formatted HTML page. The style sheet specifies how the XML is mapped to HTML. It drives a transformation engine in the user's browser. We use a simple style sheet that does two things: (1) it defines presentation styles, such as font and color, for use in the HTML output; and (2) it creates HTML output that uses those styles and applies them to data from the raw XML. The second of these is illustrated in the following code snippet from the style sheet:

<tr>
     <td class="FacetMajorHeader">Subject</td>
     <td><xsl:value-of select="//Incident/Subject"/></td>
</tr>

Here, the content of the Subject element is being rendered in a table row. The title ("Subject") is rendered with the style assigned to the element class FacetMajorHeader. The select attribute of the xsl:value-of element shows the XPath expression that selects the content of the Subject element in the XML and thus maps from the XML to the HTML defined by the style sheet. The content is rendered with the containing table's style, not shown.

It only remains to insert a reference to the style sheet at the beginning of the XML returned by showIncident.jsp. The result is formatting like that shown in Figure 7.


Figure 7. An XML incident report formatted with a style sheet


This is easier for the user to read than the raw XML, although, for demo purposes, a link to the raw version of the XML is included on the page. The style sheet also includes code to handle repeating elements, lists of element values, and so forth, though the example shown doesn't include any of these.
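
Inserting the style sheet reference amounts to placing an xml-stylesheet processing instruction ahead of the incident XML. The sketch below shows one way to do this in Java. StylesheetLink is a hypothetical helper, the style sheet name passed to it is up to the application, and the code assumes that the fragment returned by the database does not already begin with its own XML declaration.

```java
// Hypothetical sketch: prefix an XML fragment with a style sheet reference.
final class StylesheetLink {

    // Assumption: xmlFragment has no XML declaration of its own.
    static String withStylesheet(String xmlFragment, String xslHref) {
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<?xml-stylesheet type=\"text/xsl\" href=\"" + xslHref + "\"?>\n"
             + xmlFragment;
    }
}
```

The browser applies the style sheet only when it treats the response as XML, so the JSP should also declare an XML content type for the page.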



Conclusion

This article started by pointing out that enterprise search, particularly where there is metadata available, differs significantly from Internet search, and that a different approach is needed to satisfy users' needs. The Croton demo described in this article illustrates how faceted navigation enables text search conditions and metadata conditions to be combined in a natural way, and thus allows users to get the maximum benefit from an organization's investment in metadata creation. The demo also illustrates how a standards-based approach, based on XML, XPath query syntax, and Extensible Stylesheet Language, is made possible by the XML capabilities of DB2 9 and IBM Omnifind Discovery Edition. The overall solution is simplified by using the original XML schema, which combines both content and metadata, for both indexing and storage of the data.



Resources

Learn

Get products and technologies
  • Treeview: Download Treeview.

  • DB2 9: Download DB2 9.

  • Build your next development project with IBM trial software, available for download directly from developerWorks.


Discuss


About the author

Alan Marwick has 15 years of experience with text search and analysis techniques. In his present role, he looks for ways to apply IBM technology to the challenges faced by US Federal Government departments. Previously, he was in IBM Research, initially as a physicist, then leading teams working on knowledge-management technology. He has a PhD in physics from the University of Sussex in the UK, has published extensively in physics and computer science, and holds several patents.



