Level: Introductory
Alan Marwick (marwick@us.ibm.com), Technical Competency Lead, Knowledge Management, IBM US Federal CTO office, IBM
14 Feb 2008
While there are several different ways for a user to specify metadata
conditions, this article discusses one that has special advantages: faceted
navigation. Follow the faceted navigation system described in this article, a technology demonstrator
based on IBM® Omnifind™ Discovery Edition
that exploits the XML capabilities of IBM DB2®, to explore the advantages of
faceted navigation, and see how to
get the maximum benefit from metadata creation.
Better search with faceted navigation
Text search is one of the most important ways that users of enterprise content
can find the documents they need. Unfortunately, there are a number of reasons why
enterprise text search systems often work less well than search of the public
Internet (Enterprise Search: Tough Stuff, Rajat Mukherjee and Jianchang
Mao. ACM Queue vol. 2, no. 2, April 2004).
The main reason is that most enterprise content isn't cross-linked, so the search
system doesn't have the page rank information that identifies "good" pages on a
topic. On the other hand, Internet search doesn't make much, or any, use of
metadata because most Internet content has minimal or no metadata. In contrast,
enterprise content often has metadata associated with it. In fact, many
organizations are significantly investing in the creation of standardized
metadata, such as the Defense Discovery Metadata Standard (DDMS), defined by the
US Department of Defense, and the Cross-Enterprise Document Sharing (XDS)
framework, created by the IHE, a health industry consortium. (See
Resources for more information on each of these
standardized metadata methods.)
Thus there is an opportunity to improve search within enterprises by using
metadata. Metadata is helpful because it allows users to specify conditions that
any retrieved document must meet. For example, in addition to a keyword search
query such as "diesel pollution" or "culvert bomb Anbar Province," users might use
metadata conditions to specify that they are only interested in documents created
in the last three months by a specific author. Typical Internet search systems
can't do that.
While there are several different ways for a user to specify metadata conditions,
this article is about one that has special advantages: faceted navigation. The
faceted navigation system described here, nicknamed Croton, is a technology
demonstrator based on IBM® Omnifind™ Discovery Edition and exploits
the XML capabilities of DB2. The whole Croton system runs on a laptop.
Example of faceted navigation
The faceted navigation search interface of Croton is shown in
Figure 1. At the top is the familiar search box, into which
the user has entered her query: "cafe". On the left, under the heading "Refine
by," are the facets "Incident Date" and "Geography". Each of these is the head of
a different hierarchy of values, which can be expanded as a tree in the user
interface. The user has selected "Iraq" then "Kirkuk". The final effect is to
limit the documents in the results list to those that have "cafe" in their text
and whose Geography facet has the value "Geography > Middle East and
Persian Gulf > Iraq > Kirkuk".
This example illustrates three key features of faceted navigation:
- The user selects metadata conditions by clicking on values presented by the
application.
- Only metadata values that lead to documents are presented. A user will never
get an empty results list by clicking on a metadata condition.
- The metadata is organized in several independent categories, called
facets, each of which can take a number of values. The values can be
organized hierarchically in a taxonomy or in some other way, such as grouping
dates into ranges, as in the example.
Figure 1. Faceted navigation application
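The three features listed above can be sketched in a few lines of Java. This is a minimal in-memory illustration, not the Omnifind Discovery Edition API: the class FacetSketch, its Doc type, and the methods search and refinements are hypothetical names invented for this example. The sketch combines a keyword condition with a selected taxonomy path, then counts only those next-level values that occur in the current hits, which is why a click on an offered value can never produce an empty results list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory model; not the Omnifind Discovery Edition API. */
final class FacetSketch {

    /** A document with free text and one taxonomy path, for example Region, Country, City. */
    static final class Doc {
        final String text;
        final List<String> geography;

        Doc(String text, String... geography) {
            this.text = text;
            this.geography = Arrays.asList(geography);
        }
    }

    /** Keeps documents whose text contains the keyword and whose path starts with the selection. */
    static List<Doc> search(List<Doc> docs, String keyword, List<String> selected) {
        List<Doc> hits = new ArrayList<>();
        for (Doc d : docs) {
            boolean textMatch = d.text.toLowerCase().contains(keyword.toLowerCase());
            boolean facetMatch = d.geography.size() >= selected.size()
                    && d.geography.subList(0, selected.size()).equals(selected);
            if (textMatch && facetMatch) {
                hits.add(d);
            }
        }
        return hits;
    }

    /** Counts the values one level below the selection, using only the current hits. */
    static Map<String, Integer> refinements(List<Doc> hits, List<String> selected) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Doc d : hits) {
            if (d.geography.size() > selected.size()) {
                counts.merge(d.geography.get(selected.size()), 1, Integer::sum);
            }
        }
        return counts;
    }
}
```

Because refinements looks only at documents that already satisfy the query, every value it returns has a count of at least one, and that count is the number of documents that would remain if the user selected the value.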
Faceted navigation should already be familiar to anyone who has browsed catalogs
on the Internet. Catalog items are essentially described only by metadata; in
fact, if there's a search box in such an application, it often only searches
metadata. In the present case, the items being searched are documents that are
described both by metadata and content, but the basic user interface is the same
as in the e-commerce case.
The familiarity of faceted navigation and its ease of use by untrained
searchers are significant advantages. Other techniques for incorporating metadata
conditions in searches are widely used but have some disadvantages. A common
technique is to expand the query language to allow the metadata conditions to be
specified as part of the query string. This was the approach used in the first
generation of bibliographic search engines, such as IBM's STAIRS. Because of the
complexity of the query language and the need to understand the data schema in
order to compose the queries, training is required to use these applications.
Typically, such applications are used by librarians and other specialists.
Another way to incorporate metadata conditions is to use a query form, whose
fields correspond to the elements of the metadata schema. Experience shows that
the form is rarely used if there is a simpler interface available. This is
consistent with studies of searches by untrained users, such as Web searches,
which show that fewer than 10% of queries use advanced features (A Spink, D Wolfram, M.
Jansen, and T. Saracevic "Searching the Web: The public
and their queries". J. American Society for Information Science and Technology,
vol 52, 2001, pp. 226-234).
Both of these approaches suffer from the further disadvantage that it is easy to
compose a query that returns no results. This is confusing for users who don't
know which conditions of their query to relax in order to get some results.
A faceted navigation approach overcomes these disadvantages. The query language
is simple and compatible with Internet search engines. The structure of the
metadata schema is explicit, as in a form, but the user doesn't see the detail of
any of the facets that he is not interested in. Lastly, it is not possible to
select metadata conditions that return zero results because they are not presented
to the user. Instead, the user can see how many documents will be left in the
results list if that condition is selected (see Figure 1).
Faceted navigation has a further advantage. From studies of user behavior,
we know that people prefer to use an approach to searching that can be described
as successive refinement (A
Spink, T.D. Wilson, N. Ford, A Foster and D. Ellis "Information seeking and
mediated searching study: Part 3. Successive searching". J. American Society for
Information Science and Technology, vol 53, 2002, pp. 716-727). That is, they issue an initial broad query that
reassures them that the search system gives access to at least some documents in
the area they are interested in. They then add more precise conditions until they
get what they are looking for. Faceted navigation supports this style of querying
very well, which is another reason why it is widely used in catalog searching on
e-commerce sites.
The rest of this article describes a proof-of-concept demonstration of a faceted
navigation system to search a collection of documents and its metadata, using IBM
Omnifind Discovery Edition. The content used in the demo is a collection of
terrorism reports available in XML, which we will store as native XML in DB2 V9.
Building this system illustrates the main concepts of faceted navigation in a
concrete way and creates a demo that can run on a laptop for evaluations and
demonstrations.
The WITS document set
As examples of documents that have extensive metadata associated with them, let's
use the WITS document collection. WITS stands for "Worldwide Incident Tracking
System," which is a database of terrorist incidents maintained by the US National
Counter Terrorism Center. (See Resources for more
information on WITS.)
The NCTC makes an XML file with 28,752 incident reports available for download. An
excerpt from one incident report is shown in Listing 1:
Listing 1. An excerpt from a WITS XML document
<IncidentList>
  <Incident>
    <ICN>200458431</ICN>
    <Subject>
      10 civilians killed, at least 45 wounded by suspected GAM in
      Peureulak, Indonesia
    </Subject>
    <Summary>
      On 1 January 2004, in Peureulak, Aceh Province, Indonesia, a
      bomb exploded at a concert, killing ten civilians, wounding 45
      others, and causing major damage to the stage area. Many of the
      victims were Indonesian teenagers. Police blamed the Free Aceh
      Movement (GAM), although the GAM denied responsibility. No
      other group claimed responsibility.
    </Summary>
    <IncidentDate>01/01/2004</IncidentDate>
    <Location>
      <Region>East Asia-Pacific</Region>
      <Country>Indonesia</Country>
      <CityStateProvinceList>
        <CityStateProvince>
          <City>Peureulak</City>
        </CityStateProvince>
      </CityStateProvinceList>
    </Location>
    ...
  </Incident>
  ...
</IncidentList>
As Listing 1 shows, a WITS incident report contains both text content (the
Subject and Summary elements) and structured metadata, such as a unique
incident number (ICN), date, location information, and others not shown. You
can use this structured metadata for faceted navigation.
Searching the WITS collection with IBM Omnifind Discovery Edition
IBM Omnifind Discovery Edition (see Resources)
includes both a text search engine and a faceted navigation engine. It also
provides crawlers that can ingest content from databases and XML files, as well as
Web pages. To demonstrate faceted navigation, we built a proof-of-concept system
using Version 8.4 of Discovery Edition, which comes with the Apache Tomcat 5.0 Web
server. These were both installed on a laptop.
Let's first take a look at how Discovery Edition was configured to search and
navigate the WITS collection using the default user interface. Then in a later
section, see how the user interface can be improved for document search by using a
tree control for displaying and selecting the metadata conditions.
Defining a collection
As with any search project, our faceted navigation demo requires that we define
content sources and configure a crawler or other device to ingest and index them.
As part of this process, you must tell the search engine which parts of a document
correspond to features, like the title of the document, the body text, and
anything else you want the search engine to index or display. Furthermore, since
faceted navigation deals with metadata as well as content, you also need to tell
the engine where to find the metadata values for each document, and you must
specify the data model for the metadata. With Omnifind Discovery Edition, you do
this by defining features to hold metadata values such as Incident Date and the
name of the city within which the incident occurred. You also must specify a data
type for each feature. This is done using the Management Console tool that comes
with the product. If a feature's value is actually defined by a hierarchy or tree
of values, in other words, if it is a taxonomy feature, you specify that too. The
features used in our demo are shown in Table 1. Other
features can be defined as required, but this minimal set is
enough to demonstrate faceted navigation:
Table 1. Features used in the Croton demo
| Feature name | Type | Description |
| --- | --- | --- |
| ICN | Text | Identifier from the <ICN> element |
| Subject | Text | Short description of the incident |
| Description | Text | Full description of the incident, from the <Summary> element |
| IncidentDate | DateTime | Date of the incident |
| Region | Text | Geographical region |
| Country | Text | Country within the region |
| StateProvince | Text | State or province within the country |
| City | Text | City within the state or province |
| Geography | Taxonomy | Created from Region>Country>City |
After defining the features of the collection, you can specify the XML file
containing the WITS incident data as a content source for the collection,
and define how the features of each document are to be extracted from the XML.
This is done by writing an XPath expression for each feature. For example, the
XPath expression to define the Subject feature in the WITS XML file is
IncidentList/Incident/Subject/text().
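To see what such an expression selects, the same XPath can be evaluated with the XPath support built into the JDK. This is only an illustration of the expression itself; it says nothing about how Discovery Edition evaluates XPath internally, and the class name FeatureXPathSketch and method extract are invented for the example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Illustration only: evaluates a feature's XPath with the JDK, not with Discovery Edition. */
final class FeatureXPathSketch {

    /** Returns the trimmed string value of the first node that the expression selects. */
    static String extract(String xml, String expression) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            // The document node is the context, so a relative path starts at the root element.
            return xpath.evaluate(expression, doc).trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Cannot evaluate " + expression, e);
        }
    }
}
```

Calling extract with the text of Listing 1 and the expression IncidentList/Incident/Subject/text() would return the subject line of the first incident. A file with many incidents needs one evaluation per Incident element, which is the part that the crawler takes care of.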
Defining a taxonomic feature
If you were to leave Region, Country, StateProvince, and City as independent
features, each would show up in the user interface as a separate facet of the
metadata. But they are not independent; they are closely linked because regions
contain countries, cities lie within states or provinces, and so forth. A much
better idea is to link these geographic features into one hierarchy to help the
user see how they are related. We defined the Geography feature with the
Taxonomy data type for this purpose. A taxonomy display
is a particularly powerful way to display complex data models in a user
interface. The Geography feature's value for a given document needs to be defined
from the values of several other features that are extracted in turn from the
Location element in the WITS data model (see
Listing 1). To create the value of the Geography feature for
each document, we use the ability of Omnifind Discovery Edition to define a
metadata rule. The value of the Geography feature becomes
${Region}: ${Country}: ${City}, where the colons
separate the levels of the taxonomy. (We have left out the
StateProvince level of the taxonomy because many
incident reports don't specify it.) As an example of the effect of this metadata
rule, the value of the Geography feature for the incident shown in
Listing 1 becomes "East Asia-Pacific: Indonesia:
Peureulak".
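The effect of such a rule can be imitated with simple placeholder substitution. The following sketch is not the Discovery Edition rule engine; it only shows how a template of the form used above turns separate feature values into one taxonomy value. The class MetadataRuleSketch is hypothetical, and replacing a missing feature with an empty level is an assumption made for the example.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that mimics a metadata rule; not part of Discovery Edition. */
final class MetadataRuleSketch {

    // Matches placeholders of the form ${FeatureName}.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\$\\{(\\w+)\\}");

    /** Replaces each placeholder in the rule with the value of the named feature. */
    static String apply(String rule, Map<String, String> features) {
        Matcher m = PLACEHOLDER.matcher(rule);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = features.get(m.group(1));
            if (value == null) {
                value = ""; // assumption: a missing feature yields an empty level
            }
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

Applied to the Region, Country, and City values of Listing 1, the rule ${Region}: ${Country}: ${City} yields the colon-separated value quoted above.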
Continuous variables can also be cast into the appearance of taxonomic features
by organizing their values into ranges, which can then be further subdivided to
create a hierarchy. Omnifind Discovery Edition can be configured to do this
automatically. An example for dates is shown in Figure 2.
This shows values of the Incident Date facet that all fall within the year 2006.
The dates have been automatically grouped into six two-month ranges, which then
form the next level of the hierarchy. If the user were to select one of these
ranges, then the next display would show months and ranges of dates within a
month. This approach can be used to create a hierarchy for any continuous data.
Figure 2. Date facet portrayed as a hierarchy, similar to a taxonomic feature
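Discovery Edition builds these ranges automatically, so no code is needed in the demo. Purely to make the idea concrete, here is a minimal sketch of how dates could be assigned to two-month ranges and counted. The class DateRangeSketch and the label format are assumptions for illustration and do not reflect the product's algorithm or display.

```java
import java.time.LocalDate;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of grouping dates into two-month ranges. */
final class DateRangeSketch {

    /** Labels the two-month range that contains the date, for example "2006-03 to 2006-04". */
    static String twoMonthRange(LocalDate date) {
        // Months 1-2 map to 1, months 3-4 map to 3, and so on up to 11.
        int firstMonth = ((date.getMonthValue() - 1) / 2) * 2 + 1;
        return String.format(Locale.ROOT, "%d-%02d to %d-%02d",
                date.getYear(), firstMonth, date.getYear(), firstMonth + 1);
    }

    /** Counts the dates in each range; ranges with no dates never appear in the result. */
    static Map<String, Integer> countByRange(List<LocalDate> dates) {
        Map<String, Integer> counts = new TreeMap<>();
        for (LocalDate d : dates) {
            counts.merge(twoMonthRange(d), 1, Integer::sum);
        }
        return counts;
    }
}
```

The same integer-division trick works for any continuous value: pick a range width, map each value to the start of its range, and repeat with a smaller width for the next level of the hierarchy.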
At this point, having defined a collection and its features, and having ingested
the WITS XML file and populated the features for each Incident document with data,
we can use the default user interface within the Management Console, as shown in
Figure 3. While this still needs work, it already shows the
main features of faceted navigation. In Figure 3 the user has searched for
"truck", and Omnifind Discovery has returned 103 documents. The metadata values
for those documents are displayed and the user has the opportunity to use them to
refine the search. There is a tabular display of the returned documents.
While we could tune this interface using the Management Console, for this demo
system, we want to make some significant changes. The tabular display of documents
is different from the norm in document search. More significantly, since only the
most selective metadata values are displayed to conserve real estate on the user
interface, the user has no way to explore the available values without actually
selecting them and thus adding them to the query. There is no available action
that simply explores the hierarchy of values. To permit a user to do these things,
you need a new user interface.
Figure 3. The default user interface
An improved user interface
To give a user more flexibility in exploring the metadata facets, as well as to
get better control over the details of the results list, we will replace the user
interface (UI) of Figure 3 with another, illustrated in Figure 4. This UI is
implemented with Java Server Pages (JSPs) by modifying the Tabbed Navigation
interface that is supplied with Omnifind Discovery Edition 8.4.
We will make two main modifications to the Tabbed Navigation interface. The first
is to replace the display of "Refine By" (metadata) options with a tree
control. This will enable easier exploration of the available values. The second
is to extend the search results list so that a user can click on the title of an
incident report and see the full report. That will make the results page resemble
the de-facto standard for document search.
Figure 4. An alternative user interface based on tree navigation, implemented with JSPs
A tree control, as shown on the left side of Figure 4, is a compact, yet dynamic,
way to display the available metadata values in the different facets. Each branch
of the tree corresponds to a facet of the metadata. This allows several facets to
be displayed in a compact way. Also, the metadata conditions can be easily
explored by opening and closing the subtrees without selecting them. Facets that
are not of interest to the user can be left un-expanded, and don't take up room on
the screen. This approach, therefore, meets our goals of an intuitive interface,
while allowing complex metadata models with many facets to be easily explored.
To program our user interface, we don't have to start from scratch; instead we can
modify one of the JSP-based UIs that are supplied with Omnifind Discovery Edition
8.4. We'll use the TabbedNav interface. It is supplied as a WAR file, which can be
installed into a development environment, such as Rational Application Developer
(RAD), and modified to suit our purpose. Figure 5 shows the different parts of the
completed application, all of which can run on one laptop. A Tomcat Web server
hosts the JSPs, which rely on Java libraries that are part of the Discovery
product. These libraries, in turn, use Web services provided by the Omnifind
Discovery server. The XML file of terrorist incident reports is both ingested by
the Discovery server, as already described, and loaded into an XML database,
DB2 9, from which individual reports can be retrieved by the JSP that displays
documents. That JSP, in turn, transmits them as XML to the browser with an XSL
file that formats the XML into HTML for display. All these components are
briefly described in the sections that follow.
Figure 5. Architecture of the faceted navigation application
A tree control
We begin with the tree control for displaying the metadata. You want one that
responds quickly when the user opens and closes sub-trees. The Treeview control,
available online (see Resources),
is suitable: it executes in the browser in JavaScript, which means it is
responsive, and the JavaScript can be built on the server side by writing the
JavaScript from within a JSP. To create the tree, replace the file
refineByList.jsp that is part of the TabbedNav code
with a new file, refineByTree.jsp, that is closely
modeled on refineByList. In it, build the tree
hierarchy by using calls to the Discovery Edition Java libraries, which in turn
communicate with the Discovery server using Web services. For each node in the
tree, define a hyperlink that, if clicked on by the user, signals the Discovery
server to apply a metadata condition.
Building the tree hierarchy
The JavaScript to create a top-level folder, such as "Geography", using the
Treeview control is:
folder0 = insFld(foldersTree, gFld("Geography", "javascript:undefined"));
Here, folder0 is a unique name given to the top-level
folder, insFld and gFld are
Treeview functions, foldersTree is the name of the
tree, and "Geography" is the label on the folder. The
last argument of the function gFld is the JavaScript
function called when the user clicks on the folder; since "Geography" is a top-level
folder, you don't want the click to do anything, and the function is a no-op.
There's additional code (not shown) to consistently assign a different name to
each node in the tree. Declaring this name as an "external ID" to the Treeview
control enables the control to maintain its state, so that the appearance
of the tree remains the same when the results page is redrawn. From a user's
perspective, this provides a more natural look and feel.
The Java code to loop through the top-level features is shown in Listing 2, below. Note
the calls to methods, like DrillDown.getTallyFeatures()
and TallyFeature.getLabel() from the Omnifind Discovery
Java library.
Listing 2. Java code to populate the tree with top-level features
DrillDown drillDown = resultSet.getDrillDownPlus();
StringBuffer buf = new StringBuffer(); // Holds HTML
// Loop through top-level features
for (int j = 0; j < drillDown.getTallyFeatures().length; j++) {
    TallyFeature toplevelFeature = drillDown.getTallyFeatures()[j];
    String nodeName = "folder" + j;
    String label = toplevelFeature.getLabel(); // e.g. 'Geography'
    // Code to emit Treeview JavaScript goes here
    // Add the sub-tree for the top level feature.
    addSubTree(buf, pageContext, toplevelFeature, nodeName);
} // end for j
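The placeholder comment in Listing 2 ("Code to emit Treeview JavaScript goes here") could be filled by a small helper along the following lines. This is a sketch, not the code used in Croton: the class TreeviewEmitSketch and its methods are hypothetical, and the only Treeview calls it writes are insFld and gFld in the form already shown for the top-level folder. The label is escaped so that a quote character in a feature name cannot break the generated script.

```java
/** Hypothetical helper that writes Treeview JavaScript into the page buffer. */
final class TreeviewEmitSketch {

    /** Escapes a label so that it can sit inside a double-quoted JavaScript string. */
    static String escapeJs(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Appends the JavaScript statement that creates one top-level folder. */
    static void emitTopLevelFolder(StringBuffer buf, String nodeName, String label) {
        buf.append(nodeName)
           .append(" = insFld(foldersTree, gFld(\"")
           .append(escapeJs(label))
           .append("\", \"javascript:undefined\"));\n");
    }
}
```

Inside the loop of Listing 2, the call would be emitTopLevelFolder(buf, nodeName, label), after which addSubTree adds the values below that folder.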
For each top-level feature, or facet, there is a sub-tree of possible values. The
Java function addSubTree in
refineByTree.jsp creates the JavaScript for the
sub-tree and writes it into the buffer. One complication is that the Discovery
Java APIs only allow you to get a list of the nodes in the sub-tree and their
recursion levels, so you have to reconstruct the sub-tree from this information.
For a given entry in the list, you can find out if it has children by looking ahead
to see if the next entry has a higher recursion level. If so, then the
current entry must be a folder, and you push its name and recursion level onto an
auxiliary stack as well as adding it to the tree. Then, when you later find a list
entry with a lower recursion level, you can determine which folder that entry
belongs in by popping folders off the stack until you find one whose recursion
level is less than that of the list item.
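The reconstruction described above can be sketched independently of the Discovery APIs. The example below assumes that the flat list arrives in pre-order as parallel arrays of labels and recursion levels, with level 1 directly below the facet; the class SubTreeSketch, its Node type, and that input shape are assumptions made for illustration, and the real addSubTree emits JavaScript rather than building Node objects.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch: rebuilds a tree from a pre-order list of labels and recursion levels. */
final class SubTreeSketch {

    static final class Node {
        final String label;
        final int level;
        final List<Node> children = new ArrayList<>();

        Node(String label, int level) {
            this.label = label;
            this.level = level;
        }
    }

    /** Assumes levels start at 1 below the facet and never skip a level on the way down. */
    static Node build(String facetLabel, String[] labels, int[] levels) {
        Node root = new Node(facetLabel, 0);
        Deque<Node> openFolders = new ArrayDeque<>(); // auxiliary stack of folders
        openFolders.push(root);
        for (int i = 0; i < labels.length; i++) {
            // Pop folders until the top one is shallower than this entry: that is its parent.
            while (openFolders.size() > 1 && openFolders.peek().level >= levels[i]) {
                openFolders.pop();
            }
            Node node = new Node(labels[i], levels[i]);
            openFolders.peek().children.add(node);
            // Look ahead: a deeper next entry means the current entry is a folder.
            boolean isFolder = i + 1 < labels.length && levels[i + 1] > levels[i];
            if (isFolder) {
                openFolders.push(node);
            }
        }
        return root;
    }
}
```

Each entry is pushed and popped at most once, so the whole list is processed in a single pass.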
Proceeding in this way, the function addSubTree emits Treeview JavaScript to
create either a folder node or a child node in much the same way as for a top
level folder. The main difference is that, now, each node in the tree can invoke a
JavaScript function if the user clicks on it, so that the corresponding metadata
condition can be added to the user's current query. The function
javascript:drillerDownMenus is supplied by a
JavaScript library that is part of the Discovery package. The code for
refineByList.jsp provides examples of its use. One
further complication is that the Treeview APIs require different delimiters for
the arguments of drillerDownMenus, depending on whether it is supplied as an
argument for insFld, which creates a folder (the delimiters must be
double quotes), or as an argument for insDoc, which
creates a leaf node (the delimiters must be single quotes).
Finally, adding some hover text to the tree nodes and specifying the CSS style
to be used in rendering them results in a tree control with a usable, consistent
look and feel, as shown in Figure 4. To satisfy the conditions of use of the
Treeview control for a demonstration application, a title with a link to the
Treeview Web page also has to be included in the output and can be seen in Figure 4.
Viewing a document
To complete our improved user interface, we want the title of an incident report
in the search-results list to link to a copy of the report. This would be easy if
the report were an existing Web page, but in this case, it is an XML fragment buried
in the WITS XML file. We need a tool that can extract the content of the
Incident element with a given ICN number from the file.
Using DB2 to store and query XML data
The tool you use is DB2, which we installed on the same laptop that hosts the other
components of the demo. Since Version 9, DB2 has been able to store XML documents
and return them, or parts of them, in response to a query. We store the WITS XML
file in a table, CROTON_DATA, whose schema is shown in
Figure 6:
Figure 6. The schema of the
CROTON_DATA table
This table has only one row. The whole WITS XML file is stored in the XML
column. If you were dealing with more than one XML file, for example, with report
sets from different periods, you would have additional rows. You will see in a
moment how the individual Incident elements can be selected out of the XML content
in the table.
To populate the table, import the file using the DB2 IMPORT command, as shown in
Listing 3 (it might be
necessary to increase the size of the DB2 log file first):
Listing 3. Import file using the DB2 IMPORT command
IMPORT FROM "[path]\Croton\Datasets\incidents.del"
    OF DEL XML FROM "[path]\Croton\Datasets"
    METHOD P (1, 2) MESSAGES "c:\msg.txt"
    INSERT INTO MARWICK.CROTON_DATA (NAME, "DATA");
where the file paths have been simplified. The content of the file
incidents.del that the command references is just:
"WITS","<XDS FIL='wits.xml'/>"
Of course, wits.xml is the name of the file containing all the incident reports,
of which a snippet was shown in Listing 1.
When DB2 9 imports the XML file, it parses and indexes the XML to allow XML
queries to be executed against it. DB2 9 contains an XML database engine that
works alongside the SQL database engine, and that makes XML queries very
efficient.
Retrieving a single incident report from the
CROTON_DATA table requires just an XML query (XQuery). Listing 4
illustrates an example query:
Listing 4. Retrieve a single incident report
xquery
for $Incident in
    db2-fn:xmlcolumn("CROTON_DATA.DATA")/IncidentList/Incident
where $Incident/ICN="200458437"
return $Incident;
This query returns the content of the requested Incident as an XML fragment.
Listing 5 shows how the XML query is issued from Java,
using JDBC and an existing
database connection instance, just as if it were a normal SQL query:
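Listing 5. Java to return XML data by issuing an XML query with JDBC
// Uses java.sql.Statement and java.sql.ResultSet; conn is an open java.sql.Connection
String result = null; // stays null if no incident matches the query
Statement stmt = conn.createStatement();
try {
    ResultSet rs = stmt.executeQuery(query);
    if (rs.next()) {
        result = rs.getString(1); // return XML fragment
    }
} finally {
    stmt.close(); // closing the statement also closes its result set
}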
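The query string passed to executeQuery in Listing 5 has to be built from the incident number that arrives with the request. A minimal sketch of that step follows; the class IncidentQuerySketch and method forIcn are hypothetical. It assumes that incident numbers consist of digits only, as the examples in this article do, and rejects anything else so that request input cannot change the text of the query. The trailing semicolon of Listing 4 is left out on the assumption that it is a command-line statement terminator rather than part of the query.

```java
/** Hypothetical helper that builds the query of Listing 4 for a given incident number. */
final class IncidentQuerySketch {

    /** Returns the XQuery text for one incident, or throws if the ICN is not numeric. */
    static String forIcn(String icn) {
        // Assumption: ICNs are digits only, so anything else is rejected outright.
        if (icn == null || !icn.matches("[0-9]+")) {
            throw new IllegalArgumentException("ICN must consist of digits only");
        }
        return "xquery for $Incident in "
             + "db2-fn:xmlcolumn(\"CROTON_DATA.DATA\")/IncidentList/Incident "
             + "where $Incident/ICN=\"" + icn + "\" "
             + "return $Incident";
    }
}
```

Validating the value before concatenation matters here because the ICN comes straight from a URL parameter.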
Returning XML data to Croton
The database access code is packaged as a session bean,
XMLData, within the Croton application. The bean is
used by a new JSP, showIncident, that is invoked when
the user clicks on an incident title in the search results list, using a link of the
following form:
http://localhost:8080/Croton/showIncident.jsp?icn=200458437.
To construct this URL when the results list page is built, the value of the ICN
feature for each result list item is obtained from the Omnifind Discovery Edition
server using Java library functions. When the user clicks on the link, the
showIncident JSP is invoked, and it retrieves the value
of the ICN feature from its request object. It then uses an
XMLData bean to retrieve the XML incident data into a
Java String by using the XML query already described. Finally, the incident data
XML is returned to the browser by showIncident.
There is one last step. The browser may not do a good job of
rendering the raw XML. As a further courtesy to the user, you can use a style sheet to
convert the raw XML into a well-formatted HTML page. The style sheet specifies how
the XML is mapped to HTML. It drives a transformation engine in the user's
browser. We use a simple style sheet that does two things: (1) it defines
presentation styles, such as font and color, for use in the HTML output; and (2)
it creates HTML output that uses those styles and applies them to data from the
raw XML. The second step is illustrated in the following code snippet from the style
sheet:
<tr>
  <td class="FacetMajorHeader">Subject</td>
  <td><xsl:value-of select="//Incident/Subject"/></td>
</tr>
Here, the content of the Subject element is being rendered in a table row. The
title ("Subject") is rendered with the style assigned to the element class
FacetMajorHeader. The select
attribute of the xsl:value-of element shows the XPath
expression that selects the content of the Subject
element in the XML and thus maps from the XML to the HTML defined by the style
sheet. The content is rendered with the containing table's style, not shown.
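In Croton the transformation runs in the user's browser. To try out a style sheet rule without a browser, the same mapping can be applied on the server side with the XSLT engine that ships with the JDK. The sketch below does that as an alternative for experimentation only; the class StyleSheetSketch is hypothetical, and its cut-down style sheet contains just the table row shown above wrapped in an assumed template and table element.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical server-side test harness; Croton itself transforms in the browser. */
final class StyleSheetSketch {

    // Cut-down style sheet: only the row from the article; the wrapper is assumed.
    static final String XSL =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"html\"/>"
      + "<xsl:template match=\"/\">"
      + "<table><tr>"
      + "<td class=\"FacetMajorHeader\">Subject</td>"
      + "<td><xsl:value-of select=\"//Incident/Subject\"/></td>"
      + "</tr></table>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the style sheet to one incident and returns the generated HTML. */
    static String toHtml(String incidentXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(XSL)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(incidentXml)), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }
}
```

Running the transformation in the browser, as the demo does, keeps the server's work to returning XML; the server-side variant is mainly useful for checking XPath expressions while developing the style sheet.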
It only remains to insert a reference to the style sheet at the beginning of the
XML returned by showIncident.jsp. The result is
formatting like that shown in Figure 7.
Figure 7. An XML incident report formatted with a style sheet
This is easier for the user to read than
the raw XML, although, for demo purposes, a link to the raw version of the XML is
included on the page. The style sheet also includes code to handle XML features
such as repeating elements and lists of element values, though the
example shown doesn't include any of these.
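The step of inserting a style sheet reference at the beginning of the returned XML can be sketched as follows. The standard way to associate a style sheet with an XML document is an xml-stylesheet processing instruction placed before the root element. The class StyleSheetLinkSketch and the file name passed as href are hypothetical, and the sketch assumes that the fragment returned by the database does not already begin with an XML declaration.

```java
/** Hypothetical helper that links an XML fragment to an XSL style sheet. */
final class StyleSheetLinkSketch {

    /** Prefixes the fragment with an XML declaration and an xml-stylesheet instruction. */
    static String withStyleSheet(String xmlFragment, String styleSheetHref) {
        // Assumption: xmlFragment has no XML declaration of its own.
        return "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
             + "<?xml-stylesheet type=\"text/xsl\" href=\"" + styleSheetHref + "\"?>\n"
             + xmlFragment;
    }
}
```

A JSP that returns this text would also need to send it with an XML content type so that the browser applies the style sheet instead of treating the response as HTML.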
Conclusion
This article started by pointing out that enterprise search, particularly where there is
metadata available, differs significantly from Internet search, and that a
different approach is needed to satisfy users' needs. The Croton demo described in
this article illustrates how faceted navigation enables text search conditions and
metadata conditions to be combined in a natural way, and thus allows users to get
the maximum benefit from an organization's investment in metadata creation. The
demo also illustrates how a standards-based approach, based on XML, XPath query
syntax, and Extensible Stylesheet Language, is made possible by the XML
capabilities of DB2 9 and IBM Omnifind Discovery Edition. The overall
solution is simplified by using the original XML schema, which combines both
content and metadata, for both indexing and storage of the data.
Resources
Get products and technologies
- Treeview: Download Treeview.
- DB2 9: Download DB2 9.
- Build your next development project with IBM trial software, available for download directly from developerWorks.
About the author
Alan Marwick has 15 years experience with text search and analysis techniques. In his present role, he looks for ways to apply IBM technology to the challenges faced by US Federal Government departments. Previously, he was in IBM Research, initially as a physicist, then leading teams working on knowledge-management technology. He has a PhD in physics from the University of Sussex in the UK, has published extensively in physics and computer science, and holds several patents.