Comment lines: David Konopnicki: Search has changed -- have you?

David Konopnicki (Davidko@il.ibm.com), Search Technical Lead, IBM
David Konopnicki works in the IBM Haifa Lab in Haifa, Israel. He is the technical lead of the Search team in the Workplace, Portal, and Collaboration (WPLC) products division.

Summary:  An overview of how search has evolved, from keywords and browsing, to faceted search, folksonomies, discovery, and reach. See what's new and what's coming to keep your applications on the leading edge.

Date:  09 May 2007
Level:  Introductory
Also available in:   Chinese

From the IBM WebSphere Developer Technical Journal.

Seek and you shall find

You entered a list of keywords, clicked the Search button, and instantaneously got back a list of relevant documents. Easy? Yes. Magic? Not really.

Information retrieval, or simply "search" for short, is now a fairly fundamental field of computer science. From a technical perspective, the inverted index, used to store keywords that appear in documents and enable searches, is a well-known and well-described data structure.
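
To make the inverted index concrete, here is a minimal Python sketch. The tokenizer, the function names, and the sample documents are all assumptions made for illustration; a production engine would also handle stemming, stop words, and on-disk storage.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_inverted_index(docs):
    """Map each keyword to the set of ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return the ids of the documents that contain every keyword in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

# Hypothetical sample collection.
docs = {
    1: "marketing plan for chicken products",
    2: "chicken recipes",
    3: "quarterly marketing report",
}
index = build_inverted_index(docs)
print(search(index, "marketing chicken"))  # {1}
```

The lookup never scans the documents themselves: it only intersects the precomputed sets of document ids, which is why keyword matching is the easy part.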

To be fair, finding the list of documents that contain the keywords is the easy part, while ranking techniques still involve a lot of black magic and secret evaluation formulas. Indeed, taking the thousands of documents that contain the requested keywords and ordering them so the most relevant for the user are at the top is no easy task. Still, the mathematical models used to rank results are usually some variation of the common term frequency/inverse document frequency (TF-IDF) model, which is well-mapped territory.
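
As a rough illustration of the term frequency/inverse document frequency idea, the sketch below scores each document by how often the query terms occur in it, weighted by how rare each term is across the collection. The smoothed weight log(1 + N/df) is one common variant chosen here as an assumption; real engines use their own, often proprietary, formulas. The function name and sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf_rank(docs, query):
    """Rank document ids by the summed tf * idf of the query terms, best first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    n_docs = len(tokenized)

    # Document frequency: the number of documents in which each term appears.
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))

    scores = {}
    for doc_id, terms in tokenized.items():
        counts = Counter(terms)
        score = 0.0
        matched = False
        for term in query.lower().split():
            if counts[term] == 0:
                continue
            matched = True
            tf = counts[term] / len(terms)          # frequency within this document
            idf = math.log(1 + n_docs / df[term])   # rarer terms weigh more
            score += tf * idf
        if matched:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sample collection.
docs = {
    "a": "chicken chicken marketing",
    "b": "marketing report for the annual marketing review",
    "c": "chicken recipes for the weekend",
}
print(tf_idf_rank(docs, "chicken marketing"))  # ['a', 'b', 'c']
```

Document "a" ranks first because both query terms make up its entire text, while "b" and "c" each match only one term among several unrelated words.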

Each year, the National Institute of Standards and Technology organizes the "Olympic games" of search engines. Known as the Text REtrieval Conference (TREC), this event provides participants with a haystack of documents in which they must find only a few needles. Unfortunately, most commercial search engines refuse to be evaluated using such scientific methods.

As an alternative to keywords, you can also search using a browsing approach, which locates documents using a hierarchical, directory-like structure. These hierarchies, sometimes referred to as taxonomies, are usually built by specialists. For example, the Open Directory project, a general taxonomy used for Web sites, was once very popular.

A variation of this basic search approach is the advanced search, where searching is performed using the values of a pre-selected set of specific fields, such as title, author, and so on. Traditionally, this is considered a feature for "power users" -- those who are looking for a specific source, as opposed to any source on a specific topic. For example, you would use an advanced search if you knew the exact title, an author's name, year of publication, or some other precise piece of information about the document you were trying to find.

Although it is not reserved for "advanced" users, advanced search isn't for everyone. If you know the details of what you are searching for, what keeps you from using advanced search and being a so-called power user? The answer is that advanced search can be cumbersome. My thinking is that the user interfaces and computer languages used to define these advanced conditions are usually just too convoluted or complex to be within reach of mere mortals.

But things are changing, and there are now many more ways to reach content. To understand the trends in search experience -- and to understand how to get the search results you need -- it is important to distinguish between two ways of using a search engine: discovery and reach.

When you search the Web, you typically do so to discover what is available out there; before you begin searching, you have no specific knowledge of what resources exist. On the other hand, when you search your hard disk, you are usually trying to reach a piece of information that you know is there -- you just don't know where it is. These are the two basic scenarios: discovery versus reach. In other words, search differentiates between looking for what exists on a particular subject versus looking for a particular piece of information that you already know about.


Post-filtering and facets: The end of advanced search?

Clearly, advanced search is used primarily in reaching scenarios, which are rather infrequent when searching the Web. This is why most recent Web search engines have abandoned the so-called advanced search feature. On the other hand, when you search your personal data -- or even the company intranet -- you usually know some details about what you are looking for.

Enter a new search paradigm called faceted search. The value of faceted search is that it enables users to explore the results and refine their searches more precisely, without resorting to using the complex Boolean conditions of the advanced search syntax. When you use a search engine that supports faceted search features, you get a set of facets -- each facet representing an attribute, such as author, date, and so on -- in addition to the results of your search. For each facet, you see a list of possible values relevant to your search, together with a number representing the quantity of results that contain this value.

For example, suppose you are searching your own machine for the terms "marketing chicken." The search engine might return a list of results for those terms, together with the following facets:

Source: Lotus Notes (34), Disk (26)
Author: John H. (12), David K. (23), Others (25)
Updated: Last Week (15), Last Month (45)

This means that in the spectrum of search results, 34 occurrences are stored in your Lotus® Notes e-mail while 26 are on your disk. Independent of location, John H. is the author of 12 results, David K. of 23, and other people have authored 25 items. Also, independent of both location and author, 15 items were updated last week, and 45 in the past month.

How do you work with these facets? Easily. Each facet value is a link, so when you decide which one matters most to you, you simply click it. The results are then filtered to that value and displayed again, organized by the remaining facets. For example, click the Lotus Notes link, and the following facets might appear:

Author: John H. (3), David K. (11), Others (20)
Updated: Last Week (11), Last Month (23)

When considering only those results from Lotus Notes, you can continue to drill down through the most important facets and target your search more precisely until you find the information you need. You can undo your selection at any time to explore other paths.
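
The facet counts and the drill-down step described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: each result is a dictionary of attributes, and the field names, function names, and sample results are hypothetical.

```python
from collections import Counter

def facet_counts(results, facets):
    """For each facet, count how many results carry each value."""
    return {facet: Counter(r[facet] for r in results) for facet in facets}

def drill_down(results, facet, value):
    """Keep only the results whose facet has the selected value."""
    return [r for r in results if r[facet] == value]

# Hypothetical results for the query "marketing chicken".
results = [
    {"title": "Chicken campaign", "source": "Lotus Notes",
     "author": "John H.", "updated": "Last Week"},
    {"title": "Marketing budget", "source": "Lotus Notes",
     "author": "David K.", "updated": "Last Month"},
    {"title": "Chicken pricing", "source": "Disk",
     "author": "David K.", "updated": "Last Week"},
]

counts = facet_counts(results, ["source", "author", "updated"])
print(dict(counts["source"]))  # {'Lotus Notes': 2, 'Disk': 1}

# Clicking the "Lotus Notes" link: filter, then recompute the remaining facets.
notes_only = drill_down(results, "source", "Lotus Notes")
remaining = facet_counts(notes_only, ["author", "updated"])
print(dict(remaining["updated"]))  # {'Last Week': 1, 'Last Month': 1}
```

Because the counts are recomputed over the filtered list, the numbers shown after each click always add up to the size of the current result set, and undoing a selection simply means recomputing over the previous list.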

The advantages of this approach are probably obvious. When attribute-rich search results are available, you can explore them in an easy and intuitive way, without resorting to Boolean logic. Additionally, since the number of results available under each facet value is displayed, you can avoid exploring any "empty" paths.

On the Internet, faceted search is currently prevalent on e-commerce sites, such as those for consumer electronics, which have enough features and segments to make this kind of navigation meaningful. Although faceted search might seem less relevant for standard Web pages right now, personal and intranet content carries enough attributes to make it practical there. I therefore believe we will see much more of this feature in the coming years, as we search more and more through e-mail, personal files, and intranet information.


Tagging and social search: The end of taxonomies?

As mentioned earlier, you can use taxonomies to locate subjects of interest and corresponding documents. Taxonomies, though, are difficult to maintain. When a new document is created, someone has to associate it with the right categories in this complex data structure. This leads us to a new trend known as folksonomies: rather than depending on taxonomy specialists or other people to categorize information after the fact, you can now categorize a document that you authored or accessed. This is done simply by associating the document with a single word, or tag. Then, any user can search and explore this tag space, view the most frequent tags, see what documents have been tagged with a given word, and so on.

In theory, tags will help more people find more documents. For example, a paper about microelectronics could be tagged with that word -- even if the term "microelectronics" does not appear in the content of the document itself. Thus, users searching for microelectronics-related content using that word as a tag would find this document in their search results, thanks to whoever had the foresight to create the tag. A keyword search alone would not have found it.
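
A tag space of this kind can be modeled with very little code. The sketch below is a toy in-memory store, not the design of any particular product; the class name, method names, and sample data are hypothetical.

```python
from collections import Counter, defaultdict

class TagStore:
    """A toy folksonomy: any user can attach free-form tags to any document."""

    def __init__(self):
        # tag -> set of (user, doc_id) pairs, so repeated taggings count once
        self._taggings = defaultdict(set)

    def tag(self, user, doc_id, tag):
        """Record that a user tagged a document (tags are case-insensitive)."""
        self._taggings[tag.strip().lower()].add((user, doc_id))

    def documents(self, tag):
        """Return the ids of all documents carrying the tag."""
        pairs = self._taggings.get(tag.strip().lower(), set())
        return {doc_id for _, doc_id in pairs}

    def most_frequent(self, n):
        """Return the n most used tags with their usage counts."""
        counts = Counter({t: len(pairs) for t, pairs in self._taggings.items()})
        return counts.most_common(n)

store = TagStore()
store.tag("alice", "paper-17", "microelectronics")
store.tag("bob", "paper-17", "Microelectronics")
store.tag("bob", "memo-3", "budget")

print(store.documents("microelectronics"))  # {'paper-17'}
print(store.most_frequent(1))               # [('microelectronics', 2)]
```

Note that "paper-17" is found through its tag even though nothing in this store looks at the document's content, which is exactly what distinguishes tag search from keyword search.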

In practice, of course, there are some things that can complicate the use of tagging. Since tagging can be done freely by anybody, nothing prevents irrelevant tags or tagging errors from accumulating slowly in search collections over time. In enterprise settings, security is also an issue. If users are not allowed to access a document, should they be able to see the document's associated tags when they explore the tag space? This approach may simplify and improve search in private controlled collections (your team files, your photos), but can also introduce more complexity in searching through large enterprise-wide document collections.
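
One possible answer to the security question, shown here only as an assumption and not as a recommendation, is to compute the tag space over the documents the current user is allowed to read. The function name and sample data are hypothetical.

```python
from collections import Counter

def visible_tag_counts(doc_tags, readable_docs):
    """Count tags only over the documents the current user may read."""
    counts = Counter()
    for doc_id, tags in doc_tags.items():
        if doc_id in readable_docs:
            counts.update(tags)
    return counts

# Hypothetical collection: the second document is restricted.
doc_tags = {
    "team-plan": {"marketing", "chicken"},
    "merger-memo": {"confidential-merger", "marketing"},
}
tag_cloud = visible_tag_counts(doc_tags, readable_docs={"team-plan"})
print(sorted(tag_cloud.items()))  # [('chicken', 1), ('marketing', 1)]
```

The cost of this choice is that the tag space must be computed per user rather than once for everyone, which is part of the added complexity in large enterprise collections.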


Results from multiple sources

When you search, you probably would like to search through more than one source. On the Web, for example, you generally limit yourself to one search engine, simply because it is cumbersome to use several and compare their results. When searching your personal files or the enterprise intranet, this problem becomes even more complicated. There are several search engines, accessed from different user interfaces, over different sets of data, yielding several different sets of results. Two approaches, federation and aggregation, can help in this case.

Federation means that some tool registers all the search engines that are available, forwards any search you do to all of them, retrieves the results, and presents the combined results to you in a single list. In most cases, it will be difficult to rank the results uniformly. (In some search engines, a relevance score of 85% for a result is a "good" score, while in others, it is 95%.) Thus, when federation is used, the results from each source will most likely be presented in isolation. (Faceted search might be helpful here.)
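
A federated search can be sketched as a loop over registered engines. In this illustration each engine is assumed to be a plain function returning (document id, score) pairs; the engine names, scores, and function names are hypothetical, and a real federator would also query the engines in parallel and handle timeouts.

```python
def federated_search(engines, query):
    """Forward the query to every registered engine and group the hits by source."""
    grouped = {}
    for name, engine in engines.items():
        hits = engine(query)  # each engine returns [(doc_id, score), ...]
        # Scores are only comparable within one engine, so sort per source.
        grouped[name] = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return grouped

# Two stand-in engines with different scoring scales.
def mail_engine(query):
    return [("mail-7", 0.95), ("mail-2", 0.97)]

def disk_engine(query):
    return [("file-4", 0.85), ("file-9", 0.60)]

engines = {"Lotus Notes": mail_engine, "Disk": disk_engine}
grouped = federated_search(engines, "marketing chicken")
for source, hits in grouped.items():
    print(source, hits)
```

The results are deliberately kept in one list per source rather than merged: a 0.85 from the disk engine may well be a better match than a 0.95 from the mail engine, so sorting all hits by raw score would be misleading.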

When using aggregation, a central search engine gathers all the searchable content. One location is searched and the uniform results are returned from all the sources. Thus, aggregation is handy but difficult to achieve, since it requires crawling through all the available data sources and scaling the search engine. Most likely, you will have to live with several levels of federation and aggregation -- for the time being.


Portal search toolbox

Along with more and more information to search, we also have more and more ways to conduct the search. We in the WebSphere Portal Search Development team want to help you explore the different ways that you can make it easier for end users to find information in the enterprise. This is why we published public APIs and freely available search tools. These are geared towards helping you experiment with developing more effective search applications and finding your way through the ever-growing information space that surrounds us all.

