From the IBM WebSphere Developer Technical Journal.
You entered a list of keywords, clicked the Search button, and instantaneously got back a list of relevant documents. Easy? Yes. Magic? Not really.
Information retrieval, or simply "search" for short, is now a fairly fundamental field of computer science. From a technical perspective, the inverted index, used to store keywords that appear in documents and enable searches, is a well-known and well-described data structure.
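To make the inverted index concrete, here is a minimal sketch in Python. It assumes whitespace tokenization and an in-memory dictionary; the sample documents and the function names build_inverted_index and search are invented for illustration and do not come from any particular search product.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each keyword to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every keyword in the query."""
    keywords = query.lower().split()
    if not keywords:
        return set()
    # Look up the posting set of each keyword, then intersect them.
    postings = [index.get(word, set()) for word in keywords]
    return set.intersection(*postings)

# Hypothetical sample documents, invented for illustration.
documents = {
    "doc1": "marketing plan for chicken products",
    "doc2": "chicken recipes",
    "doc3": "marketing budget",
}
index = build_inverted_index(documents)
print(sorted(search(index, "marketing chicken")))
```

A real engine would also normalize words (stemming, punctuation removal) and store positions and frequencies, but the lookup-then-intersect idea stays the same.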
To be fair, finding the list of documents that contain the keywords is the easy part, while ranking techniques still involve a lot of black magic and secret evaluation formulas. Indeed, taking the thousands of documents that contain the requested keywords and ordering them so the most relevant for the user are at the top is no easy task. Still, the mathematical models used to rank results are usually some variation of the common term frequency/inverse document frequency (TF-IDF) model, which is well-mapped territory.
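As a rough illustration of the ranking model mentioned above, the following Python sketch scores documents with one simple TF-IDF variant. The smoothed formula log(1 + N/df), the whitespace tokenization, and the sample documents are all assumptions chosen for brevity; production engines use more elaborate (and often undisclosed) formulas.

```python
import math
from collections import Counter

def tf_idf_rank(documents, query):
    """Rank documents for a query with a basic TF-IDF score, best first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))

    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue
            tf = counts[term] / len(words)               # term frequency
            idf = math.log(1 + n_docs / doc_freq[term])  # smoothed inverse document frequency
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sample documents, invented for illustration.
documents = {
    "doc1": "chicken chicken marketing",
    "doc2": "marketing budget report for the year",
    "doc3": "chicken recipes",
}
print(tf_idf_rank(documents, "marketing chicken"))
```

Documents in which the query terms make up a larger share of the text rank higher, and terms that appear in fewer documents carry more weight.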
As an alternative to using keywords, you can also search using a browsing approach, which locates documents using a hierarchical, directory-like structure. These hierarchies, sometimes referred to as taxonomies, are usually built by specialists. For example, the Open Directory project, a general taxonomy used for Web sites, was once very popular.
A variation of this basic search approach is the advanced search, where searching is performed using the values of a pre-selected set of specific fields, such as title, author, and so on. Traditionally, this is considered a feature for "power users" -- those who are looking for a specific source, as opposed to any source on a specific topic. For example, you would use an advanced search if you knew the exact title, an author's name, year of publication, or some other precise piece of information about the document you were trying to find.
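In code, an advanced search boils down to matching on named fields rather than on free text. The sketch below is only an illustration: the record layout, the field names, and the advanced_search helper are hypothetical, and matching is limited to exact, case-insensitive equality.

```python
def advanced_search(records, **conditions):
    """Return records whose fields equal every given condition (case-insensitive)."""
    def matches(record):
        return all(
            str(record.get(field, "")).lower() == str(value).lower()
            for field, value in conditions.items()
        )
    return [record for record in records if matches(record)]

# Hypothetical records, invented for illustration.
records = [
    {"title": "Search Basics", "author": "A. Writer", "year": 2005},
    {"title": "Facets in Practice", "author": "B. Author", "year": 2006},
    {"title": "Search Basics, Revised", "author": "A. Writer", "year": 2006},
]
hits = advanced_search(records, author="a. writer", year=2006)
print([record["title"] for record in hits])
```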
Although not exclusively for "advanced" users, advanced search isn't for everyone. If you know the details about what you are searching for, then what is keeping you from using advanced search and being a so-called power user? Advanced search can be cumbersome. My thinking is that the user interfaces and computer languages that are used to define these advanced conditions are usually just too convoluted or complex to be within reach for mere mortals.
But things are changing, and there are now many more ways to reach content. To understand the trends in search experience -- and to understand how to get the search results you need -- it is important to distinguish between two ways of using a search engine: discovery and reach.
When you search the Web, you typically do so to discover what is available out there; before you begin searching, you have no specific knowledge of what resources exist. On the other hand, when you search your hard disk, you are usually trying to reach a piece of information that you know is there -- you just don't know where it is. These are the two basic scenarios: discovery versus reach. In other words, search differentiates between looking for what exists on a particular subject versus looking for a particular piece of information that you already know about.
Post-filtering and facets: The end of advanced search?
Clearly, advanced search is used primarily in reaching scenarios, which are rather infrequent when searching the Web. This is why most recent Web search engines have abandoned the so-called advanced search feature. On the other hand, when you search your personal data -- or even the company intranet -- you usually know some details about what you are looking for.
Enter a new search paradigm called faceted search. The value of faceted search is that it enables users to explore the results and refine their searches more precisely, without resorting to using the complex Boolean conditions of the advanced search syntax. When you use a search engine that supports faceted search features, you get a set of facets -- each facet representing an attribute, such as author, date, and so on -- in addition to the results of your search. For each facet, you see a list of possible values relevant to your search, together with a number representing the quantity of results that contain this value.
For example, suppose you are searching your own machine for the terms "marketing chicken." The search engine might return a list of results for those terms, together with the following facets:
Source: Lotus Notes (34), Disk (26)
Author: John H. (12), David K. (23), Others (25)
Updated: Last Week (15), Last Month (45)
This means that in the spectrum of search results, 34 occurrences are stored in your Lotus® Notes e-mail while 26 are on your disk. Independent of location, John H. is the author of 12 results, David K. of 23, and other people have authored 25 items. Also, independent of both location and author, 15 items were updated last week, and 45 in the past month.
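Computing such facet counts is straightforward once each result carries its attributes. The following Python sketch assumes results are plain dictionaries; the facet_counts helper and the three sample results are invented for illustration and are much smaller than the 60 results in the example above.

```python
from collections import Counter

def facet_counts(results, facet_names):
    """For each facet, count how many results carry each value."""
    return {
        name: Counter(result[name] for result in results if name in result)
        for name in facet_names
    }

# Hypothetical search results, invented for illustration.
results = [
    {"title": "Q3 chicken campaign", "source": "Lotus Notes", "author": "John H."},
    {"title": "Chicken marketing budget", "source": "Disk", "author": "David K."},
    {"title": "Re: marketing chicken", "source": "Lotus Notes", "author": "David K."},
]
counts = facet_counts(results, ["source", "author"])
for name, values in counts.items():
    print(name, dict(values))
```

Each facet is counted independently over the full result set, which is why the numbers under every facet add up to the same total.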
How do you work with these facets? Easily. Facet values are links, so when you decide which value is most important to you, you simply click it, and the results are filtered to match that value and organized by the remaining facets. For example, click the Lotus Notes link, and the following facets might appear:
Author: John H. (3), David K. (11), Others (20)
Updated: Last Week (11), Last Month (23)
When considering only those results from Lotus Notes, you can continue to drill down through the most important facets and target your search more precisely until you find the information you need. You can undo your selection at any time to explore other paths.
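Drilling down and undoing a selection can be sketched as filtering on a set of selected facet values and recomputing counts for the facets that remain. This is a minimal sketch under the assumption that results are dictionaries with every facet present; the refine and remaining_facets helpers and the sample data are hypothetical.

```python
from collections import Counter

def refine(results, selections):
    """Keep only the results that match every selected facet value."""
    return [r for r in results
            if all(r.get(facet) == value for facet, value in selections.items())]

def remaining_facets(results, facet_names, selections):
    """Recompute value counts for the facets that are not selected yet."""
    return {name: dict(Counter(r[name] for r in results))
            for name in facet_names if name not in selections}

# Hypothetical results, invented for illustration.
results = [
    {"source": "Lotus Notes", "author": "John H.", "updated": "Last Week"},
    {"source": "Lotus Notes", "author": "David K.", "updated": "Last Month"},
    {"source": "Disk", "author": "David K.", "updated": "Last Week"},
    {"source": "Disk", "author": "Others", "updated": "Last Month"},
]
facet_names = ["source", "author", "updated"]

selections = {"source": "Lotus Notes"}  # the user clicks "Lotus Notes"
filtered = refine(results, selections)
facets_after_click = remaining_facets(filtered, facet_names, selections)
print(len(filtered), facets_after_click)

# Undo the click by dropping the selection and filtering again.
undone = {k: v for k, v in selections.items() if k != "source"}
restored = refine(results, undone)
print(len(restored))
```

Because the original result list is never modified, undoing a selection is just a matter of filtering again with fewer conditions.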
The advantages of this approach are probably obvious. When attribute-rich search results are available, you can explore them in an easy and intuitive way, without resorting to Boolean logic. Additionally, since the number of results available under each facet value is shown next to it, you can avoid exploring any "empty" paths.
On the Internet, faceted searching is currently prevalent in e-merchant sites, such as those for consumer electronics, which have enough features and segments to make this kind of navigation meaningful. Although faceted searching might seem less relevant for standard Web pages right now, there are enough features to make faceted searches practical in the context of personal and intranet search engines, and so I believe we will be seeing much more of this feature in the coming years, as we search more and more through e-mail, personal files, and intranet information.
Tagging and social search: The end of taxonomies?
As mentioned earlier, you can use taxonomies to locate subjects of interest and corresponding documents. Taxonomies, though, are difficult to maintain. When a new document is created, someone has to associate it with the right categories in this complex data structure. This leads us to a new trend known as folksonomies: rather than depending on taxonomy specialists or other people to categorize information after the fact, you can now categorize a document that you authored or accessed. This is done simply by associating a document with a single word, or tag. Then, any user can search and explore this tag space, view the most frequent tags, see what documents have been tagged with that word, and so on. In theory, tags will help more people find more documents. For example, a paper about microelectronics could be tagged with that word -- even if the term "microelectronics" does not appear in the content of the document itself. Thus, users searching for microelectronics-related content using that word as a tag would be able to find this document in their search results -- thanks to whoever had the foresight or reason to create the tag, and to the dynamic practicality of tagging itself. A keyword search alone would not have found it.
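A tag space can be modeled as little more than a mapping from tags to documents. The Python sketch below is an assumption-laden illustration: the TagSpace class, its method names, and the document ids are invented, and tags are normalized to lowercase only.

```python
from collections import defaultdict

class TagSpace:
    """A minimal folksonomy: any user can attach free-form tags to documents."""

    def __init__(self):
        self._docs_by_tag = defaultdict(set)

    def tag(self, doc_id, tag):
        """Associate a document with a tag (tags are case-insensitive)."""
        self._docs_by_tag[tag.strip().lower()].add(doc_id)

    def documents_for(self, tag):
        """Return the documents tagged with the given word."""
        return set(self._docs_by_tag.get(tag.strip().lower(), set()))

    def most_frequent(self, n=3):
        """Return up to n (tag, document count) pairs, most used first."""
        ranked = sorted(self._docs_by_tag.items(),
                        key=lambda item: (-len(item[1]), item[0]))
        return [(tag, len(docs)) for tag, docs in ranked[:n]]

# Hypothetical document ids, invented for illustration.
space = TagSpace()
space.tag("paper-17", "microelectronics")
space.tag("paper-42", "Microelectronics")
space.tag("paper-17", "chips")
print(sorted(space.documents_for("microelectronics")))
print(space.most_frequent())
```

Note that documents_for finds "paper-17" by its tag alone, regardless of whether the word appears anywhere in the document text.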
In practice, of course, there are some things that can complicate the use of tagging. Since tagging can be done freely by anybody, nothing prevents irrelevant tags or tagging errors from accumulating slowly in search collections over time. In enterprise settings, security is also an issue. If users are not allowed to access a document, should they be able to see the document's associated tags when they explore the tag space? This approach may simplify and improve search in private controlled collections (your team files, your photos), but can also introduce more complexity in searching through large enterprise-wide document collections.
When you search, you probably like to search through more than one source. On the Web, for example, you generally limit yourself to a single search engine, if only because it is cumbersome to use several and compare their results. When searching your personal files or the enterprise intranet, this problem becomes even more complicated. There are several search engines, accessed from different user interfaces, over different sets of data, yielding several different sets of results. Two approaches, federation and aggregation, can help in this case.
Federation means that some tool registers all the search engines that are available, forwards any search you do to all of them, retrieves the results, and presents the combined results to you in a single list. In most cases, it will be difficult to rank the results uniformly. (In some search engines, a relevance score of 85% for a result is a "good" score, while in others, it is 95%.) Thus, when federation is used, the results from each source will most likely be presented in isolation. (Faceted search might be helpful here.)
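The federation pattern can be sketched as a function that forwards one query to every registered engine and keeps the results grouped by source. Everything in this Python sketch is hypothetical: the two engines are stand-in functions returning made-up (title, score) pairs, and federated_search is not the API of any real product.

```python
def federated_search(engines, query):
    """Forward the query to every registered engine and group results by source."""
    combined = {}
    for name, engine in engines.items():
        hits = engine(query)
        # Sort within each source only: scores are not comparable across engines.
        combined[name] = sorted(hits, key=lambda hit: hit[1], reverse=True)
    return combined

# Hypothetical engines returning (title, relevance score) pairs.
def mail_engine(query):
    return [("Fwd: notes", 0.90), ("Re: " + query, 0.95)]

def disk_engine(query):
    return [(query + ".doc", 0.85), ("old draft.txt", 0.40)]

engines = {"mail": mail_engine, "disk": disk_engine}
results = federated_search(engines, "marketing chicken")
for source, hits in results.items():
    print(source, hits)
```

Keeping one ranked list per source sidesteps the score calibration problem described above, at the cost of not telling the user which result is best overall.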
When using aggregation, a central search engine gathers all the searchable content. Only one location is searched, and uniformly ranked results covering all the sources are returned. Aggregation is handy but difficult to achieve, since it requires crawling through all the available data sources and scaling the search engine. Most likely, you will have to live with several levels of federation and aggregation -- for the time being.
Along with more and more information to search, we also have more and more ways to conduct the search. We in the WebSphere Portal Search Development team want to help you explore the different ways that you can make it easier for end users to find information in the enterprise. This is why we published public APIs and freely available search tools. These are geared towards helping you experiment with developing more effective search applications and finding your way through the ever-growing information space that surrounds us all.
- Open Directory Project
- The Top 100 Alternative Search Engines
- Introducing the Search and Indexing API in WebSphere Portal V6.0
- Unleashing the power of WebSphere Portal V6 Search with the Portal Search Toolbox





