IBM Content Analytics with Enterprise Search, Version 3.0.0

Linguistic support for semantic search

IBM® Content Analytics with Enterprise Search provides linguistic support for semantic search in most Indo-European languages and Asian languages, including Japanese.

You can use the linguistic support to improve the quality of search results.

Linguistic processing is performed in two stages: when a text document is processed to be added into the index, and when a user enters a query.

IBM Content Analytics with Enterprise Search includes basic linguistic functions that are used to determine the language of an input document and to segment the document input stream into words or tokens.

If you know that your searches will be restricted primarily to basic facet value searches or native XML searches that use the document structure, the included linguistic processing adequately covers your needs.
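
The two basic functions described above, language identification and tokenization, can be illustrated with a minimal Python sketch. It does not show how the product implements them: the stop-word lists, the function names `tokenize` and `guess_language`, and the sample sentences are all assumptions made for illustration. Splitting on word characters works only for languages that separate words with spaces or punctuation; languages such as Japanese need dictionary-based or statistical segmentation.

```python
import re

# Tiny, hypothetical stop-word lists; a real system would use far richer models.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to"},
    "de": {"der", "die", "und", "ist", "das"},
    "fr": {"le", "la", "et", "est", "les"},
}


def tokenize(text):
    """Segment the input stream into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def guess_language(tokens):
    """Pick the language whose stop words occur most often in the tokens."""
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "unknown"


english = tokenize("The index is built from the text of each document.")
german = tokenize("Der Index ist das Herz der Suche und die Basis.")

print(english[:4])              # ['the', 'index', 'is', 'built']
print(guess_language(english))  # en
print(guess_language(german))   # de
```

The same two steps would run in both stages: once on each document as it is added to the index, and once on the query text that the user enters.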

Most information in text documents is unstructured, which makes it difficult to use effectively because it is not easy to access the meaning of the information.

Searching for keywords is simple, but it is not always satisfactory if you want to search beyond the mere words in the document, as is illustrated in the following examples:

In collaboration scenarios, information such as an address or a phone number in an email is not always explicitly marked. In fact, the term phone number might not be used at all. Instead, the email might contain a phrase such as "you can reach me at 555-641-1805". Users often do not know how the information that they want to find is presented in the document, and would ideally enter a query like "Barbara phone number" to find the phone number of someone called Barbara. However, this query will not be successful because the term phone number does not occur in the document.
In competitive intelligence, documents might mention competitors and the goods that they supply, or note that a competitor's website shifted over the past three months from selling one product set to another. In this case, the user might enter a query like "Smith & Co. goods" or "Smith & Co. goods Nov. 2004 till Jan. 2005". In the first query, the term goods stands for a product or range of products, but the query will not return the products supplied by Smith & Co. because it is looking for the literal term goods. The same applies to the query that includes a particular time period. It is almost impossible to query a time period by using keyword search.
In customer relationship management, documents might mention automobile brake problems in repair shops in the San Francisco area. The repair shop reports describe situations such as "shoe adjusted because of a hydraulic leak". The user querying for more detailed information might enter a query like "brake problem repair shops in north San Francisco". However, this query might not return any reports that talk about "shoe adjusted because of a hydraulic leak" because the terms brake problem or repair shops as such do not occur in the reports. Moreover, these reports might mention only the street and district name of the repair shop, not the full address including the city name San Francisco.
In research, documents might describe a particular drug that is widely marketed under various trademarks, and its relation to at least one disease that is mentioned in the same paragraph. A casual user might enter a query that uses one of the popular terms for the drug, hoping for a more detailed account of the various illnesses, including symptoms. However, the query might not return satisfactory documents because the popular term might not always be used in the documents, and these documents often do not mention the word illness at all, only the name of the illness itself.
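
The gap between keyword matching and concept-level matching in the first example can be sketched in Python. This is an illustration under stated assumptions, not the product's implementation: the sample email text, the phone number pattern, and the helper names `keyword_match`, `annotate`, and `semantic_match` are all hypothetical.

```python
import re

# Hypothetical email text, based on the collaboration example above.
DOCUMENT = "Hi, this is Barbara. You can reach me at 555-641-1805."

# Assumed pattern for numbers written like 555-641-1805.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def keyword_match(query, document):
    """Return True only if every query token occurs literally in the document."""
    doc_tokens = set(tokenize(document))
    return all(token in doc_tokens for token in tokenize(query))


def annotate(document):
    """Attach concept labels to the document based on patterns in its text."""
    concepts = {}
    numbers = PHONE_PATTERN.findall(document)
    if numbers:
        concepts["phone number"] = numbers
    return concepts


def semantic_match(person, concept, document):
    """Return the annotated values for a concept if the person is mentioned."""
    if person.lower() not in tokenize(document):
        return []
    return annotate(document).get(concept, [])


print(keyword_match("Barbara phone number", DOCUMENT))      # False
print(semantic_match("Barbara", "phone number", DOCUMENT))  # ['555-641-1805']
```

The keyword query fails because the tokens phone and number never occur in the email, whereas the annotation step labels the digits as a phone number at indexing time so that a query for the concept can find them.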
