IBM® Content
Analytics with Enterprise Search provides
linguistic support for semantic search in most Indo-European languages
and Asian languages, including Japanese.
You can use the linguistic support to improve the quality of search
results.
Linguistic processing is performed in two stages: when a text document
is processed to be added into the index, and when a user enters a
query.
IBM Content
Analytics with Enterprise Search includes basic
linguistic functions that are used to determine the language of an
input document and to segment the document input stream into words
or tokens.
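As a rough illustration of what segmenting an input stream into tokens
means, the following minimal Python sketch splits text on non-word
characters. It is not the product's implementation: the function name
tokenize and the regular expression are assumptions made for this
example, and real linguistic segmentation (especially for languages
such as Japanese that do not separate words with spaces) is far more
involved.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens (illustrative assumption only)."""
    # \w+ matches runs of letters, digits, and underscores; punctuation
    # and whitespace act as token boundaries.
    return [token.lower() for token in re.findall(r"\w+", text)]

tokens = tokenize("You can reach me at 555-641-1805")
print(tokens)
```

Note that this naive approach breaks the phone number into three
separate tokens, which hints at why purely word-based processing can
lose information.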
If you know that your searches will be restricted primarily to
basic facet value searches or native XML searches that use the document
structure, the included linguistic processing adequately covers your
needs.
Most information in text documents is unstructured, which makes
it difficult to use effectively because it is not easy to access the
meaning of the information.
Searching for keywords is simple, but it is not always satisfactory
if you want to search beyond the mere words in the document, as is
illustrated in the following examples:
- In collaboration cases, information is not always explicitly marked,
for example, an address or a phone number in an email. In fact, the
term "phone number" might not be used at all. Instead, the email
might contain a phrase such as "you can reach me at 555-641-1805".
The user often does not know how the information that he or she wants
to search for is presented in the document, and would ideally want
to enter a query like "Barbara phone number" when looking for the
phone number of someone called Barbara. However, this query will not
be successful because the term "phone number" does not occur
in the document.
- In competitive intelligence, documents might mention competitors and
the goods that they supply, or report that a competitor's website shifted
over the past three months from selling one product set to another.
In this case, the user might enter a query like "Smith & Co. goods"
or "Smith & Co. goods Nov. 2004 till Jan. 2005". In the first
query, the term "goods" stands for a product or range of products,
but the query will not return the products supplied by Smith &
Co. because it is looking for the literal term "goods". The same applies
to the query that includes a particular time period. It is almost impossible
to query a time period by using keyword search.
- In customer relationship management, documents might mention automobile
brake problems in repair shops in the San Francisco area. The repair
shop reports describe situations such as "shoe adjusted because of
a hydraulic leak". The user querying for more detailed information
might enter a query like "brake problem repair shops in north San
Francisco". However, this query might not return any reports that
talk about "shoe adjusted because of a hydraulic leak" because the
terms "brake problem" or "repair shops" as such do not occur
in the reports. Moreover, these reports might mention only the street
and district name of the repair shop, not the full address including
the city name San Francisco.
- In research, documents describe a particular drug widely marketed
under various trademarks and its relation to at least one disease
that is mentioned in the same paragraph. The casual user might enter
a query using one of the popular terms for the drug hoping for a more
detailed account of the various illnesses including symptoms. However,
the query might not return satisfactory documents because the popular
term might not always be used in documents, and these documents often
do not mention the word "illness" at all, only the name of the
illness itself.
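The gap between keyword search and concept-based search in the first
example can be sketched in a few lines of Python. This is only an
illustration under stated assumptions, not the product's API: the
function names keyword_match, annotate_phone_numbers, and
concept_match are hypothetical, and the phone number pattern covers
only the 555-641-1805 style used in the example.

```python
import re

# Assumed pattern: matches only numbers written like 555-641-1805.
PHONE_PATTERN = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def words(text):
    """Return the set of lowercase word tokens in the text."""
    return set(re.findall(r"\w+", text.lower()))

def keyword_match(query, document):
    """Return True only if every query word occurs literally in the document."""
    return words(query) <= words(document)

def annotate_phone_numbers(document):
    """Hypothetical annotator: return all phone numbers found in the document."""
    return PHONE_PATTERN.findall(document)

def concept_match(person, document):
    """Look for a person name together with the concept of a phone number,
    rather than the literal words 'phone number'."""
    if person.lower() in words(document):
        return annotate_phone_numbers(document)
    return []

email = "Hi, this is Barbara. You can reach me at 555-641-1805."

# The keyword query fails because "phone" and "number" never occur.
print(keyword_match("Barbara phone number", email))
# The concept-based lookup finds the number from its pattern.
print(concept_match("Barbara", email))
```

The same idea, recognizing a concept such as a product, a time period,
or a disease instead of matching literal words, applies to the other
examples, although those cases generally need dictionaries or richer
linguistic analysis rather than a single regular expression.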