If the language of a document is correctly detected and
language-specific dictionaries are available, then appropriate linguistic
processing is applied.
Segmentation is the process by which input text is broken down
into distinct lexical units. This process includes some of the following
linguistic processing activities:
- Word segmentation
- Word segmentation is used for languages that do not use white
space (or other delimiters) between words, such as Japanese and Chinese.
- Lemmatization
- Lemmatization is a form of linguistic processing that determines
the lemma for each word form that occurs in text. The lemma of
a word encompasses its base form plus inflected forms that share the
same part of speech. For example, the lemma for go encompasses go, goes, went, gone,
and going. Lemmas for nouns group singular and plural forms
(such as calf and calves). Lemmas for adjectives group
comparative and superlative forms (such as good, better,
and best). Lemmas for pronouns group different cases of the
same pronoun (such as I, me, my, and mine).
Lemmatization requires a dictionary for both indexing and searching.
Watson Content Analytics indexes the lemmas and
the inflected words and lemmatizes all inflected words in a query.
Lemmatization enhances search quality by finding documents that contain
variants of an inflected word in the query. For example, documents
that contain the word mice are found when a query includes
the word mouse.
- Contraction splitting
- Search quality is improved by identifying contractions and splitting
them into their component parts. For example:
wouldn't is split into would + not
Horse's is split into Horse + 's
- Clitic identification
- Clitics are a special form of contraction, and search quality
is improved by determining their component parts. A clitic is
an element that behaves both like an affix and like a word. However, clitics
are difficult to identify because they are also part of word formation.
Unlike other morphological (word structure) phenomena, clitics occur
in a syntactic structure and their attachment to words is not part
of the word formation rules. For example:
reparti-lo-emos has the components repartir + lo + emos
l'avenue has the components le + avenue
dell'arte has the components dello + arte
- Nonalphabetic character recognition
- The linguistic processes recognize nonalphabetic characters. Depending
on the internal language-dependent logic, some nonalphabetic characters
are returned as separate lexical units of different types, and some
are grouped.
For example, apostrophes in the case of clitics are
considered word parts, and they are considered full stops (or periods)
in the case of unknown abbreviations. URLs, email addresses, and dates
are split into several tokens.
- Abbreviation recognition
- The linguistic processes recognize abbreviations that are in the
dictionary as one lexical unit. If an abbreviation is not in the
dictionary, it is still recognized as a lexical unit, but it has no
associated dictionary information.
Recognizing abbreviations correctly is vital for sentence recognition.
For example, the period at the end of an abbreviation is not necessarily
the end of a sentence.
- End-of-sentence marker recognition
- The linguistic processes correctly identify end-of-sentence markers
for sentence segmentation.
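Several of the activities above can be illustrated with a minimal Python sketch. It is not how Watson Content Analytics implements linguistic processing: the tiny dictionaries (LEMMAS, CONTRACTIONS, ABBREVIATIONS) and all function names are hypothetical, chosen only to show dictionary-based lemmatization at index and query time, contraction splitting, and abbreviation-aware sentence segmentation.

```python
import re

# Hypothetical toy dictionaries; a real system would load
# language-specific dictionaries instead.
LEMMAS = {"mice": "mouse", "calves": "calf", "goes": "go", "went": "go",
          "gone": "go", "going": "go", "better": "good", "best": "good"}
CONTRACTIONS = {"wouldn't": ["would", "not"]}
ABBREVIATIONS = {"dr.", "e.g.", "etc."}


def split_contraction(token):
    """Split a contraction into its component parts, if it is one."""
    lowered = token.lower()
    if lowered in CONTRACTIONS:
        return list(CONTRACTIONS[lowered])
    if lowered.endswith("'s") and len(token) > 2:
        return [token[:-2], "'s"]
    return [token]


def lemmatize(token):
    """Look up the lemma; unknown words fall back to their lowercase form."""
    lowered = token.lower()
    return LEMMAS.get(lowered, lowered)


def index_terms(text):
    """Collect both the surface forms and the lemmas found in a text."""
    terms = set()
    for raw in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text):
        for part in split_contraction(raw):
            terms.add(part.lower())
            terms.add(lemmatize(part))
    return terms


def matches(query, document):
    """True if every lemmatized query word is among the document's terms."""
    doc_terms = index_terms(document)
    return all(lemmatize(word) in doc_terms for word in query.split())


def split_sentences(text):
    """Split on sentence-final punctuation, skipping known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends = token.endswith((".", "!", "?"))
        if ends and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

With these toy dictionaries, matches("mouse", "The mice went home") is true because the document is indexed under both mice and mouse, and split_sentences("Dr. Smith arrived. He left.") yields two sentences because the period in "Dr." is not treated as an end-of-sentence marker.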