Linguistic support for dictionary-based segmentation
If the language of a document is correctly detected and language-specific dictionaries are available, then appropriate linguistic processing is applied.
- Word segmentation
- Word segmentation is used for languages that do not use white spaces (or delimiters) between words, such as Japanese and Chinese.
- Lemmatization
- Lemmatization is a form of linguistic processing that determines
the lemma for each word form that occurs in text. The lemma of
a word encompasses its base form plus inflected forms that share the
same part of speech. For example, the lemma for go encompasses go, goes, went, gone,
and going. Lemmas for nouns group singular and plural forms
(such as calf and calves). Lemmas for adjectives group
comparative and superlative forms (such as good, better,
and best). Lemmas for pronouns group different cases of the
same pronoun (such as I, me, my, and mine).
Lemmatization requires a dictionary for both indexing and searching.
Watson Explorer Content Analytics indexes the lemmas and the inflected words and lemmatizes all inflected words in a query. Lemmatization enhances search quality by finding documents that contain variants of an inflected word in the query. For example, documents that contain the word mice are found when a query includes the word mouse.
- Contraction splitting
- Search quality is improved by identifying contractions and splitting
them into their component parts. For example:
wouldn't is split into would + not
Horse's is split into Horse + 's
- Clitic identification
- Clitics are a special form of contraction, and search quality
is improved by determining their component parts. A clitic is
an element that behaves both like an affix and like a word. However, clitics
are difficult to identify because they are also part of word formation.
Unlike other morphological (word structure) phenomena, clitics occur
in a syntactic structure and their attachment to words is not part
of the word formation rules. For example:
reparti-lo-emos has the components repartir + lo + emos
l'avenue has the components le + avenue
dell'arte has the components dello + arte
- Nonalphabetic character recognition
- The linguistic processes recognize nonalphabetic characters. Depending
on the internal language-dependent logic, some nonalphabetic characters
are returned as separate lexical units of different types, and some
are grouped.
For example, apostrophes in the case of clitics are considered word parts, and they are considered full stops (or periods) in the case of unknown abbreviations. URLs, email addresses and dates are split up into several tokens.
- Abbreviation recognition
- The linguistic processes recognize abbreviations that are in the
dictionary as one lexical unit. If an abbreviation is not in the
dictionary, it is still recognized as a lexical item,
but it has no associated dictionary information.
Recognizing abbreviations correctly is vital for sentence recognition. For example, the period at the end of an abbreviation is not necessarily the end of a sentence.
- End-of-sentence marker recognition
- The linguistic processes correctly identify end-of-sentence markers for sentence segmentation.
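The following Python sketch illustrates, under simplifying assumptions, how three of the processes described above could fit together: contraction splitting, dictionary-based lemmatization at both index and query time, and abbreviation-aware sentence segmentation. It is not the Watson Explorer Content Analytics implementation. The dictionaries (LEMMAS, CONTRACTIONS, ABBREVIATIONS) and all function names are hypothetical and exist only for illustration; a real system would load language-specific dictionaries.

```python
import re

# Hypothetical toy dictionaries (assumptions for illustration only).
LEMMAS = {
    "go": "go", "goes": "go", "went": "go", "gone": "go", "going": "go",
    "mouse": "mouse", "mice": "mouse",
    "calf": "calf", "calves": "calf",
}
CONTRACTIONS = {"wouldn't": ["would", "not"]}
ABBREVIATIONS = {"dr.", "e.g.", "etc."}


def split_contractions(token):
    """Split a contraction or possessive into its component parts."""
    lower = token.lower()
    if lower in CONTRACTIONS:
        return list(CONTRACTIONS[lower])
    if lower.endswith("'s") and len(token) > 2:
        return [token[:-2], "'s"]
    return [token]


def lemmatize(token):
    """Return the dictionary lemma, or the lowercased token if unknown."""
    lower = token.lower()
    return LEMMAS.get(lower, lower)


def tokenize(text):
    """Extract words (with an optional apostrophe part) and split contractions."""
    parts = []
    for word in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text):
        parts.extend(split_contractions(word))
    return parts


def build_index(docs):
    """Index each token under both its surface form and its lemma."""
    index = {}
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            for key in {tok.lower(), lemmatize(tok)}:
                index.setdefault(key, set()).add(doc_id)
    return index


def search(index, query):
    """Lemmatize every query token and return documents matching all of them."""
    results = None
    for tok in tokenize(query):
        hits = index.get(lemmatize(tok), set())
        results = hits if results is None else results & hits
    return results or set()


def split_sentences(text):
    """Treat a trailing period as a sentence end unless the token is a known abbreviation."""
    sentences, current = [], []
    for tok in text.split():
        current.append(tok)
        if tok[-1] in ".!?" and tok.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences
```

With these toy dictionaries, tokenize("Horse's rider wouldn't go") yields Horse, 's, rider, would, not, go. A query for mouse finds a document that contains only mice, because both forms are indexed and the query is lemmatized. Because Dr. is in the abbreviation dictionary, split_sentences("Dr. Smith went home. He saw mice.") returns two sentences rather than three.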