Linguistic support for dictionary-based segmentation

If the language of a document is correctly detected and language-specific dictionaries are available, then appropriate linguistic processing is applied.

Segmentation is the process by which input text is broken down into distinct lexical units. This process includes some of the following linguistic processing activities:

Word segmentation

Word segmentation is used for languages that do not use white space or other delimiters between words, such as Japanese and Chinese.
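As an illustration of how a dictionary can drive word segmentation, the following minimal sketch uses greedy forward maximum matching. This is a common textbook technique and not necessarily the algorithm that the product uses. The function name, the `max_word_len` parameter, and the toy dictionary are assumptions made for this example.

```python
def segment(text, dictionary, max_word_len=4):
    """Greedy forward maximum matching over a toy dictionary (illustrative only)."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # A single character is always accepted so the loop makes progress.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens


# Hypothetical dictionary entries, chosen only for the example.
TOY_DICTIONARY = {"北京", "大学", "北京大学", "学生"}

print(segment("北京大学学生", TOY_DICTIONARY))  # ['北京大学', '学生']
```

Because the match is greedy, the longer entry wins when several dictionary entries start at the same position. Characters that are not covered by the dictionary fall back to single-character tokens.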

Lemmatization

Lemmatization is a form of linguistic processing that determines the lemma for each word form that occurs in text. The lemma of a word encompasses its base form plus inflected forms that share the same part of speech. For example, the lemma for go encompasses go, goes, went, gone, and going. Lemmas for nouns group singular and plural forms (such as calf and calves). Lemmas for adjectives group comparative and superlative forms (such as good, better, and best). Lemmas for pronouns group different cases of the same pronoun (such as I, me, my, and mine).

Lemmatization requires a dictionary for both indexing and searching.

Watson Content Analytics indexes the lemmas and the inflected words and lemmatizes all inflected words in a query. Lemmatization enhances search quality by finding documents that contain variants of an inflected word in the query. For example, documents that contain the word mice are found when a query includes the word mouse.
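The indexing and query behavior described above can be sketched with a small in-memory index. This is a simplified model, not the product's implementation. The dictionary contents and the names `lemmatize`, `build_index`, and `search` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical lemma dictionary: inflected form -> lemma.
LEMMA_DICTIONARY = {
    "mice": "mouse", "mouse": "mouse",
    "goes": "go", "went": "go", "gone": "go", "going": "go", "go": "go",
    "calves": "calf", "calf": "calf",
}


def lemmatize(word):
    """Return the lemma of a word, or the lowercased word if it is unknown."""
    lowered = word.lower()
    return LEMMA_DICTIONARY.get(lowered, lowered)


def build_index(documents):
    """Index both the surface form and the lemma of every word."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.split():
            index[word.lower()].add(doc_id)
            index[lemmatize(word)].add(doc_id)
    return index


def search(index, query):
    """Lemmatize each query word and return documents that match all of them."""
    results = None
    for word in query.split():
        hits = index.get(lemmatize(word), set())
        results = set(hits) if results is None else results & hits
    return results or set()


docs = {1: "Three blind mice", 2: "The cat went home", 3: "A mouse ran"}
index = build_index(docs)
print(search(index, "mouse"))  # {1, 3}
```

A query for mouse finds document 1 even though that document contains only the inflected form mice, because both the document word and the query word map to the same lemma.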

Contraction splitting

Search quality is improved by identifying contractions and splitting them into their component parts. For example:

wouldn't is split into would + not
Horse's is split into Horse + 's
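A minimal sketch of contraction splitting, assuming a small lookup table for irregular contractions and a simple rule for the 's ending. The table and the function name `split_contraction` are hypothetical; a real dictionary would be much larger and language-specific.

```python
# Hypothetical table of contractions and their component parts.
CONTRACTIONS = {
    "wouldn't": ["would", "not"],
    "can't": ["can", "not"],
    "won't": ["will", "not"],
}


def split_contraction(token):
    """Split a token into its component parts, or return it unchanged."""
    lowered = token.lower()
    if lowered in CONTRACTIONS:
        return list(CONTRACTIONS[lowered])
    # Separate a trailing 's from its host word, keeping the host's casing.
    if lowered.endswith("'s") and len(token) > 2:
        return [token[:-2], "'s"]
    return [token]


print(split_contraction("wouldn't"))  # ['would', 'not']
print(split_contraction("Horse's"))   # ['Horse', "'s"]
```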

Clitic identification

Clitics are a special form of contraction, and search quality is improved by determining their component parts. A clitic is an element that behaves partly like an affix and partly like a word. Clitics are difficult to identify because they attach to a host word and can therefore look like part of the word itself. Unlike other morphological (word structure) phenomena, however, clitics occur in a syntactic structure, and their attachment to words is not governed by word formation rules. For example:

reparti-lo-emos has the components repartir + lo + emos
l'avenue has the components le + avenue
dell'arte has the components dello + arte
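The apostrophe-based examples above can be approximated with a per-language table of elided forms. This sketch is only illustrative: the table holds just the two elided forms from the examples, the language codes and the name `split_clitic` are assumptions, and forms such as reparti-lo-emos need morphological analysis that a prefix table cannot provide.

```python
# Hypothetical table: language -> {elided prefix: full form}.
ELIDED_FORMS = {
    "fr": {"l'": "le"},
    "it": {"dell'": "dello"},
}


def split_clitic(token, language):
    """Split an elided clitic from its host word, or return the token unchanged."""
    for elided, full in ELIDED_FORMS.get(language, {}).items():
        if token.lower().startswith(elided) and len(token) > len(elided):
            return [full, token[len(elided):]]
    return [token]


print(split_clitic("l'avenue", "fr"))   # ['le', 'avenue']
print(split_clitic("dell'arte", "it"))  # ['dello', 'arte']
```

Keying the table by language reflects the point made at the start of this topic: the split is applied only when the document language is known, so the same surface string is not split in a language where it is not a clitic.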

Nonalphabetic character recognition

The linguistic processes recognize nonalphabetic characters. Depending on the internal language-dependent logic, some nonalphabetic characters are returned as separate lexical units of different types, and some are grouped.

For example, an apostrophe that belongs to a clitic is treated as part of the word, and a period that follows an unknown abbreviation is treated as a full stop. URLs, email addresses, and dates are split into several tokens.
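To show what splitting an email address or a date into several tokens can look like, the following sketch uses a single regular expression that returns runs of word characters and individual nonalphabetic characters as separate units. The actual logic is language-dependent and internal to the product, so the pattern and the name `tokenize` are assumptions made for this example.

```python
import re

# Runs of word characters, or any single character that is neither
# a word character nor white space.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Return word runs and individual nonalphabetic characters as tokens."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("user@example.com"))  # ['user', '@', 'example', '.', 'com']
print(tokenize("2024-01-15"))        # ['2024', '-', '01', '-', '15']
```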

Abbreviation recognition

The linguistic processes recognize abbreviations that are in the dictionary as one lexical unit. If an abbreviation is not in the dictionary, it is still recognized as a lexical item, but it has no associated dictionary information.

Recognizing abbreviations correctly is vital for sentence recognition. For example, the period at the end of an abbreviation is not necessarily the end of a sentence.
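The interaction between abbreviations and sentence boundaries can be sketched as follows. This is a deliberately simple model that assumes white-space-delimited text, and the abbreviation set, the marker list, and the name `split_sentences` are hypothetical.

```python
# Hypothetical abbreviation dictionary and end-of-sentence markers.
KNOWN_ABBREVIATIONS = {"dr.", "mr.", "e.g.", "etc."}
END_MARKERS = (".", "!", "?")


def split_sentences(text):
    """Split text at end markers, except after known abbreviations."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        ends_sentence = token.endswith(END_MARKERS)
        if ends_sentence and token.lower() not in KNOWN_ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Keep trailing text that has no end marker.
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("Dr. Smith arrived. He was late!"))
# ['Dr. Smith arrived.', 'He was late!']
```

This sketch never ends a sentence at a known abbreviation, which is wrong when an abbreviation such as etc. happens to close a sentence. Handling that case requires more context than a dictionary lookup on a single token.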

End-of-sentence marker recognition

The linguistic processes correctly identify end-of-sentence markers for sentence segmentation.