The Art of Tokenization
Craig Trim
The process of segmenting running text into words and sentences.
Electronic text is a linear sequence of symbols (characters or words or phrases). Naturally, before any real text processing is to be done, text needs to be segmented into linguistic units such as words, punctuation, numbers, alpha-numerics, etc. This process is called tokenization.
In English, words are often separated from each other by blanks (white space), but not all white space is equal. Both “Los Angeles” and “rock 'n' roll” are individual thoughts despite the fact that they contain multiple words and spaces. We may also need to separate single words like “I'm” into separate words “I” and “am”.
Tokenization is a kind of pre-processing in a sense; an identification of basic units to be processed. It is conventional to concentrate on pure analysis or generation while taking basic units for granted. Yet without these basic units clearly segregated it is impossible to carry out any analysis or generation.
Identifying the units that do not need to be further decomposed for subsequent processing is an extremely important step. Errors made at this stage are very likely to induce more errors at later stages of text processing and are therefore very dangerous.
What counts as a token in NLP?
The notion of a token must first be defined before computational processing can proceed. There is more to the issue than simply identifying strings delimited on both sides by spaces or punctuation.
Different notions depend on different objectives, and often different language backgrounds.
Webster and Kit suggest that finding significant tokens depends on the ability to recognize patterns displaying significant collocation. Rather than simply relying on whether a string is bounded by delimiters on either side, segmentation into significant tokens relies on a kind of pattern recognition.
Consider this hypothetical speech transcription:
where is meadows dr who asked
Collocation patterns could help determine if this is about meadows dr (Drive) or dr (Doctor) who.
Standard (White Space) Tokenization
Word tokenization may seem simple in a language that separates words by a special 'space' character. However, not every language does this (e.g. Chinese, Japanese, Thai), and a closer examination will make it clear that white space alone is not sufficient even for English.
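As a minimal sketch of why white space alone falls short, the snippet below splits a sentence on white space and nothing else. The function name and the sample sentence are made up for illustration.

```python
def whitespace_tokenize(text):
    """Split on runs of white space and do nothing else."""
    return text.split()


tokens = whitespace_tokenize("Mr. Smith moved to Los Angeles, didn't he?")
print(tokens)
```

The comma and question mark stay glued to the neighboring words, "Los Angeles" is split in two, and "didn't" is left unanalyzed.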
Addressing Specific Challenges
Tokenization is generally considered easy relative to other tasks in natural language processing, and one of the less interesting ones (for English and other segmented languages). However, errors made in this phase will propagate into later phases and cause problems. To address this, a number of advanced methods that deal with specific challenges in tokenization have been developed to complement standard tokenizers.
Bob Carpenter states that tokenization is particularly vexing in the bio-medical text domain, where there are tons of words (or at least phrasal lexical entries) that contain parentheses, hyphens, and so on, and that this turned out to be a problem for WordNet.
Another challenge for tokenization is “dirty text”. Not all text has been passed through an editing and spell-check process. Text extracted automatically from PDFs, database fields, or other sources may contain inaccurately compounded tokens, spelling errors and unexpected characters. In some cases, when text is stored in a database in fixed fields, with multiple lines per object, fields sometimes need to be reassembled but the spaces have (inconsistently) been trimmed.
It is not safe to make the assumption that source text will be perfect. A tokenizer must often be customized to the data in question.
Low-Level vs High-Level Tokenization
Determining if two or more words should stand together to form a single token (like “Rational Software Architect”) would be a high-level tokenization task. High-level segmentation is much more linguistically motivated than 'low-level' segmentation, and requires (at a minimum) relatively shallow linguistic processing.
Steps in Low-Level Tokenization
Step 1: Segmenting Text into Words
The first step in the majority of text processing applications is to segment text into words.
In all modern languages that use a Latin-, Cyrillic-, or Greek-based writing system, such as English and other European languages, word tokens are delimited by a blank space. Thus, for such languages, which are called segmented languages, token boundary identification is a somewhat trivial task, since the majority of tokens are bounded by explicit separators such as spaces and punctuation. A simple program that replaces white space with word boundaries and cuts off leading and trailing quotation marks, parentheses and punctuation already performs reasonably well.
The majority of existing tokenizers signal token boundaries by white spaces. Thus, if such a tokenizer finds two tokens directly adjacent to each other, as, for instance, when a word is followed by a comma, it inserts a white space between them.
The example given in a later section shows how a standard white space tokenizer fares on more complex input.
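The "simple program" described in this step might look roughly like the sketch below: split on white space, then cut leading and trailing punctuation off each chunk and emit it as separate tokens. The function name is hypothetical, and treating every character in Python's `string.punctuation` as strippable is an assumption made for illustration.

```python
import string

# Characters peeled off the edges of each white-space-delimited chunk (assumption).
EDGE_PUNCT = set(string.punctuation)


def simple_tokenize(text):
    tokens = []
    for chunk in text.split():
        leading, trailing = [], []
        # Peel punctuation off the front, one character at a time.
        while chunk and chunk[0] in EDGE_PUNCT:
            leading.append(chunk[0])
            chunk = chunk[1:]
        # Peel punctuation off the back, preserving the original order.
        while chunk and chunk[-1] in EDGE_PUNCT:
            trailing.insert(0, chunk[-1])
            chunk = chunk[:-1]
        tokens.extend(leading)
        if chunk:
            tokens.append(chunk)
        tokens.extend(trailing)
    return tokens


print(simple_tokenize('He said, "wait (please)."'))
# ['He', 'said', ',', '"', 'wait', '(', 'please', ')', '.', '"']
```

Punctuation inside a chunk (as in "can't") is left alone, and a trailing period is always detached, which is exactly where abbreviations start to cause trouble.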
Step 2: Handling Abbreviations
In English and other Indo-European languages, although a period is directly attached to the previous word, it is usually a separate token that signals the end of the sentence. However, when a period follows an abbreviation, it is an integral part of that abbreviation and should be tokenized together with it.
the dr. lives in a blue box.
Without addressing the challenge posed by abbreviations, this line would be delimited with the period after "dr" split off as a separate token, as if it ended a sentence.
Unfortunately, universally accepted standards for many abbreviations and acronyms do not exist.
The most widely adopted approach to the recognition of abbreviations is to maintain a list of known abbreviations. During tokenization, a word with a trailing period can be looked up in such a list: if it is found there, it is tokenized as a single token; otherwise the period is tokenized as a separate token. Naturally, the accuracy of this approach depends on how well the list of abbreviations is tailored to the text under processing. There will almost certainly be abbreviations in the text which are not included in the list. Also, abbreviations in the list can coincide with common words and trigger erroneous tokenization. For instance, 'in' can be an abbreviation for 'inches', 'no' for 'number', 'bus' for 'business', 'sun' for 'Sunday', etc.
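The list look-up approach can be sketched as follows. The abbreviation set is a tiny, hypothetical stand-in for a real, domain-tailored list, and the function names are illustrative only.

```python
# Hypothetical, deliberately tiny abbreviation list; a real one must be
# tailored to the text being processed.
KNOWN_ABBREVIATIONS = {"dr.", "mr.", "mrs.", "etc.", "e.g.", "i.e."}


def split_trailing_period(word):
    """Keep the period on known abbreviations, detach it otherwise."""
    if len(word) > 1 and word.endswith("."):
        if word.lower() in KNOWN_ABBREVIATIONS:
            return [word]
        return [word[:-1], "."]
    return [word]


def tokenize(text):
    tokens = []
    for word in text.split():
        tokens.extend(split_trailing_period(word))
    return tokens


print(tokenize("the dr. lives in a blue box."))
# ['the', 'dr.', 'lives', 'in', 'a', 'blue', 'box', '.']
```

Adding an entry such as "in." to the set would fix "12 in." but would also glue the period onto the last word of a sentence like "Come in.", which is the kind of collision with common words described above.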
The following lists are by no means comprehensive:
Step 3: Handling Hyphenated Words
Segmentation of hyphenated words answers the question 'One word or two?'
Hyphenated segments present a case of ambiguity for a tokenizer: sometimes a hyphen is part of a token (e.g. self-assessment, F-15, forty-two) and sometimes it is not (e.g. Los Angeles-based).
Segmentation of hyphenated words is task dependent. For instance, part-of-speech taggers (Chapter ii) usually treat hyphenated words as a single syntactic unit and therefore prefer them to be tokenized as single tokens. On the other hand, named entity recognition (NER) systems (Chapter 30) attempt to split a named entity from the rest of a hyphenated fragment; e.g. in parsing the fragment 'Moscow-based', such a system needs 'Moscow' to be tokenized separately from 'based' to be able to tag it as a location.
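A minimal sketch of this task dependence is shown below. The function name and the "pos"/"ner" task labels are hypothetical, and splitting only at the last hyphen is a simplifying assumption.

```python
def tokenize_hyphenated(word, task):
    """Tokenize a hyphenated word according to the downstream task."""
    if task == "ner" and "-" in word:
        # Split at the last hyphen so the candidate entity stands alone.
        head, _, tail = word.rpartition("-")
        return [head, "-", tail]
    # A part-of-speech tagger prefers the whole unit as one token.
    return [word]


print(tokenize_hyphenated("Moscow-based", "pos"))  # ['Moscow-based']
print(tokenize_hyphenated("Moscow-based", "ner"))  # ['Moscow', '-', 'based']
```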
End-of-line hyphens are used for splitting whole words into parts to justify text during typesetting. They should therefore be removed during tokenization, because they are not part of the word but rather layout instructions.
True hyphens, on the other hand, are integral parts of complex tokens, e.g. forty-seven, and should therefore not be removed. Sometimes it is difficult to distinguish a true hyphen from an end-of-line hyphen when a hyphen occurs at the end of a line.
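One possible heuristic for telling the two kinds of hyphen apart at a line break is a lexicon look-up, sketched below. The lexicon contents, the function name, and the policy of keeping the hyphen when nothing is known are all assumptions made for illustration.

```python
# Hypothetical mini-lexicon; a real system would use a full word list.
LEXICON = {"tokenization", "forty-seven", "forty", "seven"}


def join_line_break(left, right, lexicon=LEXICON):
    """Rejoin a word broken across lines.

    `left` is the fragment ending in '-' at the end of one line,
    `right` is the fragment that starts the next line.
    """
    hyphenated = left + right       # e.g. 'forty-' + 'seven'
    joined = left[:-1] + right      # e.g. 'tokeni' + 'zation'
    if hyphenated.lower() in lexicon:
        return hyphenated           # true hyphen: keep it
    if joined.lower() in lexicon:
        return joined               # end-of-line hyphen: drop it
    return hyphenated               # unknown: conservatively keep the hyphen


print(join_line_break("tokeni-", "zation"))  # tokenization
print(join_line_break("forty-", "seven"))    # forty-seven
```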
Some hyphenated compound words have made their way into the standard vocabulary. For instance, certain prefixes (and less commonly suffixes) are often written hyphenated, e.g. co-, pre-, meta-, multi-, etc.
Sententially Determined Hyphenation
Here hyphenated forms are created dynamically as a mechanism to prevent incorrect parsing of the phrase in which the words appear. There are several types of hyphenation in this class. One is created when a noun is modified by an 'ed'-verb to dynamically create an adjective, e.g. case-based, computer-linked, hand-delivered. Another case involves an entire expression when it is used as a modifier in a noun group, as in 'a three-to-five-year direct marketing plan'. In treating these cases a lexical look-up strategy is not much help, and normally such expressions are treated as a single token unless there is a need to recognize specific tokens, such as dates, measures and names, in which case they are handled by specialized subgrammars.
This hypothetical sentence poses many challenges:
the New York-based co-operative was fine- tuning forty-two K-9-like models.
Step 4: Numerical and Special Expressions
These can cause a lot of confusion for a tokenizer because they usually involve rather complex alphanumeric and punctuation syntax.
Take phone numbers, for example, which can be written in many different ways.
A pre-processor should be designed to recognize phone numbers and perform normalization. All phone numbers would then be in a single format, making the job of a tokenizer easier.
A pre-processor could recognize all these distinct variations and normalize into a single expression.
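A rough sketch of such a pre-processor is given below. It assumes North American ten-digit numbers and an arbitrary target format of NNN-NNN-NNNN; the pattern and function name are illustrative, not taken from any particular tool.

```python
import re

# Assumption: ten-digit North American numbers, optionally prefixed with +1 or 1.
PHONE_RE = re.compile(r"""
    (?<!\d)              # not in the middle of a longer digit run
    (?:\+?1[\s.-]?)?     # optional country code
    \(?(\d{3})\)?        # area code, with or without parentheses
    [\s.-]?
    (\d{3})              # exchange
    [\s.-]?
    (\d{4})              # line number
    (?!\d)               # not followed by another digit
""", re.VERBOSE)


def normalize_phone_numbers(text):
    """Rewrite every recognized phone number as NNN-NNN-NNNN."""
    return PHONE_RE.sub(lambda m: "{}-{}-{}".format(*m.groups()), text)


print(normalize_phone_numbers("Call (555) 123-4567 or 555.123.4567 or 5551234567."))
# Call 555-123-4567 or 555-123-4567 or 555-123-4567.
```

After normalization the tokenizer only needs one rule to keep a phone number together as a single token.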
"I said, 'what're you? Crazy?'" said Sandowsky. "I can't afford to do that."
The naïve white space parser is shown to perform poorly here.
The Stanford tokenizer does somewhat better than the OpenNLP tokenizer, which is to be expected. The custom parser (included in the appendix) in the 4th column, does a nearly perfect job, though without the enclitic expansion shown in the first hypothetical pass.
The more accurate (and complex) segmentation process in the fourth and fifth columns requires a morphological parsing process.
We can address some of these issues in the first three examples by treating punctuation, in addition to white space, as a word boundary. But punctuation often occurs internally, in examples like u.s.a., Ph.D., AT&T, ma'am, cap'n, 01/02/06 and stanford.edu. Similarly, assuming we want 7.1 or 82.4 as a word, we can't segment on every period, since that would segment these into "7" and "1" and "82" and "4". Should "data-base" be considered two separate tokens or a single token? The number "$2,023.74" should be considered a single token, but in this case the comma and period do not represent delimiters, whereas in other cases they might. And should the "$" sign be considered part of that token, or a separate token in its own right?
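One way to handle internal punctuation is an ordered regular expression in which more specific patterns are tried before general ones. The sketch below is an illustration under stated assumptions: the pattern set is far from complete, and it settles the open questions above by fiat (the "$" stays attached to the amount, and "data-base" is kept as one token).

```python
import re

# Alternatives are tried left to right, so specific shapes come first.
TOKEN_RE = re.compile(r"""
      \d{1,2}/\d{1,2}/\d{2,4}            # dates such as 01/02/06
    | \$?\d+(?:,\d{3})*(?:\.\d+)?        # numbers and prices: 7.1, 82.4, $2,023.74
    | (?:[A-Za-z]+\.){2,}                # dotted abbreviations: u.s.a., Ph.D.
    | [A-Za-z]+(?:[.&'-][A-Za-z]+)*      # words with internal marks: AT&T, ma'am
    | \S                                 # any other single non-space character
""", re.VERBOSE)


def regex_tokenize(text):
    return TOKEN_RE.findall(text)


print(regex_tokenize("AT&T paid $2,023.74 on 01/02/06, up 7.1 percent."))
# ['AT&T', 'paid', '$2,023.74', 'on', '01/02/06', ',', 'up', '7.1', 'percent', '.']
```

The approach still has blind spots: a sentence ending in "stanford.edu." would swallow the final period as if it were part of a dotted abbreviation, which is why pattern-based tokenizers are usually combined with look-up lists.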
Named Entity Extraction
It's almost impossible to separate tokenization from named entity extraction. It really isn't possible to come up with a generic set of rules that will handle all ambiguous cases within English; the easiest approach is usually just to have multi-word expression dictionaries.
Install Rational Software Architect on AIX 5.3
Dictionaries will have to exist that express to the tokenization process that "Rational Software Architect for WebSphere" is a single token (a product), and "AIX 5.3" is likewise a single product.
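A dictionary-driven pass over low-level tokens can be sketched as a greedy longest-match merge. The dictionary entries below are hypothetical examples based on the product names mentioned here, and the function name is illustrative.

```python
# Hypothetical multi-word expression dictionary, stored as lower-cased tuples.
MWE_DICTIONARY = {
    ("rational", "software", "architect"),
    ("rational", "software", "architect", "for", "websphere"),
    ("aix", "5.3"),
}
MAX_LEN = max(len(entry) for entry in MWE_DICTIONARY)


def merge_multiword(tokens):
    """Greedily merge the longest dictionary match starting at each position."""
    merged, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first, down to two tokens.
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in MWE_DICTIONARY:
                merged.append(" ".join(tokens[i:i + n]))
                i += n
                break
        else:
            # No multi-word entry starts here; keep the single token.
            merged.append(tokens[i])
            i += 1
    return merged


print(merge_multiword("Install Rational Software Architect on AIX 5.3".split()))
# ['Install', 'Rational Software Architect', 'on', 'AIX 5.3']
```

Trying the longest candidate first means the five-word product name wins over its three-word prefix whenever both are present in the input.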
The impact that tokenization has upon the rest of the process cannot be overstated. A typical next step, following tokenization, is to send the segmented text to a deep parser. In the first column, the Rational product would end up being deep parsed into a structure like this:
OpenNLP 1.5.2 (en-
<node prob="0.99" span="Rational Software Architect for WebSphere" type="NP">
  <node prob="1.0" span="Rational Software Architect" type="NP">
    <node prob="0.86" span="Rational" type="NNP"/>
    <node prob="0.94" span="Software" type="NNP"/>
    <node prob="0.93" span="Architect" type="NNP"/>
  </node>
  <node prob="0.99" span="for WebSphere" type="PP">
    <node prob="0.93" span="for" type="IN"/>
    <node prob="1.0" span="WebSphere" type="NP">
      <node prob="0.24" span="WebSphere" type="NNP"/>
    </node>
  </node>
</node>
(Output from Stanford 2.0.3 is identical)
Note the formation of a prepositional phrase (PP) around "for WebSphere" and the noun phrase trigram "Rational Software Architect". If the sentence was semantically segmented with the aid of a multi-word dictionary, the output from the deep parser would have looked like this:
<node span="Rational Software Architect for WebSphere" type="NP">
  <node span="Rational Software Architect for WebSphere" type="NNP"/>
</node>
There is a single noun phrase containing one noun (NNP = proper noun, singular).
A clitic is a unit whose status lies between that of an affix and a word. Phonologically, clitics behave like affixes: they tend to be short and unaccented. Syntactically, they behave more like words, often acting as pronouns, articles, conjunctions, or verbs. Clitics preceding a word are called proclitics, and those following are enclitics.
English enclitics include:
The abbreviated forms of be:
The abbreviated forms of auxiliary verbs:
Note that clitics in English are ambiguous. The word "she's" can mean "she has" or "she is".
A tokenizer can also be used to expand clitic contractions that are marked by apostrophes, for example:
what're => what are
This requires ambiguity resolution, since apostrophes are also used as genitive markers, as in "the book's over in the containers' above", or as quotative markers. While these contractions tend to be clitics, not all clitics are marked this way with contractions. In general, then, segmenting and expanding clitics can be done as part of a morphological parsing process.
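A minimal sketch of clitic expansion is shown below. It expands only contractions with a single reading and merely splits off the ambiguous 's and 'd without choosing one; the tables and the function name are assumptions made for illustration, and real disambiguation would need the morphological parsing mentioned above.

```python
import re

# Contractions with a single expansion.
EXPANSIONS = {"'m": "am", "'re": "are", "'ve": "have", "'ll": "will"}

# Host word followed by one apostrophe-marked enclitic.
CLITIC_RE = re.compile(r"^(\w+)('(?:m|re|ve|ll|s|d))$", re.IGNORECASE)


def expand_clitic(token):
    match = CLITIC_RE.match(token)
    if not match:
        return [token]
    host, clitic = match.group(1), match.group(2).lower()
    if clitic in EXPANSIONS:
        return [host, EXPANSIONS[clitic]]
    # 's (is / has / genitive) and 'd (would / had) are ambiguous:
    # split them off but leave the decision to a later stage.
    return [host, clitic]


print(expand_clitic("what're"))  # ['what', 'are']
print(expand_clitic("she's"))    # ['she', "'s"]
```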
Enclitic Analysis (New York Times):
Appendix A - Analysis of clitic usage from the NYT Corpus
Appendix B - Source Code