Your humble writer knows a little bit about a lot of things, but despite writing a fair amount about text processing (a book, for example), linguistic processing is a relatively novel area for me. Forgive me if I stumble through my explanations of the quite remarkable Natural Language Toolkit (NLTK), a wonderful tool for teaching, and working in, computational linguistics using Python. Computational linguistics, moreover, is closely related to the fields of artificial intelligence, language/speech recognition, translation, and grammar checking.
It is natural to think of NLTK as a stacked series of layers that build on each other. Readers familiar with lexing and parsing of artificial languages (like, say, Python) will not have too much of a leap to understand the similar -- but deeper -- layers involved in natural language modeling.
While NLTK comes with a number of corpora that have been pre-processed (often manually) to various degrees, conceptually each layer relies on the processing in the adjacent lower layer. Tokenization comes first; then words are tagged; then groups of words are parsed into grammatical elements, like noun phrases or sentences (according to one of several techniques, each with advantages and drawbacks); and finally sentences or other grammatical units can be classified. Along the way, NLTK gives you the ability to generate statistics about occurrences of various elements, and draw graphs that represent either the processing itself, or statistical aggregates in results.
In this article, you'll see some relatively fleshed-out examples from the lower-level capabilities, but most of the higher-level capabilities will be simply described abstractly. Let's now take the first steps past text processing, narrowly construed.
Much of what you can do with NLTK, particularly at its lower levels, is not that much different from what you can do with Python's basic data structures. But NLTK provides a set of regularized interfaces that are relied on and utilized at higher levels, in addition to simply providing convenient classes to hold tokenized and tagged texts.
In particular, the class nltk.tokenizer.Token is used widely to
store annotated segments of text; these annotations can mark a number
of different features, including parts-of-speech, subtoken structures,
offsets of a token within a larger text, morphological stems,
grammatical sentence components, and so on. In fact, a Token is a
special kind of dictionary -- and is accessed in the fashion of a
dictionary -- so it can contain whatever keys you like. A few special
keys are used in NLTK, different ones by the various subpackages.
Let's look briefly at creating a token and breaking it into subtokens:
Listing 1. A first look at the nltk.tokenizer.Token class
>>> from nltk.tokenizer import * >>> t = Token(TEXT='This is my first test sentence') >>> WSTokenizer().tokenize(t, addlocs=True) # break on whitespace >>> print t['TEXT'] This is my first test sentence >>> print t['SUBTOKENS'] [<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c], <test>@[17:21c], <sentence>@[22:30c]] >>> t['foo'] = 'bar' >>> t <TEXT='This is my first test sentence', foo='bar', SUBTOKENS=[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c], <test>@[17:21c], <sentence>@[22:30c]]> >>> print t['SUBTOKENS'][0] <This>@[0:4c] >>> print type(t['SUBTOKENS'][0]) <class 'nltk.token.SafeToken'> |
One fairly simple thing you are likely to do with linguistic corpora is analyze frequencies of various events within them, and make probability predictions based on these known frequencies. NLTK supports a variety of techniques for projecting probabilities based on raw frequency data. I will not cover those here (see the Probability tutorial listed in Resources), but suffice it to say that what you are warranted to expect has a slightly fuzzy relationship to what you already know (beyond the obvious scaling/normalization).
In essence, there are two types of frequency supported by NLTK:
histograms and conditional frequencies. The class
nltk.probability.FreqDist is used to create histograms; for example,
a word histogram can be created with:
Listing 2. Basic histogram creation with nltk.probability.FreqDist
>>> from nltk.probability import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> freq = FreqDist()
>>> for word in article['SUBTOKENS']:
... freq.inc(word['TEXT'])
>>> freq.B()
1194
>>> freq.count('Python')
12
|
The Probability tutorial discusses creation of histograms on more complex
features, like "the length of words following words ending in vowels."
The class nltk.draw.plot.Plot is useful for visualization of
histograms. Of course, you can equally analyze frequencies of
higher-level grammatical features, or even of data sets unrelated to NLTK.
Conditional frequencies are perhaps more interesting than plain histograms. A conditional frequency is a kind of two-dimensional histogram -- it gives you one histogram per initial condition, or "context." For example, the tutorial suggests the question of what the distribution of word lengths is for each first letter. We can analyze that with:
Listing 3. Conditional frequencies: Word length per initial letter
>>> cf = ConditionalFreqDist() >>> for word in article['SUBTOKENS']: ... cf[word['TEXT'][0]].inc(len(word['TEXT'])) ... >>> init_letters = cf.conditions() >>> init_letters.sort() >>> for c in init_letters[44:50]: ... print "Init %s:" % c, ... for length in range(1,6): ... print "len %d/%.2f," % (length,cf[c].freq(n)), ... print ... Init a: len 1/0.03, len 2/0.03, len 3/0.03, len 4/0.03, len 5/0.03, Init b: len 1/0.12, len 2/0.12, len 3/0.12, len 4/0.12, len 5/0.12, Init c: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06, Init d: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06, Init e: len 1/0.18, len 2/0.18, len 3/0.18, len 4/0.18, len 5/0.18, Init f: len 1/0.25, len 2/0.25, len 3/0.25, len 4/0.25, len 5/0.25, |
A nice linguistic use of conditional frequencies is to analyze the syntagmatic distributions in corpora -- for example, given the occurrence of a particular word, what words are most likely to come next. Grammars provide some constraints here, of course; but the study of selection among syntactic options falls within the fields of semantics, pragmatics, and register.
The class nltk.stemmer.porter.PorterStemmer is a wonderfully handy
tool to derive grammatical (prefix) stems from English words. This
capability struck a particular chord for me, having previously created
a public-domain, full-text indexed search tool/library in Python
(described in Developing
a full-text indexer in Python, and used by a moderately large
number of other projects).
While it is quite useful to be able to search a large collection of
documents almost instantly for a joint occurrence of a collection of
exact words (what gnosis.indexer does), for many searching purposes,
a little fuzziness would help. Perhaps you are not quite sure whether
the old e-mail you are looking for used the word "complicated,"
"complications," "complicating," or "complicates," but you remember
that was one of the general concepts involved (probably with a few
others to perform a useful search).
NLTK includes an excellent algorithm for word stemming, and lets you customize stemming algorithms further to your liking:
Listing 4. Stemming words for morphological roots
>>> from nltk.stemmer.porter import PorterStemmer
>>> PorterStemmer().stem_word('complications')
'complic'
|
Exactly how you might utilize stemming within gnosis.indexer, a tool derived from it, or a wholly different indexing tool, depends on your usage scenarios. Fortunately, gnosis.indexer has an open interface that is easy to specialize. Do you want an index composed entirely of stems? Or do you want to include both full words and stems in the index? Are matches on exact words to be ranked higher than matches on stems? Do you want to separate stem matches from exact matches in your results? I will include some sort of stemming capability in future versions of gnosis.indexer, but end users might still want to customize differently.
In general, however, adding stemming is very simple: First, derive
stems from a document by specializing gnosis.indexer.TextSplitter;
second, when you perform a search, (optionally) stem the search terms
before using them for index lookup, probably by customizing your
MyIndexer.find() method.
I made the discovery, while playing with the PorterStemmer, that the class
nltk.tokenizer.WSTokenizer really is as bad
as NLTK's tutorial
warns. It is fine to occupy a conceptual role, but for real-life
texts, you can do a lot better at identifying what a "word" is.
Fortunately, gnosis.indexer.TextSplitter is a robust
tokenizer. For example:
Listing 5. Stemming based on poor NLTK tokenization
>>> from nltk.tokenizer import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> from nltk.probability import *
>>> from nltk.stemmer.porter import *
>>> stemmer = PorterStemmer()
>>> stems = FreqDist()
>>> for word in article['SUBTOKENS']:
... stemmer.stem(word)
... stems.inc(word['STEM'].lower())
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[20:40]
['"generator-bas', '"implement', '"lazili', '"magic"', '"partial',
'"pluggable"', '"primitives"', '"repres', '"secur', '"semi-coroutines."',
'"state', '"understand', '"weightless', '"whatev', '#', '#-----',
'#----------', '#-------------', '#---------------', '#b17:']
|
Looking at a few stems, the collection does not look all that useful for indexing. Many are not really words at all, while others are compound phrases with dashes, and extraneous punctuation makes it in with the words. Let's try it with better tokenization:
Listing 6. Stemming using clever heuristics in tokenization
>>> article = TS().text_splitter(open('cp-b17.txt').read())
>>> stems = FreqDist()
>>> for word in article:
... stems.inc(stemmer.stem_word(word.lower()))
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[60:80]
['bool', 'both', 'boundari', 'brain', 'bring', 'built', 'but', 'byte',
'call', 'can', 'cannot', 'capabl', 'capit', 'carri', 'case', 'cast',
'certain', 'certainli', 'chang', 'charm']
|
Here you can see several words that have several possible expansions,
and all the words look like words, or like morphemes. Tokenization
matters a lot for random text collections; in fairness to NLTK, its
bundled corpora have been packaged for easy and accurate tokenization
with WSTokenizer(). For a robust real-world indexer, use robust
tokenization.
Tagging, chunking, and parsing
The largest part of NLTK consists of various parsers, of varying levels of sophistication. For the most part, this introduction will not explain their details, but I would like to give a first brush at what they hope to achieve.
As background, remember that tokens are special dictionaries -- in
particular, ones that can contain a key TAG to indicate the
grammatical role of a word. NLTK corpora documents often come
pre-tagged for parts of speech, but you can certainly add your own
tags to untagged documents.
Chunking is something like "parsing lite." That is, chunking is based either on existing markup of grammatical components, or is something you add manually -- or semi-automatically using regular expressions and program logic. But it is not really parsing, properly speaking (no production rules as such). For example:
Listing 7. Chunk parsing/tagging: words and and bigger bits
>>> from nltk.parser.chunk import ChunkedTaggedTokenizer >>> chunked = "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]" >>> sentence = Token(TEXT=chunked) >>> tokenizer = ChunkedTaggedTokenizer(chunk_node='NP') >>> tokenizer.tokenize(sentence) >>> sentence['SUBTOKENS'][0] (NP: <the/DT> <little/JJ> <cat/NN>) >>> sentence['SUBTOKENS'][0]['NODE'] 'NP' >>> sentence['SUBTOKENS'][0]['CHILDREN'][0] <the/DT> >>> sentence['SUBTOKENS'][0]['CHILDREN'][0]['TAG'] 'DT' >>> chunk_structure = TreeToken(NODE='S', CHILDREN=sentence['SUBTOKENS']) (S: (NP: <the/DT> <little/JJ> <cat/NN>) <sat/VBD> <on/IN> (NP: <the/DT> <mat/NN>)) |
Chunking, as mentioned, can be done with the class
nltk.tokenizer.RegexpChunkParser using pseudo-regular-expressions to
describe series of tags making up a grammatical element. Here's an
example from the Probability tutorial:
Listing 8. Chunking with regular expressions on tags
>>> rule1 = ChunkRule('<DT>?<JJ.*>*<NN.*>',
... 'Chunk optional det, zero or more adj, and a noun')
>>> chunkparser = RegexpChunkParser([rule1], chunk_node='NP', top_node='S')
>>> chunkparser.parse(sentence)
>>> print sent['TREE']
(S: (NP: <the/DT> <little/JJ> <cat/NN>)
<sat/VBD> <on/IN>
(NP: <the/DT> <mat/NN>))
|
True parsing gets us into a lot of theoretical areas. For example,
top-down parsers are guaranteed to find every possible production, but
can be extremely slow because of frequence (exponention order)
backtracking. Shift-reduce parsing is much more efficient, but can
miss some productions. In either case, a grammar is declared in a
manner similar to those created to parse artificial languages. This
column has looked at some of those: SimpleParse, mx.TextTools,
Spark, and gnosis.xml.validity (see Resources).
Even beyond top-down and shift-reduce parser, NLTK also offers "chart parsers" that create partial hypotheses that a given sequence can be continued to fulfill a rule. This approach can be both efficient and complete. A quick (toy) example illustrates:
Listing 9. Defining basic productions for a context-free grammar
>>> from nltk.parser.chart import *
>>> grammar = CFG.parse('''
... S -> NP VP
... VP -> V NP | VP PP
... V -> "saw" | "ate"
... NP -> "John" | "Mary" | "Bob" | Det N | NP PP
... Det -> "a" | "an" | "the" | "my"
... N -> "dog" | "cat" | "cookie"
... PP -> P NP
... P -> "on" | "by" | "with"
... ''')
>>> sentence = Token(TEXT='John saw a cat with my cookie')
>>> WSTokenizer().tokenize(sentence)
>>> parser = ChartParser(grammar, BU_STRATEGY, LEAF='TEXT')
>>> parser.parse_n(sentence)
>>> for tree in sentence['TREES']: print tree
(S:
(NP: <John>)
(VP:
(VP: (V: <saw>) (NP: (Det: <a>) (N: <cat>)))
(PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>)))))
(S:
(NP: <John>)
(VP:
(V: <saw>)
(NP:
(NP: (Det: <a>) (N: <cat>))
(PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>))))))
|
A probabilistic context-free grammar (or PCFG) is a context-free grammar that associates a probability with each of its productions. Again, parsers for probabilistic parsing are also bundled with NLTK.
NLTK has other important features that this brief introduction could not get to. For example, NLTK has a whole framework for text classification using statistical techniques like "naive Bayesian" and "maximum entropy" models. Heady stuff that I cannot yet explain, even if I had space. But I think even NLTK's lower layers make it a useful framework for both pedagogical and practical applications.
- The Natural Language
Toolkit (NLTK) is hosted by Sourceforge, and both its
expressions home page and associated documentation, downloads, and
various other resources can be found there. The documentation for NLTK is accessible through the project home page. From there you
can find API reference guides to several versions of the library. At the
time of this writing, 1.3 was stable and 1.4 was in alpha; but when you read
it, later versions may exist.
- Of particular use to the new user of NLTK -- including the author, as he wrote this article -- are the series of nine
tutorials on NTLK. As of this writing, they generally cover respective
subpackages of NLTK; David particularly enjoyed a Probability tutorial on modelling
probablistic systems. Three supplementary tutorials introduce Python
more generally for linguistics students who may not already know the
language (or for other folks). These tutorials are helpful and
well-written, but occasional details seem not to match the most current
API version.
- An earlier Charming Python column covered Developing
a full-text indexer in Python and presented the tool
gnosis.indexer. - Other Charming Python installments have looked at parsers for artificial
languages:
- Parsing with the SimpleParse module
- Parsing with the Spark module
- Creating declarative mini-languages
- Browse all of the Charming Python columns on developerWorks.
- David's Text Processing in
Python (Addison Wesley, 2003) introduces
regular expressions, formal parsers, text manipulation, and even state
machines. It is a good start for working your way up to computational
linguistics. Moreover, David was delighted to discover during researching
this article that one utility function from his book was incorporated into
the NLTK package.
- Learn more about computational linguistics from the Computational Linguistics
FAQ and the quarterly ACL journal Computational
Linguistics.
- Wikipedia, the free encyclopedia, offers an overview of Computational
linguistics and context-free
grammars with, as always, good links to more information about both.
- Context-free (and other formal) grammars are pretty complex. See also
Context-Free
Grammars and Parsing from the Eli Project, as well as the Wikipedia
article on the Chomsky
hierarchy.
- Find more resources for Linux™ developers in the developerWorks Linux
zone.
- Browse for books on these and other technical topics.
- Develop and test your Linux applications using the latest IBM tools
and middleware with a developerWorks Subscription: you get IBM software from
WebSphere®, DB2®, Lotus®, Rational®, and Tivoli®, and a license to use the
software for 12 months, all for less money than you might think.
- Download no-charge trial versions of selected developerWorks
Subscription products that run on Linux, including WebSphere Studio Site
Developer, WebSphere SDK for Web services, WebSphere Application Server,
DB2 Universal Database Personal Developers Edition, Tivoli Access Manager,
and Lotus Domino Server, from the Speed-start
your Linux app section of developerWorks. For an even speedier start,
help yourself to a product-by-product collection of how-to articles and
tech support.
David Mertz had no idea he was writing prose this whole time. You can reach David at mertz@gnosis.cx; you can investigate all aspects of his life at his personal Web page. Check out his book, Text Processing in Python. Suggestions and recommendations on past or future columns are welcome.





