Get started with the Natural Language Toolkit
Using Python in computational linguistics
This content is part # of # in the series: Charming Python
This content is part of the series:Charming Python
Stay tuned for additional content in this series.
Your humble writer knows a little bit about a lot of things, but despite writing a fair amount about text processing (a book, for example), linguistic processing is a relatively novel area for me. Forgive me if I stumble through my explanations of the quite remarkable Natural Language Toolkit (NLTK), a wonderful tool for teaching, and working in, computational linguistics using Python. Computational linguistics, moreover, is closely related to the fields of artificial intelligence, language/speech recognition, translation, and grammar checking.
What NLTK includes
It is natural to think of NLTK as a stacked series of layers that build on each other. Readers familiar with lexing and parsing of artificial languages (like, say, Python) will not have too much of a leap to understand the similar -- but deeper -- layers involved in natural language modeling.
While NLTK comes with a number of corpora that have been pre-processed (often manually) to various degrees, conceptually each layer relies on the processing in the adjacent lower layer. Tokenization comes first; then words are tagged; then groups of words are parsed into grammatical elements, like noun phrases or sentences (according to one of several techniques, each with advantages and drawbacks); and finally sentences or other grammatical units can be classified. Along the way, NLTK gives you the ability to generate statistics about occurrences of various elements, and draw graphs that represent either the processing itself, or statistical aggregates in results.
In this article, you'll see some relatively fleshed-out examples from the lower-level capabilities, but most of the higher-level capabilities will be simply described abstractly. Let's now take the first steps past text processing, narrowly construed.
Much of what you can do with NLTK, particularly at its lower levels, is not that much different from what you can do with Python's basic data structures. But NLTK provides a set of regularized interfaces that are relied on and utilized at higher levels, in addition to simply providing convenient classes to hold tokenized and tagged texts.
In particular, the class
nltk.tokenizer.Token is used widely to store
annotated segments of text; these annotations can mark a number of different
features, including parts-of-speech, subtoken structures, offsets of a token within
a larger text, morphological stems, grammatical sentence components, and so on. In
Token is a special kind of dictionary -- and is accessed in the
fashion of a dictionary -- so it can contain whatever keys you like. A few special
keys are used in NLTK, different ones by the various subpackages.
Let's look briefly at creating a token and breaking it into subtokens:
Listing 1. A first look at the nltk.tokenizer.Token class
>>> from nltk.tokenizer import * >>> t = Token(TEXT='This is my first test sentence') >>> WSTokenizer().tokenize(t, addlocs=True) # break on whitespace >>> print t['TEXT'] This is my first test sentence >>> print t['SUBTOKENS'] [<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c], <test>@[17:21c], <sentence>@[22:30c]] >>> t['foo'] = 'bar' >>> t <TEXT='This is my first test sentence', foo='bar', SUBTOKENS=[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c], <test>@[17:21c], <sentence>@[22:30c]]> >>> print t['SUBTOKENS'] <This>@[0:4c] >>> print type(t['SUBTOKENS']) <class 'nltk.token.SafeToken'>
One fairly simple thing you are likely to do with linguistic corpora is analyze frequencies of various events within them, and make probability predictions based on these known frequencies. NLTK supports a variety of techniques for projecting probabilities based on raw frequency data. I will not cover those here (see the Probability tutorial listed in Related topics), but suffice it to say that what you are warranted to expect has a slightly fuzzy relationship to what you already know (beyond the obvious scaling/normalization).
In essence, there are two types of frequency supported by NLTK: histograms and
conditional frequencies. The class
nltk.probability.FreqDist is used to
create histograms; for example, a word histogram can be created with:
Listing 2. Basic histogram creation with nltk.probability.FreqDist
>>> from nltk.probability import * >>> article = Token(TEXT=open('cp-b17.txt').read()) >>> WSTokenizer().tokenize(article) >>> freq = FreqDist() >>> for word in article['SUBTOKENS']: ... freq.inc(word['TEXT']) >>> freq.B() 1194 >>> freq.count('Python') 12
The Probability tutorial discusses creation of histograms on more complex features,
like "the length of words following words ending in vowels." The class
nltk.draw.plot.Plot is useful for visualization of histograms. Of
course, you can equally analyze frequencies of higher-level grammatical features, or
even of data sets unrelated to NLTK.
Conditional frequencies are perhaps more interesting than plain histograms. A conditional frequency is a kind of two-dimensional histogram -- it gives you one histogram per initial condition, or "context." For example, the tutorial suggests the question of what the distribution of word lengths is for each first letter. We can analyze that with:
Listing 3. Conditional frequencies: Word length per initial letter
>>> cf = ConditionalFreqDist() >>> for word in article['SUBTOKENS']: ... cf[word['TEXT']].inc(len(word['TEXT'])) ... >>> init_letters = cf.conditions() >>> init_letters.sort() >>> for c in init_letters[44:50]: ... print "Init %s:" % c, ... for length in range(1,6): ... print "len %d/%.2f," % (length,cf[c].freq(n)), ... print ... Init a: len 1/0.03, len 2/0.03, len 3/0.03, len 4/0.03, len 5/0.03, Init b: len 1/0.12, len 2/0.12, len 3/0.12, len 4/0.12, len 5/0.12, Init c: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06, Init d: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06, Init e: len 1/0.18, len 2/0.18, len 3/0.18, len 4/0.18, len 5/0.18, Init f: len 1/0.25, len 2/0.25, len 3/0.25, len 4/0.25, len 5/0.25,
A nice linguistic use of conditional frequencies is to analyze the syntagmatic distributions in corpora -- for example, given the occurrence of a particular word, what words are most likely to come next. Grammars provide some constraints here, of course; but the study of selection among syntactic options falls within the fields of semantics, pragmatics, and register.
nltk.stemmer.porter.PorterStemmer is a wonderfully handy tool
to derive grammatical (prefix) stems from English words. This capability struck a
particular chord for me, having previously created a public-domain, full-text
indexed search tool/library in Python and used by a moderately large number of other
While it is quite useful to be able to search a large collection of documents almost
instantly for a joint occurrence of a collection of exact words (what
gnosis.indexer does), for many searching purposes, a little
fuzziness would help. Perhaps you are not quite sure whether the old e-mail you are
looking for used the word "complicated," "complications," "complicating," or
"complicates," but you remember that was one of the general concepts involved
(probably with a few others to perform a useful search).
NLTK includes an excellent algorithm for word stemming, and lets you customize stemming algorithms further to your liking:
Listing 4. Stemming words for morphological roots
>>> from nltk.stemmer.porter import PorterStemmer >>> PorterStemmer().stem_word('complications') 'complic'
Exactly how you might utilize stemming within gnosis.indexer, a tool derived from it, or a wholly different indexing tool, depends on your usage scenarios. Fortunately, gnosis.indexer has an open interface that is easy to specialize. Do you want an index composed entirely of stems? Or do you want to include both full words and stems in the index? Are matches on exact words to be ranked higher than matches on stems? Do you want to separate stem matches from exact matches in your results? I will include some sort of stemming capability in future versions of gnosis.indexer, but end users might still want to customize differently.
In general, however, adding stemming is very simple: First, derive stems from a
document by specializing
gnosis.indexer.TextSplitter; second, when you
perform a search, (optionally) stem the search terms before using them for index
lookup, probably by customizing your
I made the discovery, while playing with the
PorterStemmer, that the
nltk.tokenizer.WSTokenizer really is as bad as NLTK's tutorial
warns. It is fine to occupy a conceptual role, but for real-life texts, you can do a
lot better at identifying what a "word" is. Fortunately,
gnosis.indexer.TextSplitter is a robust tokenizer. For example:
Listing 5. Stemming based on poor NLTK tokenization
>>> from nltk.tokenizer import * >>> article = Token(TEXT=open('cp-b17.txt').read()) >>> WSTokenizer().tokenize(article) >>> from nltk.probability import * >>> from nltk.stemmer.porter import * >>> stemmer = PorterStemmer() >>> stems = FreqDist() >>> for word in article['SUBTOKENS']: ... stemmer.stem(word) ... stems.inc(word['STEM'].lower()) ... >>> word_stems = stems.samples() >>> word_stems.sort() >>> word_stems[20:40] ['"generator-bas', '"implement', '"lazili', '"magic"', '"partial', '"pluggable"', '"primitives"', '"repres', '"secur', '"semi-coroutines."', '"state', '"understand', '"weightless', '"whatev', '#', '#-----', '#----------', '#-------------', '#---------------', '#b17:']
Looking at a few stems, the collection does not look all that useful for indexing. Many are not really words at all, while others are compound phrases with dashes, and extraneous punctuation makes it in with the words. Let's try it with better tokenization:
Listing 6. Stemming using clever heuristics in tokenization
>>> article = TS().text_splitter(open('cp-b17.txt').read()) >>> stems = FreqDist() >>> for word in article: ... stems.inc(stemmer.stem_word(word.lower())) ... >>> word_stems = stems.samples() >>> word_stems.sort() >>> word_stems[60:80] ['bool', 'both', 'boundari', 'brain', 'bring', 'built', 'but', 'byte', 'call', 'can', 'cannot', 'capabl', 'capit', 'carri', 'case', 'cast', 'certain', 'certainli', 'chang', 'charm']
Here you can see several words that have several possible expansions, and all the
words look like words, or like morphemes. Tokenization matters a lot for random text
collections; in fairness to NLTK, its bundled corpora have been packaged for easy
and accurate tokenization with
WSTokenizer(). For a robust real-world
indexer, use robust tokenization.
Tagging, chunking, and parsing
The largest part of NLTK consists of various parsers, of varying levels of sophistication. For the most part, this introduction will not explain their details, but I would like to give a first brush at what they hope to achieve.
As background, remember that tokens are special dictionaries -- in particular, ones
that can contain a key
TAG to indicate the grammatical role of a word.
NLTK corpora documents often come pre-tagged for parts of speech, but you can
certainly add your own tags to untagged documents.
Chunking is something like "parsing lite." That is, chunking is based either on existing markup of grammatical components, or is something you add manually -- or semi-automatically using regular expressions and program logic. But it is not really parsing, properly speaking (no production rules as such). For example:
Listing 7. Chunk parsing/tagging: words and and bigger bits
>>> from nltk.parser.chunk import ChunkedTaggedTokenizer >>> chunked = "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]" >>> sentence = Token(TEXT=chunked) >>> tokenizer = ChunkedTaggedTokenizer(chunk_node='NP') >>> tokenizer.tokenize(sentence) >>> sentence['SUBTOKENS'] (NP: <the/DT> <little/JJ> <cat/NN>) >>> sentence['SUBTOKENS']['NODE'] 'NP' >>> sentence['SUBTOKENS']['CHILDREN'] <the/DT> >>> sentence['SUBTOKENS']['CHILDREN']['TAG'] 'DT' >>> chunk_structure = TreeToken(NODE='S', CHILDREN=sentence['SUBTOKENS']) (S: (NP: <the/DT> <little/JJ> <cat/NN>) <sat/VBD> <on/IN> (NP: <the/DT> <mat/NN>))
Chunking, as mentioned, can be done with the class
nltk.tokenizer.RegexpChunkParser using pseudo-regular-expressions
to describe series of tags making up a grammatical element. Here's an example from
the Probability tutorial:
Listing 8. Chunking with regular expressions on tags
>>> rule1 = ChunkRule('<DT>?<JJ.*>*<NN.*>', ... 'Chunk optional det, zero or more adj, and a noun') >>> chunkparser = RegexpChunkParser([rule1], chunk_node='NP', top_node='S') >>> chunkparser.parse(sentence) >>> print sent['TREE'] (S: (NP: <the/DT> <little/JJ> <cat/NN>) <sat/VBD> <on/IN> (NP: <the/DT> <mat/NN>))
True parsing gets us into a lot of theoretical areas. For example, top-down parsers
are guaranteed to find every possible production, but can be extremely slow because
of frequence (exponention order) backtracking. Shift-reduce parsing is much more
efficient, but can miss some productions. In either case, a grammar is declared in a
manner similar to those created to parse artificial languages. This column has
looked at some of those:
gnosis.xml.validity (see Related topics).
Even beyond top-down and shift-reduce parser, NLTK also offers "chart parsers" that create partial hypotheses that a given sequence can be continued to fulfill a rule. This approach can be both efficient and complete. A quick (toy) example illustrates:
Listing 9. Defining basic productions for a context-free grammar
>>> from nltk.parser.chart import * >>> grammar = CFG.parse(''' ... S -> NP VP ... VP -> V NP | VP PP ... V -> "saw" | "ate" ... NP -> "John" | "Mary" | "Bob" | Det N | NP PP ... Det -> "a" | "an" | "the" | "my" ... N -> "dog" | "cat" | "cookie" ... PP -> P NP ... P -> "on" | "by" | "with" ... ''') >>> sentence = Token(TEXT='John saw a cat with my cookie') >>> WSTokenizer().tokenize(sentence) >>> parser = ChartParser(grammar, BU_STRATEGY, LEAF='TEXT') >>> parser.parse_n(sentence) >>> for tree in sentence['TREES']: print tree (S: (NP: <John>) (VP: (VP: (V: <saw>) (NP: (Det: <a>) (N: <cat>))) (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>))))) (S: (NP: <John>) (VP: (V: <saw>) (NP: (NP: (Det: <a>) (N: <cat>)) (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>))))))
A probabilistic context-free grammar (or PCFG) is a context-free grammar that associates a probability with each of its productions. Again, parsers for probabilistic parsing are also bundled with NLTK.
What are you waiting for?
NLTK has other important features that this brief introduction could not get to. For example, NLTK has a whole framework for text classification using statistical techniques like "naive Bayesian" and "maximum entropy" models. Heady stuff that I cannot yet explain, even if I had space. But I think even NLTK's lower layers make it a useful framework for both pedagogical and practical applications.
- The Natural Language Toolkit (NLTK) is hosted by Sourceforge, and both its expressions home page and associated documentation, downloads, and various other resources can be found there. The documentation for NLTK is accessible through the project home page. From there you can find API reference guides to several versions of the library. At the time of this writing, 1.3 was stable and 1.4 was in alpha; but when you read it, later versions may exist.
- Other Charming Python installments have looked at parsers for artificial languages:
- Browse all of the Charming Python columns on developerWorks.
- David's Text Processing in Python (Addison Wesley, 2003) introduces regular expressions, formal parsers, text manipulation, and even state machines. It is a good start for working your way up to computational linguistics. Moreover, David was delighted to discover during researching this article that one utility function from his book was incorporated into the NLTK package.
- Learn more about computational linguistics from the Computational Linguistics FAQ and the quarterly ACL journal Computational Linguistics.
- Wikipedia, the free encyclopedia, offers an overview of Computational linguistics and context-free grammars with, as always, good links to more information about both.