Charming Python: Get started with the Natural Language Toolkit

Using Python in computational linguistics

In this installment, David introduces you to the Natural Language Toolkit, a Python library for applying academic linguistic techniques to collections of textual data. Programming that goes by the name "text processing" is a start; NLTK's capabilities for syntactic and even semantic analysis go further, specializing in the study of natural languages.

David Mertz, Developer, Gnosis Software, Inc.

David Mertz had no idea he was writing prose this whole time. You can reach David at mertz@gnosis.cx; you can investigate all aspects of his life at his personal Web page. Check out his book, Text Processing in Python. Suggestions and recommendations on past or future columns are welcome.



24 June 2004

Your humble writer knows a little bit about a lot of things, but despite writing a fair amount about text processing (a book, for example), linguistic processing is a relatively novel area for me. Forgive me if I stumble through my explanations of the quite remarkable Natural Language Toolkit (NLTK), a wonderful tool for teaching, and working in, computational linguistics using Python. Computational linguistics, moreover, is closely related to the fields of artificial intelligence, language/speech recognition, translation, and grammar checking.

What NLTK includes

It is natural to think of NLTK as a stacked series of layers that build on each other. Readers familiar with lexing and parsing of artificial languages (like, say, Python) will not have too much of a leap to understand the similar -- but deeper -- layers involved in natural language modeling.

Glossary of terms

Corpora: Collections of related texts. For example, the works of Shakespeare might, collectively, be called a corpus; the works of several authors, corpora.

Histogram: The statistical distribution of the frequency of different words, letters, or other items within a data set.

Syntagmatic: The study of syntagma; namely, the statistical relations in the contiguous occurrence of letters, words, or phrases in corpora.

Context-free grammar: Type-2 in Noam Chomsky's hierarchy of the four types of formal grammars. See Resources for a thorough description.

While NLTK comes with a number of corpora that have been pre-processed (often manually) to various degrees, conceptually each layer relies on the processing in the adjacent lower layer. Tokenization comes first; then words are tagged; then groups of words are parsed into grammatical elements, like noun phrases or sentences (according to one of several techniques, each with advantages and drawbacks); and finally sentences or other grammatical units can be classified. Along the way, NLTK gives you the ability to generate statistics about occurrences of various elements, and draw graphs that represent either the processing itself, or statistical aggregates in results.

In this article, you'll see some relatively fleshed-out examples from the lower-level capabilities, but most of the higher-level capabilities will be simply described abstractly. Let's now take the first steps past text processing, narrowly construed.


Tokenization

Much of what you can do with NLTK, particularly at its lower levels, is not that much different from what you can do with Python's basic data structures. But NLTK provides a set of regularized interfaces that are relied on and utilized at higher levels, in addition to simply providing convenient classes to hold tokenized and tagged texts.

In particular, the class nltk.tokenizer.Token is used widely to store annotated segments of text; these annotations can mark a number of different features, including parts-of-speech, subtoken structures, offsets of a token within a larger text, morphological stems, grammatical sentence components, and so on. In fact, a Token is a special kind of dictionary -- and is accessed in the fashion of a dictionary -- so it can contain whatever keys you like. A few special keys are used in NLTK, different ones by the various subpackages.

Let's look briefly at creating a token and breaking it into subtokens:

Listing 1. A first look at the nltk.tokenizer.Token class
>>> from nltk.tokenizer import *
>>> t = Token(TEXT='This is my first test sentence')
>>> WSTokenizer().tokenize(t, addlocs=True) # break on whitespace
>>> print t['TEXT']
This is my first test sentence
>>> print t['SUBTOKENS']
[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
<test>@[17:21c], <sentence>@[22:30c]]
>>> t['foo'] = 'bar'
>>> t
<TEXT='This is my first test sentence', foo='bar',
SUBTOKENS=[<This>@[0:4c], <is>@[5:7c], <my>@[8:10c], <first>@[11:16c],
<test>@[17:21c], <sentence>@[22:30c]]>
>>> print t['SUBTOKENS'][0]
<This>@[0:4c]
>>> print type(t['SUBTOKENS'][0])
<class 'nltk.token.SafeToken'>

Probability

One fairly simple thing you are likely to do with linguistic corpora is analyze frequencies of various events within them, and make probability predictions based on these known frequencies. NLTK supports a variety of techniques for projecting probabilities based on raw frequency data. I will not cover those here (see the Probability tutorial listed in Resources), but suffice it to say that what you are warranted to expect has a slightly fuzzy relationship to what you already know (beyond the obvious scaling/normalization).

In essence, there are two types of frequency supported by NLTK: histograms and conditional frequencies. The class nltk.probability.FreqDist is used to create histograms; for example, a word histogram can be created with:

Listing 2. Basic histogram creation with nltk.probability.FreqDist
>>> from nltk.probability import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> freq = FreqDist()
>>> for word in article['SUBTOKENS']:
...     freq.inc(word['TEXT'])
>>> freq.B()
1194
>>> freq.count('Python')
12

The Probability tutorial discusses creation of histograms on more complex features, like "the length of words following words ending in vowels." The class nltk.draw.plot.Plot is useful for visualization of histograms. Of course, you can equally analyze frequencies of higher-level grammatical features, or even of data sets unrelated to NLTK.
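
As a rough sketch of that kind of analysis -- reusing only the calls shown in the listings above, assuming the tokenized article Token from Listing 2 is still in scope, and untested against the 1.4 API (the condition labels 'vowel-final' and 'other' are simply my own names) -- you might condition word length on whether the preceding word ends in a vowel:

>>> vowel_cf = ConditionalFreqDist()
>>> prev = None
>>> for word in article['SUBTOKENS']:
...     text = word['TEXT']
...     if prev is not None and prev[-1].lower() in 'aeiou':
...         vowel_cf['vowel-final'].inc(len(text))   # length, given vowel-final predecessor
...     elif prev is not None:
...         vowel_cf['other'].inc(len(text))
...     prev = text
...
>>> p4 = vowel_cf['vowel-final'].freq(4)   # share of 4-letter words in that context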

Conditional frequencies are perhaps more interesting than plain histograms. A conditional frequency is a kind of two-dimensional histogram -- it gives you one histogram per initial condition, or "context." For example, the tutorial suggests the question of what the distribution of word lengths is for each first letter. We can analyze that with:

Listing 3. Conditional frequencies: Word length per initial letter
>>> cf = ConditionalFreqDist()
>>> for word in article['SUBTOKENS']:
...     cf[word['TEXT'][0]].inc(len(word['TEXT']))
...
>>> init_letters = cf.conditions()
>>> init_letters.sort()
>>> for c in init_letters[44:50]:
...     print "Init %s:" % c,
...     for length in range(1,6):
...         print "len %d/%.2f," % (length,cf[c].freq(length)),
...     print
...
Init a: len 1/0.03, len 2/0.03, len 3/0.03, len 4/0.03, len 5/0.03,
Init b: len 1/0.12, len 2/0.12, len 3/0.12, len 4/0.12, len 5/0.12,
Init c: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init d: len 1/0.06, len 2/0.06, len 3/0.06, len 4/0.06, len 5/0.06,
Init e: len 1/0.18, len 2/0.18, len 3/0.18, len 4/0.18, len 5/0.18,
Init f: len 1/0.25, len 2/0.25, len 3/0.25, len 4/0.25, len 5/0.25,

A nice linguistic use of conditional frequencies is to analyze the syntagmatic distributions in corpora -- for example, given the occurrence of a particular word, what words are most likely to come next. Grammars provide some constraints here, of course; but the study of selection among syntactic options falls within the fields of semantics, pragmatics, and register.
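
A bare-bones sketch of that kind of bigram tally, again using only the FreqDist/ConditionalFreqDist methods shown earlier (and again assuming the article Token from Listing 2; untested against the 1.4 API):

>>> followers = ConditionalFreqDist()
>>> prev = None
>>> for word in article['SUBTOKENS']:
...     text = word['TEXT'].lower()
...     if prev is not None:
...         followers[prev].inc(text)   # tally the word that follows 'prev'
...     prev = text
...
>>> # find the most frequent follower of "the" by brute force
>>> best, best_count = None, 0
>>> for w in followers['the'].samples():
...     if followers['the'].count(w) > best_count:
...         best, best_count = w, followers['the'].count(w)
...
>>> print best, best_count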


Stemming

The class nltk.stemmer.porter.PorterStemmer is a wonderfully handy tool to derive grammatical (prefix) stems from English words. This capability struck a particular chord for me, having previously created a public-domain, full-text indexed search tool/library in Python (described in Developing a full-text indexer in Python, and used by a moderately large number of other projects).

While it is quite useful to be able to search a large collection of documents almost instantly for a joint occurrence of a collection of exact words (what gnosis.indexer does), for many searching purposes, a little fuzziness would help. Perhaps you are not quite sure whether the old e-mail you are looking for used the word "complicated," "complications," "complicating," or "complicates," but you remember that was one of the general concepts involved (probably with a few others to perform a useful search).

NLTK includes an excellent algorithm for word stemming, and lets you customize stemming algorithms further to your liking:

Listing 4. Stemming words for morphological roots
>>> from nltk.stemmer.porter import PorterStemmer
>>> PorterStemmer().stem_word('complications')
'complic'

Exactly how you might utilize stemming within gnosis.indexer, a tool derived from it, or a wholly different indexing tool, depends on your usage scenarios. Fortunately, gnosis.indexer has an open interface that is easy to specialize. Do you want an index composed entirely of stems? Or do you want to include both full words and stems in the index? Are matches on exact words to be ranked higher than matches on stems? Do you want to separate stem matches from exact matches in your results? I will include some sort of stemming capability in future versions of gnosis.indexer, but end users might still want to customize differently.

In general, however, adding stemming is very simple: First, derive stems from a document by specializing gnosis.indexer.TextSplitter; second, when you perform a search, (optionally) stem the search terms before using them for index lookup, probably by customizing your MyIndexer.find() method.
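
As a sketch of those two steps (the subclass below is hypothetical and untested; the only interfaces relied on from this article are TextSplitter.text_splitter() from Listing 6 and PorterStemmer().stem_word() from Listing 4):

>>> from gnosis.indexer import TextSplitter
>>> from nltk.stemmer.porter import PorterStemmer
>>> class StemmingSplitter(TextSplitter):
...     "Step one: index stems rather than exact words"
...     def text_splitter(self, text):
...         words = TextSplitter.text_splitter(self, text)
...         stem = PorterStemmer().stem_word
...         return [stem(w.lower()) for w in words]
...
>>> # Step two (also hypothetical): in MyIndexer.find(), stem the search
>>> # terms the same way before looking them up in the index, e.g.:
>>> # terms = [PorterStemmer().stem_word(t.lower()) for t in terms]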

I made the discovery, while playing with the PorterStemmer, that the class nltk.tokenizer.WSTokenizer really is as bad as NLTK's tutorial warns. It is fine as a conceptual placeholder, but for real-life texts, you can do a lot better at identifying what a "word" is. Fortunately, gnosis.indexer.TextSplitter is a robust tokenizer. For example:

Listing 5. Stemming based on poor NLTK tokenization
>>> from nltk.tokenizer import *
>>> article = Token(TEXT=open('cp-b17.txt').read())
>>> WSTokenizer().tokenize(article)
>>> from nltk.probability import *
>>> from nltk.stemmer.porter import *
>>> stemmer = PorterStemmer()
>>> stems = FreqDist()
>>> for word in article['SUBTOKENS']:
...     stemmer.stem(word)
...     stems.inc(word['STEM'].lower())
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[20:40]
['"generator-bas', '"implement', '"lazili', '"magic"', '"partial',
'"pluggable"', '"primitives"', '"repres', '"secur', '"semi-coroutines."',
'"state', '"understand', '"weightless', '"whatev', '#', '#-----',
'#----------', '#-------------', '#---------------', '#b17:']

Looking at a few stems, the collection does not look all that useful for indexing. Many are not really words at all, while others are compound phrases with dashes, and extraneous punctuation makes it in with the words. Let's try it with better tokenization:

Listing 6. Stemming using clever heuristics in tokenization
>>> from gnosis.indexer import TextSplitter as TS
>>> article = TS().text_splitter(open('cp-b17.txt').read())
>>> stems = FreqDist()
>>> for word in article:
...     stems.inc(stemmer.stem_word(word.lower()))
...
>>> word_stems = stems.samples()
>>> word_stems.sort()
>>> word_stems[60:80]
['bool', 'both', 'boundari', 'brain', 'bring', 'built', 'but', 'byte',
'call', 'can', 'cannot', 'capabl', 'capit', 'carri', 'case', 'cast',
'certain', 'certainli', 'chang', 'charm']

Here you can see stems that each have several possible expansions, and everything in the list looks like a word or a morpheme. Tokenization matters a lot for random text collections; in fairness to NLTK, its bundled corpora have been packaged for easy and accurate tokenization with WSTokenizer(). For a robust real-world indexer, use robust tokenization.


Tagging, chunking, and parsing

The largest part of NLTK consists of various parsers, of varying levels of sophistication. For the most part, this introduction will not explain their details, but I would like to give a first brush at what they hope to achieve.

As background, remember that tokens are special dictionaries -- in particular, ones that can contain a key TAG to indicate the grammatical role of a word. NLTK corpora documents often come pre-tagged for parts of speech, but you can certainly add your own tags to untagged documents.
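
Since a Token is just a (special) dictionary, nothing stops you from assigning TAG values yourself. A toy sketch (the lookup table of tags is invented purely for illustration, and I have not verified this against the 1.4 API):

>>> from nltk.tokenizer import *
>>> s = Token(TEXT='the cat sat')
>>> WSTokenizer().tokenize(s)
>>> my_tags = {'the': 'DT', 'cat': 'NN', 'sat': 'VBD'}   # toy hand-made tagset
>>> for word in s['SUBTOKENS']:
...     word['TAG'] = my_tags.get(word['TEXT'], 'NN')    # default everything else to noun
...
>>> s['SUBTOKENS'][0]['TAG']
'DT'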

Chunking is something like "parsing lite." That is, chunking is based either on existing markup of grammatical components, or is something you add manually -- or semi-automatically using regular expressions and program logic. But it is not really parsing, properly speaking (no production rules as such). For example:

Listing 7. Chunk parsing/tagging: words and bigger bits
>>> from nltk.parser.chunk import ChunkedTaggedTokenizer
>>> chunked = "[ the/DT little/JJ cat/NN ] sat/VBD on/IN [ the/DT mat/NN ]"
>>> sentence = Token(TEXT=chunked)
>>> tokenizer = ChunkedTaggedTokenizer(chunk_node='NP')
>>> tokenizer.tokenize(sentence)
>>> sentence['SUBTOKENS'][0]
(NP: <the/DT> <little/JJ> <cat/NN>)
>>> sentence['SUBTOKENS'][0]['NODE']
'NP'
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]
<the/DT>
>>> sentence['SUBTOKENS'][0]['CHILDREN'][0]['TAG']
'DT'
>>> chunk_structure = TreeToken(NODE='S', CHILDREN=sentence['SUBTOKENS'])
>>> print chunk_structure
(S:
  (NP: <the/DT> <little/JJ> <cat/NN>)
  <sat/VBD>
  <on/IN>
  (NP: <the/DT> <mat/NN>))

Chunking, as mentioned, can be done with the class nltk.tokenizer.RegexpChunkParser, using pseudo-regular-expressions to describe the series of tags that make up a grammatical element. Here's an example from the tutorials:

Listing 8. Chunking with regular expressions on tags
>>> rule1 = ChunkRule('<DT>?<JJ.*>*<NN.*>',
...               'Chunk optional det, zero or more adj, and a noun')
>>> chunkparser = RegexpChunkParser([rule1], chunk_node='NP', top_node='S')
>>> chunkparser.parse(sentence)
>>> print sentence['TREE']
(S: (NP: <the/DT> <little/JJ> <cat/NN>)
  <sat/VBD> <on/IN>
  (NP: <the/DT> <mat/NN>))

True parsing gets us into a lot of theoretical areas. For example, top-down parsers are guaranteed to find every possible production, but can be extremely slow because of frequent (exponential-order) backtracking. Shift-reduce parsing is much more efficient, but can miss some productions. In either case, a grammar is declared in a manner similar to those created to parse artificial languages. This column has looked at some of those: SimpleParse, mx.TextTools, Spark, and gnosis.xml.validity (see Resources).

Beyond top-down and shift-reduce parsers, NLTK also offers "chart parsers," which create partial hypotheses about how a given sequence can be continued to fulfill a rule. This approach can be both efficient and complete. A quick (toy) example illustrates:

Listing 9. Defining basic productions for a context-free grammar
>>> from nltk.parser.chart import *
>>> grammar = CFG.parse('''
...    S -> NP VP
...    VP -> V NP | VP PP
...    V -> "saw" | "ate"
...    NP -> "John" | "Mary" | "Bob" | Det N | NP PP
...    Det -> "a" | "an" | "the" | "my"
...    N -> "dog" | "cat" | "cookie"
...    PP -> P NP
...    P -> "on" | "by" | "with"
...    ''')
>>> sentence = Token(TEXT='John saw a cat with my cookie')
>>> WSTokenizer().tokenize(sentence)
>>> parser = ChartParser(grammar, BU_STRATEGY, LEAF='TEXT')
>>> parser.parse_n(sentence)
>>> for tree in sentence['TREES']: print tree
(S:
  (NP: <John>)
  (VP:
    (VP: (V: <saw>) (NP: (Det: <a>) (N: <cat>)))
    (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>)))))
(S:
  (NP: <John>)
  (VP:
    (V: <saw>)
    (NP:
      (NP: (Det: <a>) (N: <cat>))
      (PP: (P: <with>) (NP: (Det: <my>) (N: <cookie>))))))

A probabilistic context-free grammar (or PCFG) is a context-free grammar that associates a probability with each of its productions. Parsers for probabilistic parsing are likewise bundled with NLTK.
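
The 1.4-era API for probabilistic parsing is not shown in this article, so here is only a rough sketch of the idea using the present-day nltk package (whose class names and module layout differ from the toolkit described above): each production carries a bracketed probability, and a Viterbi parser returns the single most probable tree.

>>> import nltk   # the modern nltk distribution, not the 1.4-era toolkit used above
>>> grammar = nltk.PCFG.fromstring('''
...     S -> NP VP [1.0]
...     VP -> V NP [0.7] | VP PP [0.3]
...     V -> "saw" [0.6] | "ate" [0.4]
...     NP -> "John" [0.3] | Det N [0.5] | NP PP [0.2]
...     Det -> "a" [0.6] | "my" [0.4]
...     N -> "dog" [0.5] | "cat" [0.5]
...     PP -> P NP [1.0]
...     P -> "with" [0.5] | "on" [0.5]
...     ''')
>>> parser = nltk.ViterbiParser(grammar)
>>> for tree in parser.parse('John saw my cat with a dog'.split()):
...     print(tree)   # the most probable parse, annotated with its probability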


What are you waiting for?

NLTK has other important features that this brief introduction could not get to. For example, NLTK has a whole framework for text classification using statistical techniques like "naive Bayesian" and "maximum entropy" models. Heady stuff that I cannot yet explain, even if I had space. But I think even NLTK's lower layers make it a useful framework for both pedagogical and practical applications.

Resources

  • The Natural Language Toolkit (NLTK) is hosted on SourceForge, where you will find its home page, documentation, downloads, and various other resources. The documentation for NLTK is accessible through the project home page, including API reference guides to several versions of the library. At the time of this writing, 1.3 was stable and 1.4 was in alpha; by the time you read this, later versions may exist.
  • Of particular use to the new user of NLTK -- including the author, as he wrote this article -- is the series of nine tutorials on NLTK. As of this writing, they generally cover respective subpackages of NLTK; David particularly enjoyed the Probability tutorial on modelling probabilistic systems. Three supplementary tutorials introduce Python more generally for linguistics students who may not already know the language (or for other folks). These tutorials are helpful and well-written, but occasional details seem not to match the most current API version.
  • An earlier Charming Python column covered Developing a full-text indexer in Python and presented the tool gnosis.indexer.
  • Other Charming Python installments have looked at parsers for artificial languages, including SimpleParse, mx.TextTools, Spark, and gnosis.xml.validity.
  • Browse all of the Charming Python columns on developerWorks.
  • David's Text Processing in Python (Addison Wesley, 2003) introduces regular expressions, formal parsers, text manipulation, and even state machines. It is a good start for working your way up to computational linguistics. Moreover, David was delighted to discover, while researching this article, that one utility function from his book was incorporated into the NLTK package.
  • Learn more about computational linguistics from the Computational Linguistics FAQ and the quarterly ACL journal Computational Linguistics.
  • Wikipedia, the free encyclopedia, offers an overview of Computational linguistics and context-free grammars with, as always, good links to more information about both.
  • Context-free (and other formal) grammars are pretty complex. See also Context-Free Grammars and Parsing from the Eli Project, as well as the Wikipedia article on the Chomsky hierarchy.
