Python text processing with NLTK 2.0: creating custom corpora

In this article, we'll cover how to use corpus readers and create custom corpora. At the same time, you'll learn how to use the existing corpus data that comes with NLTK. We'll also cover creating custom corpus readers, which can be used when your corpus is not in a file format that NLTK already recognizes, or if your corpus is not in files at all, but instead is located in a database such as MongoDB.

Setting up a custom corpus

A corpus is a collection of text documents, and corpora is the plural of corpus. So a custom corpus is really just a bunch of text files in a directory, often alongside many other directories of text files.

Getting ready

You should already have the NLTK data package installed, following the instructions at http://www.nltk.org/data. We'll assume that the data is installed to C:nltk_data on Windows, and /usr/share/nltk_data on Linux, Unix, or Mac OS X.

How to do it...

NLTK defines a list of data directories, or paths, in nltk.data.path. Our custom corpora must be within one of these paths so it can be found by NLTK. So as not to conflict with the official data package, we'll create a custom nltk_data directory in our home directory. Here's some Python code to create this directory and verify that it is in the list of known paths specified by nltk.data.path:

>>> import os, os.path
>>> path = os.path.expanduser('~/nltk_data')
>>> if not os.path.exists(path):
... os.mkdir(path)
>>> os.path.exists(path)
True
>>> import nltk.data
>>> path in nltk.data.path
True

If the last line, path in nltk.data.path, is True, then you should now have a nltk_ data directory in your home directory. The path should be %UserProfile%nltk_data on Windows, or ~/nltk_data on Unix, Linux, or Mac OS X. For simplicity, I'll refer to the directory as ~/nltk_data.

If the last line does not return True, try creating the nltk_data directory manually in your home directory, then verify that the absolute path is in nltk.data.path. It's essential to ensure that this directory exists and is in nltk.data.path before continuing. Once you have your nltk_data directory, the convention is that corpora reside in a corpora subdirectory. Create this corpora directory within the nltk_data directory, so that the path is ~/nltk_data/corpora. Finally, we'll create a subdirectory in corpora to hold our custom corpus. Let's call it cookbook, giving us the full path of ~/nltk_data/corpora/cookbook.

Now we can create a simple word list file and make sure it loads. The source code for this article can be downloaded here. Consider a word list file called mywords.txt. Put this file into ~/nltk_data/corpora/cookbook/. Now we can use nltk.data.load() to load the file.

>>> import nltk.data
>>> nltk.data.load('corpora/cookbook/mywords.txt', format='raw')
'nltkn'

We need to specify format='raw' since nltk.data.load() doesn't know how to interpret .txt files. As we'll see, it does know how to interpret a number of other file formats.

How it works...

The nltk.data.load() function recognizes a number of formats, such as 'raw', 'pickle', and 'yaml'. If no format is specified, then it tries to guess the format based on the file's extension. In the previous case, we have a .txt file, which is not a recognized extension, so we have to specify the 'raw' format. But if we used a file that ended in .yaml, then we would not need to specify the format.

Filenames passed in to nltk.data.load() can be absolute or relative paths. Relative paths must be relative to one of the paths specified in nltk.data.path. The file is found using nltk.data.find(path), which searches all known paths combined with the relative path. Absolute paths do not require a search, and are used as is.

There's more...

For most corpora access, you won't actually need to use nltk.data.load, as that will be handled by the CorpusReader classes covered in the following recipes. But it's a good function to be familiar with for loading .pickle files and .yaml files, plus it introduces the idea of putting all of your data files into a path known by NLTK.

Loading a YAML file

If you put the synonyms.yaml file into ~/nltk_data/corpora/cookbook (next to mywords.txt), you can use nltk.data. load() to load it without specifying a format.

>>> import nltk.data
>>> nltk.data.load('corpora/cookbook/synonyms.yaml')
{'bday': 'birthday'}

This assumes that PyYAML is installed. If not, you can find download and installation instructions at http://pyyaml.org/wiki/PyYAML.

Creating a word list corpus

The WordListCorpusReader is one of the simplest CorpusReader classes. It provides access to a file containing a list of words, one word per line.

Getting ready

We need to start by creating a word list file. This could be a single column CSV file, or just a normal text file with one word per line. Let's create a file named wordlist that looks like this:

nltk

corpus

corpora

wordnet

How to do it...

Now we can instantiate a WordListCorpusReader that will produce a list of words from our file. It takes two arguments: the directory path containing the files, and a list of filenames. If you open the Python console in the same directory as the files, then '.' can be used as the directory path. Otherwise, you must use a directory path such as: 'nltk_data/corpora/ cookbook'.

>>> from nltk.corpus.reader import WordListCorpusReader
>>> reader = WordListCorpusReader('.', ['wordlist'])
>>> reader.words()
['nltk', 'corpus', 'corpora', 'wordnet']
>>> reader.fileids()
['wordlist']

How it works...

WordListCorpusReader inherits from CorpusReader, which is a common base class for all corpus readers. CorpusReader does all the work of identifying which files to read, while WordListCorpus reads the files and tokenizes each line to produce a list of words. Here's an inheritance diagram:

python-text-processing-nltk-20-creating-custom-corpora-img-0

When you call the words() function, it calls nltk.tokenize.line_tokenize() on the raw file data, which you can access using the raw() function.

>>> reader.raw()
'nltkncorpusncorporanwordnetn'
>>> from nltk.tokenize import line_tokenize
>>> line_tokenize(reader.raw())
['nltk', 'corpus', 'corpora', 'wordnet']

There's more...

The stopwords corpus is a good example of a multi-file WordListCorpusReader.

Names corpus

Another word list corpus that comes with NLTK is the names corpus. It contains two files: female.txt and male.txt, each containing a list of a few thousand common first names organized by gender.

>>> from nltk.corpus import names
>>> names.fileids()
['female.txt', 'male.txt']
>>> len(names.words('female.txt'))
5001
>>> len(names.words('male.txt'))
2943

English words

NLTK also comes with a large list of English words. There's one file with 850 basic words, and another list with over 200,000 known English words.

>>> from nltk.corpus import words
>>> words.fileids()
['en', 'en-basic']
>>> len(words.words('en-basic'))
850
>>> len(words.words('en'))
234936

Creating a part-of-speech tagged word corpus

Part-of-speech tagging is the process of identifying the part-of-speech tag for a word. Most of the time, a tagger must first be trained on a training corpus. Let us take a look at how to create and use a training corpus of part-of-speech tagged words.

Getting ready

The simplest format for a tagged corpus is of the form "word/tag". Following is an excerpt from the brown corpus:

The/at-tl expense/nn and/cc time/nn involved/vbn are/ber astronomical/ jj ./.

Each word has a tag denoting its part-of-speech. For example, nn refers to a noun, while a tag that starts with vb is a verb.

How to do it...

If you were to put the previous excerpt into a file called brown.pos, you could then create a TaggedCorpusReader and do the following:

>>> from nltk.corpus.reader import TaggedCorpusReader
>>> reader = TaggedCorpusReader('.', r'.*.pos')
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]
>>> reader.tagged_words()
[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), …]
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical',
'.']]
>>> reader.tagged_sents()
[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'),
('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.',
'.')]]
>>> reader.paras()
[[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical',
'.']]]
>>> reader.tagged_paras()
[[[('The', 'AT-TL'), ('expense', 'NN'), ('and', 'CC'), ('time', 'NN'),
('involved', 'VBN'), ('are', 'BER'), ('astronomical', 'JJ'), ('.',
'.')]]]

How it works...

This time, instead of naming the file explicitly, we use a regular expression, r'.*.pos', to match all files whose name ends with .pos. We could have done the same thing as we did with the WordListCorpusReader, and pass ['brown.pos'] as the second argument, but this way you can see how to include multiple files in a corpus without naming each one explicitly.

TaggedCorpusReader provides a number of methods for extracting text from a corpus. First, you can get a list of all words, or a list of tagged tokens. A tagged token is simply a tuple of (word, tag). Next, you can get a list of every sentence, and also every tagged sentence, where the sentence is itself a list of words or tagged tokens. Finally, you can get a list of paragraphs, where each paragraph is a list of sentences, and each sentence is a list of words or tagged tokens. Here's an inheritance diagram listing all the major methods:

python-text-processing-nltk-20-creating-custom-corpora-img-1

There's more...

The functions demonstrated in the previous diagram all depend on tokenizers for splitting the text. TaggedCorpusReader tries to have good defaults, but you can customize them by passing in your own tokenizers at initialization time.

Customizing the word tokenizer

The default word tokenizer is an instance of nltk.tokenize.WhitespaceTokenizer. If you want to use a different tokenizer, you can pass that in as word_tokenizer.

>>> from nltk.tokenize import SpaceTokenizer
>>> reader = TaggedCorpusReader('.', r'.*.pos', word_
tokenizer=SpaceTokenizer())
>>> reader.words()
['The', 'expense', 'and', 'time', 'involved', 'are', ...]

Customizing the sentence tokenizer

The default sentence tokenizer is an instance of nltk.tokenize.RegexpTokenize with 'n' to identify the gaps. It assumes that each sentence is on a line all by itself, and individual sentences do not have line breaks. To customize this, you can pass in your own tokenizer as sent_tokenizer.

>>> from nltk.tokenize import LineTokenizer
>>> reader = TaggedCorpusReader('.', r'.*.pos', sent_
tokenizer=LineTokenizer())
>>> reader.sents()
[['The', 'expense', 'and', 'time', 'involved', 'are', 'astronomical',
'.']]

Customizing the paragraph block reader

Paragraphs are assumed to be split by blank lines. This is done with the default para_ block_reader, which is nltk.corpus.reader.util.read_blankline_block. There are a number of other block reader functions in nltk.corpus.reader.util, whose purpose is to read blocks of text from a stream. Their usage will be covered in more detail in the later recipe, Creating a custom corpus view, where we'll create a custom corpus reader.

Customizing the tag separator

If you don't want to use '/' as the word/tag separator, you can pass an alternative string to TaggedCorpusReader for sep. The default is sep='/', but if you want to split words and tags with '|', such as 'word|tag', then you should pass in sep='|'.

Simplifying tags with a tag mapping function

If you'd like to somehow transform the part-of-speech tags, you can pass in a tag_mapping_ function at initialization, then call one of the tagged_* functions with simplify_ tags=True. Here's an example where we lowercase each tag:

>>> reader = TaggedCorpusReader('.', r'.*.pos', tag_mapping_
function=lambda t: t.lower())
>>> reader.tagged_words(simplify_tags=True)
[('The', 'at-tl'), ('expense', 'nn'), ('and', 'cc'), …]

Calling tagged_words() without simplify_tags=True would produce the same result as if you did not pass in a tag_mapping_function.

There are also a number of tag simplification functions defined in nltk.tag.simplify. These can be useful for reducing the number of different part-of-speech tags.

>>> from nltk.tag import simplify
>>> reader = TaggedCorpusReader('.', r'.*.pos', tag_mapping_
function=simplify.simplify_brown_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'DET'), ('expense', 'N'), ('and', 'CNJ'), ...]
>>> reader = TaggedCorpusReader('.', r'.*.pos', tag_mapping_
function=simplify.simplify_tag)
>>> reader.tagged_words(simplify_tags=True)
[('The', 'A'), ('expense', 'N'), ('and', 'C'), ...]