Python Text Processing with NLTK 2.0 Cookbook

Use Python’s NLTK suite of libraries to maximize your Natural Language Processing capabilities.

  • Quickly get to grips with Natural Language Processing – with Text Analysis, Text Mining, and beyond
  • Learn how machines and crawlers interpret and process natural languages
  • Easily work with huge amounts of data and learn how to handle distributed processing
  • Part of Packt’s Cookbook series: Each recipe is a carefully organized sequence of instructions to complete the task as efficiently as possible

Introduction

This article will show you how to do various transforms on both chunks and trees. The chunk transforms are for grammatical correction and rearranging phrases without loss of meaning. The tree transforms give you ways to modify and flatten deep parse trees.

The functions detailed in these recipes modify data, as opposed to learning from it. That means it’s not safe to apply them indiscriminately. A thorough knowledge of the data you want to transform, along with a few experiments, should help you decide which functions to apply and when.

Whenever the term chunk is used in this article, it could refer to an actual chunk extracted by a chunker, or it could simply refer to a short phrase or sentence in the form of a list of tagged words. What’s important in this article is what you can do with a chunk, not where it came from.

Filtering insignificant words

Many of the most commonly used words are insignificant when it comes to discerning the meaning of a phrase. For example, in the phrase “the movie was terrible”, the most significant words are “movie” and “terrible”, while “the” and “was” are almost useless. You could get the same meaning if you took them out, such as “movie terrible” or “terrible movie”. Either way, the sentiment is the same. In this recipe, we’ll learn how to remove the insignificant words, and keep the significant ones, by looking at their part-of-speech tags.

Getting ready

First, we need to decide which part-of-speech tags are significant and which are not. Looking through the treebank corpus for stopwords yields the following table of insignificant words and tags:

 

Word    Tag
a       DT
all     PDT
an      DT
and     CC
or      CC
that    WDT
the     DT

Other than CC, all the tags end with DT. This means we can filter out insignificant words by looking at the tag’s suffix.
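
For example, Python’s str.endswith() confirms the pattern:

>>> 'PDT'.endswith('DT')
True
>>> 'CC'.endswith('DT')
False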

How to do it…

In transforms.py there is a function called filter_insignificant(). It takes a single chunk, which should be a list of tagged words, and returns a new chunk without any insignificant tagged words. It defaults to filtering out any tags that end with DT or CC.

def filter_insignificant(chunk, tag_suffixes=['DT', 'CC']):
    good = []

    for word, tag in chunk:
        ok = True

        for suffix in tag_suffixes:
            if tag.endswith(suffix):
                ok = False
                break

        if ok:
            good.append((word, tag))

    return good


Now we can use it on the part-of-speech tagged version of “the terrible movie”.

>>> from transforms import filter_insignificant
>>> filter_insignificant([('the', 'DT'), ('terrible', 'JJ'), ('movie', 'NN')])
[('terrible', 'JJ'), ('movie', 'NN')]


As you can see, the word “the” is eliminated from the chunk.

How it works…

filter_insignificant() iterates over the tagged words in the chunk. For each word, it checks whether the tag ends with any of the tag_suffixes. If it does, the tagged word is skipped. If the tag passes, the tagged word is appended to a new good chunk, which is returned.
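
For reference, the same logic can be stated more compactly as a list comprehension (an equivalent restatement for illustration, not the book’s version):

def filter_insignificant_compact(chunk, tag_suffixes=['DT', 'CC']):
    # keep only tagged words whose tag matches none of the suffixes
    return [(word, tag) for (word, tag) in chunk
            if not any(tag.endswith(suffix) for suffix in tag_suffixes)]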

There’s more…

The way filter_insignificant() is defined, you can pass in your own tag suffixes if DT and CC are not enough, or are incorrect for your case. For example, you might decide that possessive words and pronouns such as “you”, “your”, “their”, and “theirs” are useless, while DT and CC words are fine. The tag suffixes would then be PRP and PRP$. Here is an example:

>>> filter_insignificant([('your', 'PRP$'), ('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')], tag_suffixes=['PRP', 'PRP$'])
[('book', 'NN'), ('is', 'VBZ'), ('great', 'JJ')]


Filtering insignificant words can be a good complement to stopword filtering for purposes such as search engine indexing, querying, and text classification.
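
For instance, here is a minimal sketch of combining the two (assuming the NLTK stopwords corpus has been downloaded; filter_stops_then_tags is our own illustrative name, not part of the recipe):

from nltk.corpus import stopwords
from transforms import filter_insignificant

english_stops = set(stopwords.words('english'))

def filter_stops_then_tags(chunk, tag_suffixes=['DT', 'CC']):
    # drop known stopwords first, then remove whatever is left by tag suffix
    no_stops = [(word, tag) for (word, tag) in chunk if word.lower() not in english_stops]
    return filter_insignificant(no_stops, tag_suffixes)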

Correcting verb forms

It’s fairly common to find incorrect verb forms in real-world language. For example, the correct form of “is our children learning?” is “are our children learning?”. The verb “is” should only be used with singular nouns, while “are” is for plural nouns, such as “children”. We can correct these mistakes by creating verb correction mappings that are used depending on whether there’s a plural or singular noun in the chunk.

Getting ready

We first need to define the verb correction mappings in transforms.py. We’ll create two mappings, one for plural to singular, and another for singular to plural.

plural_verb_forms = {
    ('is', 'VBZ'): ('are', 'VBP'),
    ('was', 'VBD'): ('were', 'VBD')
}

singular_verb_forms = {
    ('are', 'VBP'): ('is', 'VBZ'),
    ('were', 'VBD'): ('was', 'VBD')
}


Each mapping has a tagged verb that maps to another tagged verb. These initial mappings cover the basics: is to are, was to were, and vice versa.
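
If your data needs more coverage, you can extend the dictionaries in the same shape. For example (these extra entries are our own suggestion, not part of the original recipe):

plural_verb_forms[('has', 'VBZ')] = ('have', 'VBP')    # assumed extension
singular_verb_forms[('have', 'VBP')] = ('has', 'VBZ')  # assumed extension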

How to do it…

In transforms.py there is a function called correct_verbs(). Pass it a chunk with incorrect verb forms, and you’ll get a corrected chunk back. It uses a helper function first_chunk_index() to search the chunk for the position of the first tagged word where pred returns True.

def first_chunk_index(chunk, pred, start=0, step=1):
    l = len(chunk)
    end = l if step > 0 else -1

    for i in range(start, end, step):
        if pred(chunk[i]):
            return i

    return None

def correct_verbs(chunk):
    vbidx = first_chunk_index(chunk, lambda (word, tag): tag.startswith('VB'))
    # if no verb found, do nothing
    if vbidx is None:
        return chunk

    verb, vbtag = chunk[vbidx]
    nnpred = lambda (word, tag): tag.startswith('NN')
    # find nearest noun to the right of verb
    nnidx = first_chunk_index(chunk, nnpred, start=vbidx+1)
    # if no noun found to right, look to the left
    if nnidx is None:
        nnidx = first_chunk_index(chunk, nnpred, start=vbidx-1, step=-1)
    # if no noun found, do nothing
    if nnidx is None:
        return chunk

    noun, nntag = chunk[nnidx]
    # get correct verb form and insert into chunk
    if nntag.endswith('S'):
        chunk[vbidx] = plural_verb_forms.get((verb, vbtag), (verb, vbtag))
    else:
        chunk[vbidx] = singular_verb_forms.get((verb, vbtag), (verb, vbtag))

    return chunk
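
Note that the tuple-unpacking lambdas above, such as lambda (word, tag): ..., are Python 2 syntax, in keeping with this book’s NLTK 2.0 era; Python 3 removed tuple parameter unpacking (PEP 3113). If you are adapting the recipe to Python 3, a minimal sketch of equivalent predicates indexes into the tagged word instead:

# Python 3 equivalents of the predicates used above
vb_pred = lambda wt: wt[1].startswith('VB')
nn_pred = lambda wt: wt[1].startswith('NN')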


When we call it on a part-of-speech tagged “is our children learning” chunk, we get back the correct form, “are our children learning”.

>>> from transforms import correct_verbs
>>> correct_verbs([('is', 'VBZ'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')])
[('are', 'VBP'), ('our', 'PRP$'), ('children', 'NNS'), ('learning', 'VBG')]


We can also try this with a singular noun and an incorrect plural verb.

>>> correct_verbs([('our', 'PRP$'), ('child', 'NN'), ('were', 'VBD'), ('learning', 'VBG')])
[('our', 'PRP$'), ('child', 'NN'), ('was', 'VBD'), ('learning', 'VBG')]


In this case, “were” becomes “was” because “child” is a singular noun.

How it works…

The correct_verbs() function starts by looking for a verb in the chunk. If no verb is found, the chunk is returned with no changes. Once a verb is found, we keep the verb, its tag, and its index in the chunk. Then we look on either side of the verb to find the nearest noun, starting on the right, and only looking to the left if no noun is found on the right. If no noun is found at all, the chunk is returned as is. But if a noun is found, then we look up the correct verb form depending on whether the noun is plural.

Plural nouns are tagged with NNS, while singular nouns are tagged with NN. This means we can check the plurality of a noun by seeing if its tag ends with S. Once we get the corrected verb form, it is inserted into the chunk to replace the original verb form.
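
Note that plural proper nouns are tagged NNPS, which also ends with S, so the same suffix check covers them:

>>> 'NNS'.endswith('S'), 'NNPS'.endswith('S'), 'NN'.endswith('S')
(True, True, False)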

To make searching through the chunk easier, we define a function called first_chunk_index(). It takes a chunk, a lambda predicate, the starting index, and a step increment. The predicate function is called with each tagged word until it returns True. If it never returns True, then None is returned. The starting index defaults to zero and the step increment to one. As you’ll see in upcoming recipes, we can search backwards by overriding start and setting step to -1. This small utility function will be a key part of subsequent transform functions.
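
For example, here is a quick sketch of searching backwards for the nearest noun to the left of the verb in “the book was great”:

>>> from transforms import first_chunk_index
>>> chunk = [('the', 'DT'), ('book', 'NN'), ('was', 'VBD'), ('great', 'JJ')]
>>> first_chunk_index(chunk, lambda (word, tag): tag.startswith('NN'), start=2, step=-1)
1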

Swapping verb phrases

Swapping the words around a verb can eliminate the passive voice from particular phrases. For example, “the book was great” can be transformed into “the great book”.

How to do it…

In transforms.py there is a function called swap_verb_phrase(). It swaps the right-hand side of the chunk with the left-hand side, using the verb as the pivot point. It uses the first_chunk_index() function defined in the previous recipe to find the verb to pivot around.

def swap_verb_phrase(chunk):
    # find location of verb
    vbpred = lambda (word, tag): tag != 'VBG' and tag.startswith('VB') and len(tag) > 2
    vbidx = first_chunk_index(chunk, vbpred)

    if vbidx is None:
        return chunk

    return chunk[vbidx+1:] + chunk[:vbidx]


Now we can see how it works on the part-of-speech tagged phrase “the book was great”.

>>> from transforms import swap_verb_phrase
>>> swap_verb_phrase([('the', 'DT'), ('book', 'NN'), ('was', 'VBD'), ('great', 'JJ')])
[('great', 'JJ'), ('the', 'DT'), ('book', 'NN')]


The result is “great the book”. This phrase clearly isn’t grammatically correct, so read on to learn how to fix it.

How it works…

Using first_chunk_index() from the previous recipe, we start by finding the first verb that is not a gerund (an “-ing” word, tagged VBG). Once we’ve found the verb, we return the chunk with everything to the right of the verb placed before everything to its left, dropping the verb itself.

The reason we don’t want to pivot around a gerund is that gerunds are commonly used to describe nouns, and pivoting around one would remove that description. Here’s an example where you can see how not pivoting around a gerund is a good thing:

>>> swap_verb_phrase([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')])
[('fantastic', 'JJ'), ('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN')]


If we had pivoted around the gerund, the result would be “book is fantastic this”, and we’d lose the gerund “gripping”.

There’s more…

Filtering insignificant words makes the final result more readable. By filtering either before or after swap_verb_phrase(), we get “fantastic gripping book” instead of “fantastic this gripping book”.

>>> from transforms import swap_verb_phrase, filter_insignificant
>>> swap_verb_phrase(filter_insignificant([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')]))
[('fantastic', 'JJ'), ('gripping', 'VBG'), ('book', 'NN')]
>>> filter_insignificant(swap_verb_phrase([('this', 'DT'), ('gripping', 'VBG'), ('book', 'NN'), ('is', 'VBZ'), ('fantastic', 'JJ')]))
[('fantastic', 'JJ'), ('gripping', 'VBG'), ('book', 'NN')]


Either way, we get a shorter grammatical chunk with no loss of meaning.
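
To tie the recipes together, here is a minimal sketch of a combined pipeline (transform_chunk is our own illustrative name, not from the book) that corrects the verb, pivots around it, and then drops filler words:

from transforms import correct_verbs, swap_verb_phrase, filter_insignificant

def transform_chunk(chunk):
    # correct the verb form, swap around the verb, then filter insignificant words
    return filter_insignificant(swap_verb_phrase(correct_verbs(chunk)))

Applied to the tagged chunk for “the book is great”, this returns [('great', 'JJ'), ('book', 'NN')], or “great book”.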
