
Note: This article is an excerpt from a book written by Armando Fandango titled Python Data Analysis – Second Edition. This book will help you learn to apply powerful data analysis techniques with popular open source Python modules. The code bundle for this article is hosted on GitHub.

In this book excerpt, we will talk about various ways of performing text analytics using the NLTK Library. Natural Language Toolkit (NLTK) is one of the main libraries used for text analysis in Python. It comes with a collection of sample texts called corpora.

Let’s install the libraries required in this article with the following command:

$ pip3 install nltk scikit-learn

NLTK is a Python API for the analysis of texts written in natural languages, such as English. NLTK was created in 2001 and was originally intended as a teaching tool.

Although we installed NLTK in the previous section, we are not done yet; we still need to download the NLTK corpora. The download is relatively large (about 1.8 GB); however, we only have to download it once. Unless you know exactly which corpora you require, it’s best to download all the available corpora. Download the corpora from the Python shell as follows:

$ python3

>>> import nltk

>>> nltk.download()

A GUI application should appear, where you can specify a destination and what file to download.

Analyzing textual data in NLTK

If you are new to NLTK, it’s most convenient to choose the default option and download everything. In this article, we will need the stopwords, movie reviews, names, and Gutenberg corpora. Readers are encouraged to follow the sections in the ch-09.ipynb file.
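If you would rather skip the GUI, you can also fetch individual packages by name. The following is a minimal sketch that downloads only the corpora used in this article; the identifiers passed to nltk.download() are the standard NLTK package ids:

import nltk

# Download only the corpora used in this article instead of the full collection
for corpus in ('stopwords', 'movie_reviews', 'names', 'gutenberg'):
    nltk.download(corpus)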

Filtering out stopwords, names, and numbers

Stopwords are common words that have very low information value in a text. It is a common practice in text analysis to get rid of stopwords. NLTK has stopwords corpora for a number of languages. Load the English stopwords corpus and print some of the words:

sw = set(nltk.corpus.stopwords.words('english'))
print("Stop words:", list(sw)[:7])

The following common words are printed:

Stop words: ['between', 'who', 'such', 'ourselves', 'an', 'ain', 'ours']

Note that all the words in this corpus are in lowercase.

NLTK also has a Gutenberg corpus. The Gutenberg project is a digital library of books, mostly with expired copyright, which are available for free on the Internet (see http://www.gutenberg.org/).

Load the Gutenberg corpus and print some of its filenames:

gb = nltk.corpus.gutenberg

print("Gutenberg files:n", gb.fileids()[-5:])

Some of the titles printed may be familiar to you:

Gutenberg files:
 ['milton-paradise.txt', 'shakespeare-caesar.txt', 'shakespeare-hamlet.txt', 'shakespeare-macbeth.txt', 'whitman-leaves.txt']

Extract the first couple of sentences from the milton-paradise.txt file, which we will filter later:

text_sent = gb.sents("milton-paradise.txt")[:2]
print("Unfiltered:", text_sent)

The following sentences are printed:

Unfiltered: [['[', 'Paradise', 'Lost', 'by', 'John', 'Milton', '1667', ']'], ['Book', 'I']]

Now, filter out the stopwords as follows:

for sent in text_sent:
    filtered = [w for w in sent if w.lower() not in sw]
    print("Filtered:\n", filtered)

For the first sentence, we get the following output:

Filtered:
 ['[', 'Paradise', 'Lost', 'John', 'Milton', '1667', ']']

If we compare this with the previous snippet, we notice that the word by has been filtered out as it was found in the stopwords corpus. Sometimes, we want to remove numbers and names too. We can remove words based on part of speech (POS) tags. In this tagging scheme, numbers correspond to the cardinal number (CD) tag. Names correspond to the proper noun singular (NNP) tag. Tagging is an inexact process based on heuristics. It’s a big topic that deserves an entire book. Tag the filtered text with the pos_tag() function:

tagged = nltk.pos_tag(filtered)
print("Tagged:\n", tagged)

For our text, we get the following tags:

Tagged:
 [('[', 'NN'), ('Paradise', 'NNP'), ('Lost', 'NNP'), ('John', 'NNP'), ('Milton', 'NNP'), ('1667', 'CD'), (']', 'CD')]

The pos_tag() function returns a list of tuples, where the second element in each tuple is the tag. As you can see, some of the words are tagged as NNP, although they probably shouldn’t be. The heuristic here is to tag words as NNP if the first character of a word is uppercase. If we set all the words to be lowercase, we will get a different result. This is left as an exercise for the reader. It’s easy to remove the words in the list with the NNP and CD tags, as described in the following code:

words = []

for word in tagged:
    if word[1] != 'NNP' and word[1] != 'CD':
        words.append(word[0])

print(words)
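As a starting point for the lowercase exercise mentioned above, here is a minimal sketch that tags the same filtered words after lowercasing them. The exact tags depend on the tagger, so treat the result as something to experiment with rather than a fixed output:

# Lowercase the words before tagging; 'Paradise' and 'Milton' will no longer
# trigger the uppercase-first-letter heuristic for the NNP tag
lower_tagged = nltk.pos_tag([w.lower() for w in filtered])
print("Lowercased tags:", lower_tagged)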

Have a look at the ch-09.ipynb file in the book’s code bundle:

import nltk

sw = set(nltk.corpus.stopwords.words('english'))
print("Stop words:", list(sw)[:7])

gb = nltk.corpus.gutenberg
print("Gutenberg files:\n", gb.fileids()[-5:])

text_sent = gb.sents("milton-paradise.txt")[:2]
print("Unfiltered:", text_sent)

for sent in text_sent:
    filtered = [w for w in sent if w.lower() not in sw]
    print("Filtered:\n", filtered)

    tagged = nltk.pos_tag(filtered)
    print("Tagged:\n", tagged)

    words = []
    for word in tagged:
        if word[1] != 'NNP' and word[1] != 'CD':
            words.append(word[0])

    print("Words:\n", words)

The bag-of-words model

In the bag-of-words model, we turn a document into a bag containing the words found in that document. In this model, we don’t care about the word order. For each word in the document, we count the number of occurrences. With these word counts, we can do statistical analysis, for instance, to identify spam in e-mail messages.
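To see the idea on a toy document before we build real feature vectors, here is a small sketch using only the standard library; the sentence is made up purely for illustration:

from collections import Counter

# A hypothetical toy document; word order is discarded, only counts remain
doc = "to be or not to be".split()
bag = Counter(doc)
print(bag)  # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})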

If we have a group of documents, we can view each unique word in the corpus as a feature; here, feature means parameter or variable. Using all the word counts, we can build a feature vector for each document; vector is used here in the mathematical sense. If a word is present in the corpus but not in the document, the value of this feature will be 0. Surprisingly, NLTK doesn’t currently have a handy utility to create a feature vector. However, the machine learning Python library, scikit-learn, does have a CountVectorizer class that we can use.

Load two text documents from the NLTK Gutenberg corpus:

hamlet = gb.raw("shakespeare-hamlet.txt")
macbeth = gb.raw("shakespeare-macbeth.txt")

Create the feature vector by omitting English stopwords:

cv = sk.feature_extraction.text.CountVectorizer(stop_words='english')
print("Feature vector:\n", cv.fit_transform([hamlet, macbeth]).toarray())

These are the feature vectors for the two documents:

Feature vector:
[[ 1 0 1 ..., 14 0 1]
 [ 0 1 0 ..., 1 1 0]]

Print a small selection of the features (unique words) that we found:

print("Features:n", cv.get_feature_names()[:5])

The features are given in alphabetical order:

Features:

['1599', '1603', 'abhominably', 'abhorred', 'abide']
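If you want the count of one specific word rather than the whole matrix, the fitted vectorizer exposes a vocabulary_ dictionary that maps each feature to its column index. The following sketch assumes the cv, hamlet, and macbeth objects from the previous snippets; the word 'king' is just an example:

counts = cv.fit_transform([hamlet, macbeth])
col = cv.vocabulary_.get('king')
if col is not None:
    # Row 0 corresponds to Hamlet, row 1 to Macbeth
    print("'king' in Hamlet:", counts[0, col])
    print("'king' in Macbeth:", counts[1, col])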

Have a look at the ch-09.ipynb file in this book’s code bundle:

import nltk
import sklearn as sk
# Importing the submodule makes sk.feature_extraction.text available
import sklearn.feature_extraction.text

gb = nltk.corpus.gutenberg
hamlet = gb.raw("shakespeare-hamlet.txt")
macbeth = gb.raw("shakespeare-macbeth.txt")

cv = sk.feature_extraction.text.CountVectorizer(stop_words='english')
print("Feature vector:\n", cv.fit_transform([hamlet, macbeth]).toarray())
print("Features:\n", cv.get_feature_names()[:5])

Analyzing word frequencies

The NLTK FreqDist class encapsulates a dictionary of words and counts for a given list of words. Load the Gutenberg text of Julius Caesar by William Shakespeare. Let’s filter out the stopwords and punctuation:

punctuation = set(string.punctuation)

filtered = [w.lower() for w in words if w.lower() not in sw and w.lower() not in punctuation]

Create a FreqDist object and print the associated keys and values with the highest frequency:

fd = nltk.FreqDist(filtered)
common = fd.most_common(5)  # (word, count) pairs, most frequent first
print("Words", [word for word, count in common])
print("Counts", [count for word, count in common])

The keys and values are printed as follows:

Words ['d', 'caesar', 'brutus', 'bru', 'haue']
Counts [215, 190, 161, 153, 148]

The first word in this list is, of course, not an English word, so we may need to add the heuristic that words have a minimum of two characters. The NLTK FreqDist class allows dictionary-like access, but it also has convenience methods. Get the word with the highest frequency and related count:

print("Max", fd.max())

print("Count", fd['d'])

The following result shouldn’t be a surprise:

Max d
Count 215
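As suggested above, here is a minimal sketch of the extra heuristic that keeps only words of at least two characters; it reuses words, sw, and punctuation from the snippets above, and the exact ranking you get back depends on the corpus:

filtered = [w.lower() for w in words
            if w.lower() not in sw
            and w.lower() not in punctuation
            and len(w) > 1]
fd = nltk.FreqDist(filtered)
print("Max", fd.max())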

Up until this point, the analysis has focused on single words, but we can extend the analysis to word pairs and triplets. These are also called bigrams and trigrams. We can find them with the bigrams() and trigrams() functions. Repeat the analysis, but this time for bigrams:

fd = nltk.FreqDist(nltk.bigrams(filtered))
common = fd.most_common(5)
print("Bigrams", [pair for pair, count in common])
print("Counts", [count for pair, count in common])
print("Bigram Max", fd.max())
print("Bigram count", fd[('let', 'vs')])

The following output should be printed:

Bigrams [('let', 'vs'), ('wee', 'l'), ('mark', 'antony'), ('marke', 'antony'), ('st', 'thou')]

Counts [16, 15, 13, 12, 12]

Bigram Max ('let', 'vs')
Bigram count 16
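The trigrams() function mentioned earlier works the same way. A short sketch follows; the output is not shown in the book, so treat the exact results as corpus-dependent:

fd = nltk.FreqDist(nltk.trigrams(filtered))
print("Trigram Max", fd.max())
print("Trigram count", fd[fd.max()])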

Have a peek at the ch-09.ipynb file in this book’s code bundle:

import nltk
import string

gb = nltk.corpus.gutenberg
words = gb.words("shakespeare-caesar.txt")

sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)
filtered = [w.lower() for w in words
            if w.lower() not in sw and w.lower() not in punctuation]

fd = nltk.FreqDist(filtered)
common = fd.most_common(5)  # (word, count) pairs, most frequent first
print("Words", [word for word, count in common])
print("Counts", [count for word, count in common])
print("Max", fd.max())
print("Count", fd['d'])

fd = nltk.FreqDist(nltk.bigrams(filtered))
common = fd.most_common(5)
print("Bigrams", [pair for pair, count in common])
print("Counts", [count for pair, count in common])
print("Bigram Max", fd.max())
print("Bigram count", fd[('let', 'vs')])

Naive Bayes classification

Classification algorithms are a type of machine learning algorithm that determine the class (category or type) of a given item. For instance, we could try to determine the genre of a movie based on some features. In this case, the genre is the class to be predicted. In this section, we will discuss a popular algorithm called Naive Bayes classification, which is frequently used to analyze text documents.

Naive Bayes classification is a probabilistic algorithm based on the Bayes theorem from probability theory and statistics. The Bayes theorem formulates how to discount the probability of an event based on new evidence. For example, imagine that we have a bag with pieces of chocolate and other items we can’t see. We will call the probability of drawing a piece of dark chocolate P(D). We will denote the probability of drawing a piece of chocolate as P(C). Of course, the total probability is always 1, so P(D) and P(C) can be at most 1. The Bayes theorem states that the posterior probability is proportional to the prior probability times the likelihood:

P(D|C) = P(C|D) P(D) / P(C)

P(D|C) in the preceding notation means the probability of event D given C. When we haven’t drawn any items, P(D) = 0.5 because we don’t have any information yet. To actually apply the formula, we need to know P(C|D) and P(C), or we have to determine those indirectly.
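To see the formula in action, here is a sketch with purely hypothetical numbers for the chocolate example; none of these values come from the book:

# Hypothetical values, for illustration only
p_d = 0.5          # prior: probability of drawing dark chocolate
p_c = 0.6          # evidence: probability of drawing any chocolate
p_c_given_d = 1.0  # likelihood: every dark piece is chocolate
p_d_given_c = p_c_given_d * p_d / p_c
print(p_d_given_c)  # roughly 0.83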

Naive Bayes classification is called naive because it makes the simplifying assumption of independence between features. In practice, the results are usually pretty good, so this assumption is often warranted to some degree. Recently, it was found that there are theoretical reasons why the assumption makes sense. However, since machine learning is a rapidly evolving field, algorithms with (slightly) better performance have since been invented.
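To make the independence assumption concrete in the same notation, for a class C and features f1, ..., fn the naive Bayes model assumes the likelihood factorizes, so the posterior is proportional to:

P(C | f1, ..., fn) ∝ P(C) × P(f1 | C) × P(f2 | C) × ... × P(fn | C)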

Let’s try to classify words as stopwords or punctuation. As a feature, we will use the word length, since stopwords and punctuation tend to be short.

This setup leads us to define the following functions:

def word_features(word):
    return {'len': len(word)}

def isStopword(word):
    return word in sw or word in punctuation

Label the words in the Gutenberg shakespeare-caesar.txt file based on whether or not they are stopwords:

labeled_words = ([(word.lower(), isStopword(word.lower())) for word in words])

random.seed(42)
random.shuffle(labeled_words)
print(labeled_words[:5])

The first five labeled words appear as follows:

[('was', True), ('greeke', False), ('cause', False), ('but', True), ('house', False)]

For each word, determine its length:

featuresets = [(word_features(n), word) for (n, word) in labeled_words]

We will train a naive Bayes classifier on 90 percent of the words and test the remaining 10 percent. Create the train and the test set, and train the data:

cutoff = int(.9 * len(featuresets))
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]
classifier = nltk.NaiveBayesClassifier.train(train_set)

We can now check how the classifier labels the words in the sets:

print("'behold' class", classifier.classify(word_features('behold')))
print("'the' class", classifier.classify(word_features('the')))

Fortunately, the words are properly classified:

'behold' class False
'the' class True

Determine the classifier accuracy on the test set as follows:

print("Accuracy", nltk.classify.accuracy(classifier, test_set))

We get a high accuracy of around 85 percent for this classifier. Print an overview of the most informative features:

print(classifier.show_most_informative_features(5))

The overview shows the word lengths that are most useful for the classification process.


The code is in the ch-09.ipynb file in this book’s code bundle:

import nltk
import string
import random

sw = set(nltk.corpus.stopwords.words('english'))
punctuation = set(string.punctuation)

def word_features(word):
    return {'len': len(word)}

def isStopword(word):
    return word in sw or word in punctuation

gb = nltk.corpus.gutenberg
words = gb.words("shakespeare-caesar.txt")

labeled_words = ([(word.lower(), isStopword(word.lower())) for word in words])
random.seed(42)
random.shuffle(labeled_words)
print(labeled_words[:5])

featuresets = [(word_features(n), word) for (n, word) in labeled_words]
cutoff = int(.9 * len(featuresets))
train_set, test_set = featuresets[:cutoff], featuresets[cutoff:]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print("'behold' class", classifier.classify(word_features('behold')))
print("'the' class", classifier.classify(word_features('the')))
print("Accuracy", nltk.classify.accuracy(classifier, test_set))
print(classifier.show_most_informative_features(5))

Sentiment analysis

Opinion mining, or sentiment analysis, is a hot new research field dedicated to the automatic evaluation of opinions as expressed on social media, product review websites, or other forums. Often, we want to know whether an opinion is positive, neutral, or negative. This is, of course, a form of classification, as seen in the previous section. As such, we can apply any number of classification algorithms. Another approach is to semi-automatically (with some manual editing) compose a list of words with an associated numerical sentiment score (the word “good” can have a score of 5 and the word “bad” a score of -5). If we have such a list, we can look up all the words in a text document and, for example, sum up all the sentiment scores of the words we find. The number of classes can be more than three, as in a five-star rating scheme.
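As a tiny illustration of the word-list approach, here is a sketch with a hypothetical four-word lexicon; real sentiment lexicons contain thousands of scored words and handle things such as negation, which this does not:

# Hypothetical mini-lexicon; the scores are made up for illustration
lexicon = {'good': 5, 'great': 4, 'bad': -5, 'boring': -3}

def lexicon_score(text):
    # Sum the scores of all known words; unknown words score 0
    return sum(lexicon.get(word, 0) for word in text.lower().split())

print(lexicon_score("a good story but a boring film"))  # prints 2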

We will apply naive Bayes classification to the NLTK movie reviews corpus with the goal of classifying movie reviews as either positive or negative. First, we will load the corpus and filter out stopwords and punctuation. These steps will be omitted, since we have performed them before. You may consider more elaborate filtering schemes, but keep in mind that excessive filtering may hurt accuracy. Label the movie review documents using the categories() method:

labeled_docs = [(list(movie_reviews.words(fid)), cat)
                for cat in movie_reviews.categories()
                for fid in movie_reviews.fileids(cat)]

The complete corpus has tens of thousands of unique words that we can use as features. However, using all these words might be inefficient. Select the top 5 percent of the most frequent words:

words = FreqDist(filtered)
N = int(.05 * len(words))
# Keep the N most frequent words as features
word_features = [word for word, count in words.most_common(N)]

For each document, we can extract features using a number of methods, including the following:

  • Check whether the given document has a word or not
  • Determine the number of occurrences of a word for a given document
  • Normalize word counts so that the maximum normalized word count will be less than or equal to 1
  • Take the logarithm of counts plus 1 (to avoid taking the logarithm of zero)
  • Combine all the previous points into one metric

As the saying goes, all roads lead to Rome. Of course, some roads are safer and will bring you to Rome faster. Define the following function, which uses raw word counts as a metric:

def doc_features(doc):
    doc_words = FreqDist(w for w in doc if not isStopWord(w))
    features = {}
    for word in word_features:
        features['count (%s)' % word] = doc_words.get(word, 0)
    return features
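For comparison, here is a sketch of the log-count variant from the list above; it is not part of the book’s code, it reuses FreqDist, isStopWord, and word_features from the surrounding example, and whether it improves accuracy is something you would have to test:

import math

def doc_features_log(doc):
    # Same structure as doc_features(), but with log(count + 1) as the metric
    doc_words = FreqDist(w for w in doc if not isStopWord(w))
    features = {}
    for word in word_features:
        features['logcount (%s)' % word] = math.log(doc_words.get(word, 0) + 1)
    return features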

We can now train our classifier just as we did in the previous example. An accuracy of 78 percent is reached, which is decent and comes close to what is possible with sentiment analysis. Research has found that even humans don’t always agree on the sentiment of a given document (see http://mashable.com/2010/04/19/sentiment-analysis/), and therefore, we can’t have a 100 percent perfect accuracy with sentiment analysis software.

The most informative features are printed as well.

If we go through this list, we find obvious positive words such as “wonderful” and “outstanding”. The words “bad”, “stupid”, and “boring” are the obvious negative words. It would be interesting to analyze the remaining features. This is left as an exercise for the reader. Refer to the sentiment.py file in this book’s code bundle:

import random
import string
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk import FreqDist
from nltk import NaiveBayesClassifier
from nltk.classify import accuracy

labeled_docs = [(list(movie_reviews.words(fid)), cat)
                for cat in movie_reviews.categories()
                for fid in movie_reviews.fileids(cat)]
random.seed(42)
random.shuffle(labeled_docs)

review_words = movie_reviews.words()
print("# Review Words", len(review_words))

sw = set(stopwords.words('english'))
punctuation = set(string.punctuation)

def isStopWord(word):
    return word in sw or word in punctuation

filtered = [w.lower() for w in review_words if not isStopWord(w.lower())]
print("# After filter", len(filtered))

words = FreqDist(filtered)
N = int(.05 * len(words))
# Keep the N most frequent words as features
word_features = [word for word, count in words.most_common(N)]

def doc_features(doc):
    doc_words = FreqDist(w for w in doc if not isStopWord(w))
    features = {}
    for word in word_features:
        features['count (%s)' % word] = doc_words.get(word, 0)
    return features

featuresets = [(doc_features(d), c) for (d, c) in labeled_docs]
train_set, test_set = featuresets[200:], featuresets[:200]
classifier = NaiveBayesClassifier.train(train_set)
print("Accuracy", accuracy(classifier, test_set))
print(classifier.show_most_informative_features())

We covered textual analysis and learned that it’s a best practice to get rid of stopwords.

In the bag-of-words model, we used a document to create a bag containing words found in that same document. We learned how to build a feature vector for each document using all the word counts.

Classification algorithms are a type of machine learning algorithm that determines the class of a given item. Naive Bayes classification is a probabilistic algorithm based on the Bayes theorem from probability theory and statistics. The Bayes theorem states that the posterior probability is proportional to the prior probability multiplied by the likelihood.

If you liked this post, check out the book Python Data Analysis – Second Edition to learn more about analyzing other forms of textual data and social media analysis.
