
Classifying Text


In this article by Jacob Perkins, author of Python 3 Text Processing with NLTK 3 Cookbook, we will learn how to transform text into feature dictionaries, and how to train a text classifier for sentiment analysis.


Bag of words feature extraction

Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier. The NLTK classifiers expect dict style feature sets, so we must transform our text into a dict. The bag of words model is the simplest method; it constructs a word presence feature set from all the words of an instance. This method doesn't care about the order of the words or how many times a word occurs; all that matters is whether the word is present in a list of words.

How to do it…

The idea is to convert a list of words into a dict, where each word becomes a key with the value True. The bag_of_words() function in featx.py looks like this:

def bag_of_words(words):
  return dict([(word, True) for word in words])

We can use it with a list of words; in this case, the tokenized sentence the quick brown fox:

>>> from featx import bag_of_words
>>> bag_of_words(['the', 'quick', 'brown', 'fox'])
{'quick': True, 'brown': True, 'the': True, 'fox': True}

The resulting dict is known as a bag of words because the words are not in order, and it doesn’t matter where in the list of words they occurred, or how many times they occurred. All that matters is that the word is found at least once.

You can use values other than True, but it is important to keep in mind that the NLTK classifiers learn from the unique combination of (key, value). That means that ('fox', 1) is treated as a different feature than ('fox', 2).
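To make this concrete, here's a sketch of a count-based variant. The bag_of_words_counts() function below is hypothetical (it is not part of featx.py); it only illustrates how using counts as values would make ('fox', 1) and ('fox', 2) distinct features:

from collections import Counter

def bag_of_words_counts(words):
  # Hypothetical variant: values are occurrence counts instead of True.
  return dict(Counter(words))

>>> bag_of_words_counts(['the', 'quick', 'brown', 'fox', 'the'])
{'the': 2, 'quick': 1, 'brown': 1, 'fox': 1}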

How it works…

The bag_of_words() function is a very simple list comprehension that constructs a dict from the given words, where every word gets the value True.

Since we have to assign a value to each word in order to create a dict, True is a logical choice to indicate word presence. If we knew the universe of all possible words, we could assign the value False to every word not in the given list. But most of the time, we don't know all the possible words beforehand, and the dict that would result from assigning False to every possible word would be very large (assuming all words in the English language are possible). So instead, to keep feature extraction simple and use less memory, we stick to assigning the value True to all words that occur at least once. We don't assign the value False to any word, since we don't know what the set of possible words is; we only know about the words we are given.

There’s more…

In the default bag of words model, all words are treated equally. But that’s not always a good idea. As we already know, some words are so common that they are practically meaningless. If you have a set of words that you want to exclude, you can use the bag_of_words_not_in_set() function in featx.py:

def bag_of_words_not_in_set(words, badwords):
  return bag_of_words(set(words) - set(badwords))

This function can be used, among other things, to filter stopwords. Here’s an example where we filter the word the from the quick brown fox:

>>> from featx import bag_of_words_not_in_set
>>> bag_of_words_not_in_set(['the', 'quick', 'brown', 'fox'], ['the'])
{'quick': True, 'brown': True, 'fox': True}

As expected, the resulting dict has quick, brown, and fox, but not the.

Filtering stopwords

Stopwords are words that are often useless in NLP, in that they don’t convey much meaning, such as the word the. Here’s an example of using the bag_of_words_not_in_set() function to filter all English stopwords:

from nltk.corpus import stopwords
def bag_of_non_stopwords(words, stopfile='english'):
  badwords = stopwords.words(stopfile)
  return bag_of_words_not_in_set(words, badwords)

You can pass a different language filename as the stopfile keyword argument if you are using a language other than English. Using this function produces the same result as the previous example:

>>> from featx import bag_of_non_stopwords
>>> bag_of_non_stopwords(['the', 'quick', 'brown', 'fox'])
{'quick': True, 'brown': True, 'fox': True}

Here, the is a stopword, so it is not present in the returned dict.
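The stopwords corpus includes files for many languages. Here's a sketch using the Dutch stopword list, assuming it is present in your installed NLTK data (key order in the output may vary):

>>> from nltk.corpus import stopwords
>>> 'dutch' in stopwords.fileids()
True
>>> bag_of_non_stopwords(['de', 'snelle', 'bruine', 'vos'], stopfile='dutch')
{'snelle': True, 'bruine': True, 'vos': True}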

Including significant bigrams

In addition to single words, it often helps to include significant bigrams. As significant bigrams are less common than most individual words, including them in the bag of words model can help the classifier make better decisions. We can use the BigramCollocationFinder class to find significant bigrams. The bag_of_bigrams_words() function found in featx.py will return a dict of all words along with the 200 most significant bigrams:

from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
def bag_of_bigrams_words(words, score_fn=BigramAssocMeasures.chi_sq, n=200):
  bigram_finder = BigramCollocationFinder.from_words(words)
  bigrams = bigram_finder.nbest(score_fn, n)
  return bag_of_words(words + bigrams)

The bigrams will be present in the returned dict as (word1, word2) keys with the value True. Using the same example words as we did earlier, we get all words plus every bigram:

>>> from featx import bag_of_bigrams_words
>>> bag_of_bigrams_words(['the', 'quick', 'brown', 'fox'])
{'brown': True, ('brown', 'fox'): True, ('the', 'quick'): True, 'fox': True, ('quick', 'brown'): True, 'quick': True, 'the': True}

You can change the maximum number of bigrams found by altering the keyword argument n.
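You can also swap in a different scoring function from BigramAssocMeasures. Here's a sketch using the likelihood ratio measure instead of the default chi-square; for this tiny example the resulting keys are the same as before (and key order may vary), but on real data the chosen bigrams can differ:

>>> from nltk.metrics import BigramAssocMeasures
>>> bag_of_bigrams_words(['the', 'quick', 'brown', 'fox'],
...     score_fn=BigramAssocMeasures.likelihood_ratio, n=100)
{'brown': True, ('brown', 'fox'): True, ('the', 'quick'): True, 'fox': True, ('quick', 'brown'): True, 'quick': True, 'the': True}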

See also

In the next recipe, we will train a NaiveBayesClassifier class using feature sets created with the bag of words model.

Training a Naive Bayes classifier

Now that we can extract features from text, we can train a classifier. The easiest classifier to get started with is the NaiveBayesClassifier class. It uses Bayes' theorem to predict the probability that a given feature set belongs to a particular label. The formula is:

P(label | features) = P(label) * P(features | label) / P(features)

The following list describes the various parameters from the previous formula:

  • P(label): This is the prior probability of the label occurring, which is the likelihood that a random feature set will have the label. This is based on the number of training instances with the label compared to the total number of training instances. For example, if 60/100 training instances have the label, the prior probability of the label is 60%.
  • P(features | label): This is the probability of the given feature set occurring, given the label. This is based on which features have occurred with each label in the training data.
  • P(features): This is the prior probability of a given feature set occurring. This is the likelihood of a random feature set being the same as the given feature set, and is based on the observed feature sets in the training data. For example, if the given feature set occurs twice in 100 training instances, the prior probability is 2%.
  • P(label | features): This tells us the probability that the given features should have that label. If this value is high, then we can be reasonably confident that the label is correct for the given features.
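As a quick sanity check on the formula, here's a toy calculation with made-up numbers (they are illustrative only, not taken from any real corpus):

>>> p_label = 0.6               # P(label): 60 of 100 training instances have the label
>>> p_feats_given_label = 0.03  # P(features | label)
>>> p_feats = 0.02              # P(features)
>>> round(p_label * p_feats_given_label / p_feats, 2)  # P(label | features)
0.9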

Getting ready

We are going to be using the movie_reviews corpus for our initial classification examples. This corpus contains two categories of text: pos and neg. These categories are exclusive, which makes a classifier trained on them a binary classifier. Binary classifiers have only two classification labels, and will always choose one or the other.

Each file in the movie_reviews corpus contains a single positive or negative movie review. We will be using each file as a single instance for both training and testing the classifier. Because of the nature of the text and its categories, the classification we will be doing is a form of sentiment analysis. If the classifier returns pos, then the text expresses a positive sentiment, whereas if we get neg, then the text expresses a negative sentiment.

How to do it…

For training, we need to first create a list of labeled feature sets. This list should be of the form [(featureset, label)], where the featureset variable is a dict and label is the known class label for the featureset. The label_feats_from_corpus() function in featx.py takes a corpus, such as movie_reviews, and a feature_detector function, which defaults to bag_of_words. It then constructs and returns a mapping of the form {label: [featureset]}. We can use this mapping to create a list of labeled training instances and testing instances. The reason to do it this way is to get a fair sample from each label. It is important to get a fair sample, because parts of the corpus may be (unintentionally) biased towards one label or the other. Getting a fair sample should eliminate this possible bias:

import collections
def label_feats_from_corpus(corp, feature_detector=bag_of_words):
  label_feats = collections.defaultdict(list)
  for label in corp.categories():
    for fileid in corp.fileids(categories=[label]):
      feats = feature_detector(corp.words(fileids=[fileid]))
      label_feats[label].append(feats)
  return label_feats

Once we have a mapping of labels to feature sets, we want to construct a list of labeled training instances and testing instances. The split_label_feats() function in featx.py takes a mapping returned from label_feats_from_corpus() and splits each list of feature sets into labeled training and testing instances:

def split_label_feats(lfeats, split=0.75):
  train_feats = []
  test_feats = []
  for label, feats in lfeats.items():
    cutoff = int(len(feats) * split)
    train_feats.extend([(feat, label) for feat in feats[:cutoff]])
    test_feats.extend([(feat, label) for feat in feats[cutoff:]])
  return train_feats, test_feats

Using these functions with the movie_reviews corpus gives us the lists of labeled feature sets we need to train and test a classifier:

>>> from nltk.corpus import movie_reviews
>>> from featx import label_feats_from_corpus, split_label_feats
>>> movie_reviews.categories()
['neg', 'pos']
>>> lfeats = label_feats_from_corpus(movie_reviews)
>>> lfeats.keys()
dict_keys(['neg', 'pos'])
>>> train_feats, test_feats = split_label_feats(lfeats, split=0.75)
>>> len(train_feats)
1500
>>> len(test_feats)
500

So there are 1000 pos files, 1000 neg files, and we end up with 1500 labeled training instances and 500 labeled testing instances, each composed of equal parts of pos and neg. If we were using a different dataset, where the classes were not balanced, our training and testing data would have the same imbalance.
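Note that split_label_feats() takes the first 75% of each label's feature sets in corpus order. If you are concerned about ordering effects within a label, you could shuffle before splitting. The following split_label_feats_shuffled() function is a hypothetical variant (it is not in featx.py), with a fixed seed so the split is repeatable:

import random

def split_label_feats_shuffled(lfeats, split=0.75, seed=0):
  # Like split_label_feats(), but shuffles each label's feature sets
  # before taking the train/test cutoff.
  train_feats = []
  test_feats = []
  rand = random.Random(seed)
  for label, feats in lfeats.items():
    feats = list(feats)
    rand.shuffle(feats)
    cutoff = int(len(feats) * split)
    train_feats.extend([(feat, label) for feat in feats[:cutoff]])
    test_feats.extend([(feat, label) for feat in feats[cutoff:]])
  return train_feats, test_feats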

Now we can train a NaiveBayesClassifier class using its train() class method:

>>> from nltk.classify import NaiveBayesClassifier
>>> nb_classifier = NaiveBayesClassifier.train(train_feats)
>>> nb_classifier.labels()
['neg', 'pos']

Let's test the classifier on a couple of made-up reviews. The classify() method takes a single argument, which should be a feature set. We can use the same bag_of_words() feature detector on a list of words to get our feature set:

>>> from featx import bag_of_words
>>> negfeat = bag_of_words(['the', 'plot', 'was', 'ludicrous'])
>>> nb_classifier.classify(negfeat)
'neg'
>>> posfeat = bag_of_words(['kate', 'winslet', 'is', 'accessible'])
>>> nb_classifier.classify(posfeat)
'pos'

How it works…

The label_feats_from_corpus() function assumes that the corpus is categorized, and that a single file represents a single instance for feature extraction. It iterates over each category label, and extracts features from each file in that category using the feature_detector() function, which defaults to bag_of_words(). It returns a dict whose keys are the category labels, and the values are lists of instances for that category.

If we had label_feats_from_corpus() return a list of labeled feature sets instead of a dict, it would be much harder to get balanced training data. The list would be ordered by label, and if you took a slice of it, you would almost certainly be getting far more of one label than another. By returning a dict, you can take slices from the feature sets of each label, in the same proportion that exists in the data.

Now we need to split the labeled feature sets into training and testing instances using split_label_feats(). This function allows us to take a fair sample of labeled feature sets from each label, using the split keyword argument to determine the size of the sample. The split argument defaults to 0.75, which means the first 75% of the labeled feature sets for each label will be used for training, and the remaining 25% will be used for testing.

Once we have our training and testing feature sets split up, we train a classifier using the NaiveBayesClassifier.train() class method. This class method builds two probability distributions for calculating prior probabilities, which are passed into the NaiveBayesClassifier constructor. The label_probdist argument contains the prior probability of each label, or P(label). The feature_probdist argument contains P(feature name = feature value | label). In our case, it will store P(word=True | label). Both are calculated based on the frequency of occurrence of each label, and of each feature name and value, in the training data.

The NaiveBayesClassifier class inherits from ClassifierI, which requires subclasses to provide a labels() method, and at least one of the classify() or prob_classify() methods.
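To make the interface concrete, here's a minimal sketch of a ClassifierI subclass. The SingleLabelClassifier class below is a hypothetical toy, not part of NLTK or featx.py; it exists only to show which methods the interface requires:

from nltk.classify import ClassifierI

class SingleLabelClassifier(ClassifierI):
  # Toy classifier that ignores its input and always predicts one label.
  def __init__(self, label):
    self._label = label

  def labels(self):
    return [self._label]

  def classify(self, featureset):
    return self._label

>>> classifier = SingleLabelClassifier('pos')
>>> classifier.classify({'anything': True})
'pos'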

There’s more…

We can test the accuracy of the classifier using nltk.classify.util.accuracy() and the test_feats variable created previously:

>>> from nltk.classify.util import accuracy
>>> accuracy(nb_classifier, test_feats)
0.728

This tells us that the classifier correctly guessed the label of nearly 73% of the test feature sets.

The code in this article is run with the PYTHONHASHSEED=0 environment variable so that accuracy calculations are consistent. If you run the code with a different value for PYTHONHASHSEED, or without setting this environment variable, your accuracy values may differ.

Classification probability

While the classify() method returns only a single label, you can use the prob_classify() method to get the classification probability of each label. This can be useful if you want to use probability thresholds for classification:

>>> probs = nb_classifier.prob_classify(test_feats[0][0])
>>> probs.samples()
dict_keys(['neg', 'pos'])
>>> probs.max()
'pos'
>>> probs.prob('pos')
0.9999999646430913
>>> probs.prob('neg')
3.535688969240647e-08

In this case, the classifier says that the first test instance is nearly 100% likely to be pos. Other instances may have more mixed probabilities. For example, if the classifier says an instance is 60% pos and 40% neg, that means the classifier is 60% sure the instance is pos, but there is a 40% chance that it is neg. It can be useful to know this for situations where you only want to use strongly classified instances, with a threshold of 80% or greater.
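Here's a sketch of such a thresholded classification. The classify_with_threshold() helper is hypothetical (not part of NLTK or featx.py); it returns None when the classifier is not confident enough:

def classify_with_threshold(classifier, featureset, threshold=0.8):
  # Return the most likely label only if its probability meets the threshold.
  probs = classifier.prob_classify(featureset)
  label = probs.max()
  if probs.prob(label) >= threshold:
    return label
  return None

>>> classify_with_threshold(nb_classifier, test_feats[0][0])
'pos'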

Most informative features

The NaiveBayesClassifier class has two methods that are quite useful for learning about your data. Both methods take a keyword argument n to control how many results to show. The most_informative_features() method returns a list of the form [(feature name, feature value)] ordered by most informative to least informative. In our case, the feature value will always be True:

>>> nb_classifier.most_informative_features(n=5)
[('magnificent', True), ('outstanding', True), ('insulting', True),
('vulnerable', True), ('ludicrous', True)]

The show_most_informative_features() method will print out the results from most_informative_features() and will also include the probability of a feature pair belonging to each label:

>>> nb_classifier.show_most_informative_features(n=5)
Most Informative Features

    magnificent = True    pos : neg = 15.0 : 1.0
    outstanding = True    pos : neg = 13.6 : 1.0
    insulting = True      neg : pos = 13.0 : 1.0
    vulnerable = True     pos : neg = 12.3 : 1.0
    ludicrous = True      neg : pos = 11.8 : 1.0

The informativeness, or information gain, of each feature pair is based on the prior probability of the feature pair occurring for each label. More informative features are those that occur primarily with one label and not the other. Less informative features are those that occur frequently with both labels. Another way to state this is that the entropy of the classifier decreases more when using a more informative feature. See https://en.wikipedia.org/wiki/Information_gain_in_decision_trees for more on information gain and entropy (while it specifically covers decision trees, the same concepts are applicable to all classifiers).

Training estimator

During training, the NaiveBayesClassifier class constructs probability distributions for each feature using an estimator parameter, which defaults to nltk.probability.ELEProbDist. The estimator is used to calculate the probability of a feature value occurring, given the label. In ELEProbDist, ELE stands for Expected Likelihood Estimate, and the formula for calculating the label probabilities for a given feature is (c+0.5)/(N+B/2). Here, c is the count of times a single feature occurs, N is the total number of feature outcomes observed, and B is the number of bins or unique features in the feature set. In cases where the feature values are all True, N == B. In other cases, where the number of times a feature occurs is recorded, then N >= B.
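To see the formula in action, here's a small check using ELEProbDist directly, with made-up counts: c = 3 occurrences of a single feature, B = 2 bins, and N = 3 since only one outcome was observed:

>>> from nltk.probability import FreqDist, ELEProbDist
>>> fd = FreqDist({'magnificent': 3})
>>> probdist = ELEProbDist(fd, bins=2)
>>> probdist.prob('magnificent')  # (3 + 0.5) / (3 + 2/2) = 0.875
0.875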

You can use any estimator parameter you want, and there are quite a few to choose from. The only constraints are that it must inherit from nltk.probability.ProbDistI and its constructor must take a bins keyword argument. Here's an example using the LaplaceProbDist class, which uses the formula (c+1)/(N+B):

>>> from nltk.probability import LaplaceProbDist
>>> nb_classifier = NaiveBayesClassifier.train(train_feats, estimator=LaplaceProbDist)
>>> accuracy(nb_classifier, test_feats)
0.716

As you can see, accuracy is slightly lower, so choose your estimator parameter carefully. You cannot use nltk.probability.MLEProbDist as the estimator, or any ProbDistI subclass that does not take the bins keyword argument. Training will fail with TypeError: __init__() got an unexpected keyword argument 'bins'.

Manual training

You don’t have to use the train() class method to construct a NaiveBayesClassifier. You can instead create the label_probdist and feature_probdist variables manually. The label_probdist variable should be an instance of ProbDistI, and should contain the prior probabilities for each label. The feature_probdist variable should be a dict whose keys are tuples of the form (label, feature name) and whose values are instances of ProbDistI that have the probabilities for each feature value. In our case, each ProbDistI should have only one value, True=1. Here’s a very simple example using a manually constructed DictionaryProbDist class:

>>> from nltk.probability import DictionaryProbDist
>>> label_probdist = DictionaryProbDist({'pos': 0.5, 'neg': 0.5})
>>> true_probdist = DictionaryProbDist({True: 1})
>>> feature_probdist = {('pos', 'yes'): true_probdist, ('neg', 'no'): true_probdist}
>>> classifier = NaiveBayesClassifier(label_probdist, feature_probdist)
>>> classifier.classify({'yes': True})
'pos'
>>> classifier.classify({'no': True})
'neg'

See also

In the next recipe, we will train a DecisionTreeClassifier class.

Training a decision tree classifier

The DecisionTreeClassifier class works by creating a tree structure, where each node corresponds to a feature name and the branches correspond to the feature values. Tracing down the branches, you get to the leaves of the tree, which are the classification labels.

How to do it…

Using the same train_feats and test_feats variables we created from the movie_reviews corpus in the previous recipe, we can call the DecisionTreeClassifier.train() class method to get a trained classifier. We pass binary=True because all of our features are binary: either the word is present or it’s not. For other classification use cases where you have multivalued features, you will want to stick to the default binary=False.

In this context, binary refers to feature values, and is not to be confused with a binary classifier. Our word features are binary because the value is either True, or the word is absent from the feature set. If our features could take more than two values, we would have to use binary=False. A binary classifier, on the other hand, is a classifier that only chooses between two labels. In our case, we are training a binary DecisionTreeClassifier on binary features. But it's also possible to have a binary classifier with non-binary features, or a non-binary classifier with binary features.

The following is the code for training and evaluating the accuracy of a DecisionTreeClassifier class:

>>> from nltk.classify import DecisionTreeClassifier
>>> dt_classifier = DecisionTreeClassifier.train(train_feats, binary=True,
...     entropy_cutoff=0.8, depth_cutoff=5, support_cutoff=30)
>>> accuracy(dt_classifier, test_feats)
0.688

The DecisionTreeClassifier class can take much longer to train than the NaiveBayesClassifier class. For that reason, I have overridden the default parameters so it trains faster. These parameters will be explained later.
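Once trained, you can inspect the learned decisions with the classifier's pretty_format() method. The exact output depends on your training run, so the call below is a sketch:

>>> print(dt_classifier.pretty_format(depth=2))  # prints the top two levels of the tree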

How it works…

The DecisionTreeClassifier class, like the NaiveBayesClassifier class, is also a subclass of ClassifierI, and so it shares the labels() and classify() interface described earlier.

During training, the DecisionTreeClassifier class creates a tree where the child nodes are also instances of DecisionTreeClassifier. The leaf nodes contain only a single label, while the intermediate child nodes contain decision mappings for each feature. These decisions map each feature value to another DecisionTreeClassifier, which itself may contain decisions for another feature, or it may be a final leaf node with a classification label. The train() class method builds this tree from the ground up, starting with the leaf nodes. It then refines itself to minimize the number of decisions needed to get to a label by putting the most informative features at the top.

To classify, the DecisionTreeClassifier class looks at the given feature set and traces down the tree, using known feature names and values to make decisions. Because we are creating a binary tree, each DecisionTreeClassifier instance also has a default decision tree, which it uses when a known feature is not present in the feature set being classified. This is a common occurrence in text-based feature sets, and indicates that a known word was not in the text being classified. This also contributes information towards a classification decision.

There’s more…

The parameters passed into DecisionTreeClassifier.train() can be tweaked to improve accuracy or decrease training time. Generally, if you want to improve accuracy, you must accept a longer training time, and if you want to decrease the training time, the accuracy will most likely decrease as well. But be careful not to optimize for accuracy too much. A really high accuracy may indicate overfitting, which means the classifier will be excellent at classifying the training data, but not so good on data it has never seen. See https://en.wikipedia.org/wiki/Overfitting for more on this concept.

Controlling uncertainty with entropy_cutoff

Entropy is the uncertainty of the outcome. As entropy approaches 1.0, uncertainty increases; conversely, as entropy approaches 0.0, uncertainty decreases. In other words, when the label probabilities are similar, entropy is high because each outcome is about equally likely. The more the probabilities differ, the lower the entropy.

The entropy_cutoff value is used during the tree refinement process. The tree refinement process is how the decision tree decides to create new branches. If the entropy of the probability distribution of label choices in the tree is greater than the entropy_cutoff value, then the tree is refined further by creating more branches. But if the entropy is lower than the entropy_cutoff value, then tree refinement is halted.

Entropy is calculated by giving nltk.probability.entropy() an MLEProbDist created from a FreqDist of label counts. Here's an example showing the entropy of various FreqDist values. The count of 'pos' is kept at 30, while the count of 'neg' is manipulated to show that when 'neg' is close to 'pos', entropy increases, but when it is much smaller than 'pos', entropy decreases:

>>> from nltk.probability import FreqDist, MLEProbDist, entropy
>>> fd = FreqDist({'pos': 30, 'neg': 10})
>>> entropy(MLEProbDist(fd))
0.8112781244591328
>>> fd['neg'] = 25
>>> entropy(MLEProbDist(fd))
0.9940302114769565
>>> fd['neg'] = 30
>>> entropy(MLEProbDist(fd))
1.0
>>> fd['neg'] = 1
>>> entropy(MLEProbDist(fd))
0.20559250818508304

What this all means is that if the label occurrence is very skewed one way or the other, the tree doesn’t need to be refined because entropy/uncertainty is low. But when the entropy is greater than entropy_cutoff, then the tree must be refined with further decisions to reduce the uncertainty. Higher values of entropy_cutoff will decrease both accuracy and training time.

Controlling tree depth with depth_cutoff

The depth_cutoff value is also used during refinement to control the depth of the tree. The final decision tree will never be deeper than the depth_cutoff value. The default value is 100, which means that classification may require up to 100 decisions before reaching a leaf node. Decreasing the depth_cutoff value will decrease the training time and most likely decrease the accuracy as well.

Controlling decisions with support_cutoff

The support_cutoff value controls how many labeled feature sets are required to refine the tree. As the DecisionTreeClassifier class refines itself, labeled feature sets are eliminated once they no longer provide value to the training process. When the number of labeled feature sets is less than or equal to support_cutoff, refinement stops, at least for that section of the tree.

Another way to look at it is that support_cutoff specifies the minimum number of instances required to make a decision about a feature. If support_cutoff is 20, and you have fewer than 20 labeled feature sets with a given feature, then you don't have enough instances to make a good decision, and refinement around that feature must come to a stop.
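If you want to explore this trade-off yourself, here's a sketch that trains classifiers for a few support_cutoff values and reports their accuracy. The specific values are illustrative, training each tree is slow, and your accuracy numbers will depend on your data and environment:

>>> for support in (10, 30, 50):
...     classifier = DecisionTreeClassifier.train(train_feats, binary=True,
...         entropy_cutoff=0.8, depth_cutoff=5, support_cutoff=support)
...     print(support, accuracy(classifier, test_feats))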

See also

The previous recipe covered the creation of training and test feature sets from the movie_reviews corpus.

Summary

In this article, we learned how to transform text into feature dictionaries, and how to train a text classifier for sentiment analysis.
