
This article is an excerpt taken from the book Natural Language Processing with Python Cookbook written by Krishna Bhavsar, Naresh Kumar, and Pratap Dangeti. This book will teach you how to efficiently use NLTK and implement text classification, identify parts of speech, tag words, and more. You will also learn how to analyze sentence structures and master lexical analysis, syntactic and semantic analysis, pragmatic analysis, and the application of deep learning techniques.

In this article, you will learn how to use deep neural networks to classify emails into one of 20 predefined categories based on the words present in each email. This is a simple model to start with for understanding deep learning and its applications in NLP.

Getting ready

The 20 newsgroups dataset from scikit-learn has been utilized to illustrate the concept. The number of observations/emails considered for analysis is 18,846 (11,314 train observations and 7,532 test observations), and they fall into 20 corresponding classes/categories, which are shown in the following:

>>> from sklearn.datasets import fetch_20newsgroups

>>> newsgroups_train = fetch_20newsgroups(subset='train')

>>> newsgroups_test = fetch_20newsgroups(subset='test')

>>> x_train = newsgroups_train.data

>>> x_test = newsgroups_test.data

>>> y_train = newsgroups_train.target

>>> y_test = newsgroups_test.target

>>> print ("List of all 20 categories:")

>>> print (newsgroups_train.target_names)

>>> print ("n")

>>> print ("Sample Email:")

>>> print (x_train[0])

>>> print ("Sample Target Category:")

>>> print (y_train[0])

>>> print (newsgroups_train.target_names[y_train[0]])

In the following screenshot, a sample first data observation and its target class/category are shown. From the first observation or email, we can infer that the email is talking about a two-door sports car, which we can manually classify into the autos category, the 8th class in the list.

Note: The target value is 7 because indexing starts from 0, which validates our understanding against the actual target class of 7.

List of 20 categories
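
As a quick, illustrative sanity check on the indexing (this snippet is not part of the original recipe), we can confirm that index 7 of target_names is indeed the autos category:

>>> # Illustrative check: index 7 is the 8th entry because indexing starts from 0
>>> print (newsgroups_train.target_names[7])
rec.autos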

How to do it…

Using NLP techniques, we have pre-processed the data to obtain finalized word vectors that map to the final outcome categories. The major steps involved are:

  1. Pre-processing:
  •    Removal of punctuation.
  •    Word tokenization.
  •    Converting words into lowercase.
  •    Stop word removal.
  •    Keeping words of length of at least 3.
  •    Stemming of words.
  •    POS tagging.
  •    Lemmatization of words.
  2. TF-IDF vector conversion.
  3. Deep learning model training and testing.
  4. Model evaluation and results discussion.

How it works…

The NLTK package has been utilized for all the pre-processing steps, as it provides all the necessary NLP functionality under a single roof:

# Used for pre-processing data

>>> import nltk

>>> from nltk.corpus import stopwords

>>> from nltk.stem import WordNetLemmatizer

>>> import string

>>> import pandas as pd

>>> from nltk import pos_tag

>>> from nltk.stem import PorterStemmer

The preprocessing function written below consists of all these steps for convenience. However, we will explain each step in its respective section:

>>> def preprocessing(text):

The following line of code checks each character of the text against the set of standard punctuation marks; punctuation characters are replaced with a blank, while all other characters are left unchanged. The result is then split and rejoined to normalize the whitespace:

... text2 = " ".join("".join([" " if ch in string.punctuation else ch for ch in text]).split())

The following code tokenizes the sentences into words based on whitespaces and puts them together as a list for applying further steps:

... tokens = [word for sent in nltk.sent_tokenize(text2) for word in nltk.word_tokenize(sent)]

Converting all cases (upper, lower, and proper) into lowercase reduces duplicates in the corpus:

... tokens = [word.lower() for word in tokens]

As mentioned earlier, stop words are words that do not carry much weight in understanding the sentence; they are mainly used as connecting words and so on. We remove them with the following lines of code:

... stopwds = stopwords.words('english')

... tokens = [token for token in tokens if token not in stopwds]

The following code keeps only the words with a length of at least 3, removing short words that hardly carry any meaning:

... tokens = [word for word in tokens if len(word)>=3]

Stemming is applied to the words using the Porter stemmer, which strips the extra suffixes from the words:

... stemmer = PorterStemmer()

... tokens = [stemmer.stem(word) for word in tokens]

POS tagging is a prerequisite for lemmatization: based on whether a word is a noun, a verb, and so on, lemmatization reduces it to its root word:

... tagged_corpus = pos_tag(tokens)

The pos_tag function returns the part of speech in four formats for nouns and six formats for verbs: NN (noun, common, singular), NNP (noun, proper, singular), NNPS (noun, proper, plural), NNS (noun, common, plural), VB (verb, base form), VBD (verb, past tense), VBG (verb, present participle), VBN (verb, past participle), VBP (verb, present tense, not 3rd person singular), and VBZ (verb, present tense, 3rd person singular):

... Noun_tags = ['NN','NNP','NNPS','NNS']

... Verb_tags = ['VB','VBD','VBG','VBN','VBP','VBZ']

... lemmatizer = WordNetLemmatizer()

The following function, prat_lemmatize, has been created only to resolve the mismatch between the tags returned by the pos_tag function and the values accepted by the lemmatize function. If the tag of a word falls under the noun or verb tag categories, 'n' or 'v' is applied accordingly in the lemmatize function:

... def prat_lemmatize(token,tag):

...      if tag in Noun_tags:

...          return lemmatizer.lemmatize(token,'n')

...      elif tag in Verb_tags:

...          return lemmatizer.lemmatize(token,'v')

...      else:

...          return lemmatizer.lemmatize(token,'n')

After performing tokenization and applying all the various operations, we need to join the tokens back into a string; the following line performs exactly that:

... pre_proc_text = " ".join([prat_lemmatize(token,tag) for token,tag in tagged_corpus])

... return pre_proc_text
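
To see what the whole function does end to end, here is an illustrative call on a made-up sentence; the exact output may vary slightly with the NLTK version and tagger models installed, so treat it as indicative only:

>>> # Hypothetical example sentence, not taken from the dataset
>>> print (preprocessing("The cars are running quickly!!!"))
car run quickli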

Applying pre-processing on train and test data:

>>> x_train_preprocessed = []

>>> for i in x_train:

... x_train_preprocessed.append(preprocessing(i))

>>> x_test_preprocessed = []

>>> for i in x_test:

... x_test_preprocessed.append(preprocessing(i))

# building TFIDF vectorizer

>>> from sklearn.feature_extraction.text import TfidfVectorizer

>>> vectorizer = TfidfVectorizer(min_df=2, ngram_range=(1, 2), stop_words='english', max_features=10000, strip_accents='unicode', norm='l2')

>>> x_train_2 = vectorizer.fit_transform(x_train_preprocessed).todense()

>>> x_test_2 = vectorizer.transform(x_test_preprocessed).todense()
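
As an illustrative sanity check (assuming the extracted vocabulary exceeds the max_features cap of 10,000), the resulting matrices should have 10,000 columns, matching the input shape of the network defined later, with row counts equal to the train/test observations mentioned earlier:

>>> # Shapes: (number of emails, number of TF-IDF features)
>>> print (x_train_2.shape, x_test_2.shape)
(11314, 10000) (7532, 10000)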

After the pre-processing step has been completed, the processed TF-IDF vectors are passed to the following deep learning code:

# Deep Learning modules

>>> import numpy as np

>>> from keras.models import Sequential

>>> from keras.layers.core import Dense, Dropout, Activation

>>> from keras.optimizers import Adadelta,Adam,RMSprop

>>> from keras.utils import np_utils

The following image shows the output produced after firing up the preceding Keras code. Keras has been installed with Theano as its backend, which in turn runs on Python. A GPU with 6 GB of memory has been used, along with additional libraries (CuDNN and CNMeM) for four to five times faster execution; CNMeM claims around 20% of the memory, so only 80% of the 6 GB is available:

Output after running the preceding Keras code
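
For reference, the following is a minimal sketch of how such a Theano backend setup is commonly requested, assuming an older Theano build with the CNMeM allocator; the exact flags depend on your Theano/CUDA versions, so treat this purely as an illustration:

>>> import os
>>> # Illustrative only: THEANO_FLAGS must be set before Theano/Keras is imported;
>>> # lib.cnmem=0.8 asks CNMeM to pre-allocate ~80% of GPU memory, which matches
>>> # the ~20% "choke" mentioned above
>>> os.environ["THEANO_FLAGS"] = "device=gpu,floatX=float32,lib.cnmem=0.8"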

The following code explains the central part of the deep learning model. The code is self-explanatory: the number of classes considered is 20, the batch size is 64, and the number of epochs to train is 20:

# Defining hyperparameters

>>> np.random.seed(1337)

>>> nb_classes = 20

>>> batch_size = 64

>>> nb_epochs = 20

The following code converts the 20 categories into one-hot encoded vectors, in which 20 columns are created and the value for the respective class is set to 1, while all other classes are set to 0:

>>> Y_train = np_utils.to_categorical(y_train, nb_classes)
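
As a small illustration (not from the original recipe), this is what to_categorical produces for a toy label vector with five classes; the 20-class case works in exactly the same way, although the printed formatting may vary with your NumPy version:

>>> # Toy example: labels 0, 2 and 4 one-hot encoded over 5 classes
>>> np_utils.to_categorical([0, 2, 4], 5)
array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])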

In the following building blocks of the Keras code, three hidden layers (with 1000, 500, and 50 neurons respectively) are used, with 50% dropout after each layer and Adam as the optimizer:

#Deep Layer Model building in Keras

#del model

>>> model = Sequential()

>>> model.add(Dense(1000, input_shape=(10000,)))

>>> model.add(Activation('relu'))

>>> model.add(Dropout(0.5))

>>> model.add(Dense(500))

>>> model.add(Activation('relu'))

>>> model.add(Dropout(0.5))

>>> model.add(Dense(50))

>>> model.add(Activation('relu'))

>>> model.add(Dropout(0.5))

>>> model.add(Dense(nb_classes))

>>> model.add(Activation('softmax'))

>>> model.compile(loss='categorical_crossentropy', optimizer='adam')

>>> print (model.summary())

The architecture is shown as follows and describes the flow of data from the 10,000-dimensional input, through layers of 1000, 500, 50, and finally 20 neurons, to classify the given email into one of the 20 categories:

Model trained
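
As a rough, back-of-the-envelope check of the parameter counts that model.summary() reports (weights plus biases for each Dense layer), purely for illustration:

>>> # Parameters per Dense layer = fan_in * fan_out + fan_out (biases)
>>> layer_dims = [(10000, 1000), (1000, 500), (500, 50), (50, 20)]
>>> print ([fan_in * fan_out + fan_out for fan_in, fan_out in layer_dims])
[10001000, 500500, 25050, 1020]
>>> print (sum(fan_in * fan_out + fan_out for fan_in, fan_out in layer_dims))
10527570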

The model is trained as per the given metrics:

# Model Training

>>> model.fit(x_train_2, Y_train, batch_size=batch_size, epochs=nb_epochs,verbose=1)

The model has been fitted over 20 epochs, each of which took about 2 seconds. The loss was minimized from 1.9281 to 0.0241. On CPU hardware, the time required for training each epoch may increase, as a GPU massively parallelizes the computation with thousands of threads/cores:

Predictions

Finally, predictions are made on the train and test datasets to determine the accuracy, precision, and recall values:

#Model Prediction

>>> y_train_predclass = model.predict_classes(x_train_2,batch_size=batch_size)

>>> y_test_predclass = model.predict_classes(x_test_2,batch_size=batch_size)

>>> from sklearn.metrics import accuracy_score,classification_report

>>> print ("nnDeep Neural Network - Train accuracy:"),(round(accuracy_score( y_train, y_train_predclass),3))

>>> print ("nDeep Neural Network - Test accuracy:"),(round(accuracy_score( y_test,y_test_predclass),3))

>>> print ("nDeep Neural Network - Train Classification Report")

>>> print (classification_report(y_train,y_train_predclass))

>>> print ("nDeep Neural Network - Test Classification Report")

>>> print (classification_report(y_test,y_test_predclass))


It appears that the classifier is giving a good 99.9% accuracy on the train dataset and 80.7% on the test dataset.

We learned the classification of emails using DNNs (deep neural networks) after generating TF-IDF vectors.

If you found this post useful, do check out the book Natural Language Processing with Python Cookbook to further analyze sentence structures and apply various deep learning techniques.
