
Bayes was a Presbyterian minister whose essay on probability, "An Essay towards solving a Problem in the Doctrine of Chances", was published posthumously in 1763, shortly after his death. The interesting fact is that almost a century passed, until after Boole's work on logic, before Bayes' work came to light in the scientific community.

The core of Bayes' study was conditional probability. Without going too deep into mathematical theory, we can define conditional probability as the probability of an event given that another event has occurred.
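In symbols (this is the standard textbook formulation, not anything specific to Mahout), the conditional probability of an event A given an event B, and Bayes' theorem that links the two directions of conditioning, read as follows:

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$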

In this article, we deal with a particular type of algorithm: a classifier. Given a dataset, that is, a set of observations of many variables, a classifier is able to assign a new observation to a particular category. So, for example, consider the following table:

Outlook    Temperature (numeric)   Temperature (nominal)   Humidity (numeric)   Humidity (nominal)   Windy   Play
Overcast   83                      Hot                     86                   High                 FALSE   Yes
Overcast   64                      Cool                    65                   Normal               TRUE    Yes
Overcast   72                      Mild                    90                   High                 TRUE    Yes
Overcast   81                      Hot                     75                   Normal               FALSE   Yes
Rainy      70                      Mild                    96                   High                 FALSE   Yes
Rainy      68                      Cool                    80                   Normal               FALSE   Yes
Rainy      65                      Cool                    70                   Normal               TRUE    No
Rainy      75                      Mild                    80                   Normal               FALSE   Yes
Rainy      71                      Mild                    91                   High                 TRUE    No
Sunny      85                      Hot                     85                   High                 FALSE   No
Sunny      80                      Hot                     90                   High                 TRUE    No
Sunny      72                      Mild                    95                   High                 FALSE   No
Sunny      69                      Cool                    70                   Normal               FALSE   Yes
Sunny      75                      Mild                    70                   Normal               TRUE    Yes

The table is composed of 14 observations described by 7 attributes: outlook, temperature (numeric), temperature (nominal), humidity (numeric), humidity (nominal), windy, and play. The classifier uses some of the observations to train the algorithm and some to test it, so that it can then assign a category to a new observation that is not contained in the original dataset.

There are many types of classifiers that can do this kind of job. Classifier algorithms are part of the supervised learning data-mining tasks that use training data to infer an outcome. The Naïve Bayes classifier relies on the simplifying assumption that, given the category an observation belongs to, its features are independent of one another.
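In formula form (again, the standard statement of the naïve assumption rather than anything Mahout-specific), the probability of a category C given the features x_1, ..., x_n of an observation factorizes as:

$P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$

and the classifier simply picks the category C with the highest resulting value.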

Other types of classifiers present in Mahout are the logistic regression, random forests, and boosting. Refer to the page https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms for more information.

This page is updated with the algorithm type, its actual integration status in Mahout, and other useful information. Moving out of this context, we could describe the Naïve Bayes algorithm as a classification algorithm that uses conditional probability to transform an initial set of weights into a weight matrix, whose entries (row by column) detail the probability that one weight is associated with another. In this article's recipes, we will use the same algorithm provided by the Mahout example source code, which uses the Naïve Bayes classifier to find the relation between the words of a set of documents.

Our recipe can be easily extended to any kind of document or set of documents. We will only use the command line so that, once the environment is set up, it will be easy for you to reproduce our recipe. Our dataset is divided into two parts: the training set and the testing set. The training set is used to teach the algorithm the relations it needs to find, while the testing set is used to check the algorithm on input it has not seen during training. Let us now get a first-hand taste of how to use the Naïve Bayes classifier.

Using the Mahout text classifier to demonstrate the basic use case

The Mahout binaries contain ready-to-use scripts for using and understanding the classical Mahout example dataset, and we will use that dataset for our tests. Basically, our code does nothing more than follow the ready-to-use Mahout script, with the correct parameters and path settings in place. This recipe describes how to transform the raw text files into the weight vectors needed by the Naïve Bayes algorithm to create the model.

The steps involved are the following:

  • Converting the raw text file into a sequence file
  • Creating vector files from the sequence files
  • Creating our working vectors

Getting ready

The first step is to download the dataset, which is freely available at the following link: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz.

For classification purposes, other datasets can be found at the following URL: http://sci2s.ugr.es/keel/category.php?cat=clas#sub2.

The dataset contains posts from 20 newsgroups, each dumped into a text file, for machine learning purposes. We could also have used other documents for testing; we will suggest how to do this later in the recipe.

Before proceeding, we need to set up, in the command line, the working folder where we decompress the original archive; this keeps the commands shorter whenever we need to insert the full path of the folder.

In our case, the working folder is /mnt/new, so we set the corresponding command-line variable using the following command:

export WORK_DIR=/mnt/new/

You can create a new folder and change the WORK_DIR bash variable accordingly.

Do not forget that, for these examples to run, you need to execute the various commands as a user that has the HADOOP_HOME and MAHOUT_HOME variables set in its environment.
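As a quick sanity check (assuming a bash-like shell and that Hadoop and Mahout are already installed), you can verify that the variables are set and create the working folder in one go:

echo "$HADOOP_HOME" "$MAHOUT_HOME"   # both should print non-empty paths
mkdir -p ${WORK_DIR}                 # create the working folder if it does not exist yet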

To download the dataset, we only need to open up a terminal console and give the following command:

wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Once your working dataset is downloaded, decompress it using the following command:

tar -xvzf 20news-bydate.tar.gz

You should see the folder structure as shown in the following screenshot:
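You can also check the structure directly from the console (this assumes the archive decompressed into the usual 20news-bydate-train and 20news-bydate-test folders, each with one subfolder per newsgroup):

ls ${WORK_DIR}
ls ${WORK_DIR}/20news-bydate-train | head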

The second step is to transform the whole set of input files into Hadoop sequence files. To do this, we first merge the two folders into a single one. This is only a pedagogical shortcut: if you have multiple folders containing the input texts, you could also parse them separately by invoking the command multiple times. Using the console, we can group them together by giving the following commands in sequence:

rm -rf ${WORK_DIR}/20news-all
mkdir ${WORK_DIR}/20news-all
cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all

Now, we should have our input folder, which is the 20news-all folder, ready to be used:

The following screenshot shows a bunch of files, all in the same folder:

By looking at one single file, we should see the underlying structure that we will transform. The structure is as follows:

From: xxx
Subject: yyyyy
Organization: zzzz
X-Newsreader: rusnews v1.02
Lines: 50

jaeger@xxx (xxx) writes:
>In article xxx writes:
>>zzzz "How BCCI adapted the Koran rules of banking". The
>>Times. August 13, 1991.
>
> So, let's see. If some guy writes a piece with a title that implies
> something is the case then it must be so, is that it?

We have obviously removed the e-mail addresses, but you can open the file to see its full content. For each of the 20 newsgroups present in the dataset, we have a number of files, each of them containing a single post to that newsgroup, without any categorization.

Following our initial tasks, we now need to transform all these files into Hadoop sequence files. To do this, just type the following command:

./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq

This command takes every file contained in the 20news-all folder and transforms it into a sequence file. As you can see, the number of generated sequence files does not correspond one to one with the number of input files; in our case, the 15417 original text files produce just one chunk-0 file. It is also possible to declare the number of output files and the mappers involved in this data transformation. We invite the reader to test the different parameters and their uses by invoking the following command:

./mahout seqdirectory --help

The following table describes the various options that can be used with the seqdirectory command:

Parameter                                     Description
--input (-i) input                            The path to the job input directory.
--output (-o) output                          The directory pathname for the output.
--overwrite (-ow)                             If present, overwrite the output directory before running the job.
--method (-xm) method                         The execution method to use: sequential or mapreduce. The default is mapreduce.
--chunkSize (-chunk) chunkSize                The chunk size in megabytes. The default is 64 MB.
--fileFilterClass (-filter) fileFilterClass   The name of the class to use for file parsing. The default is org.apache.mahout.text.PrefixAdditionFilter.
--keyPrefix (-prefix) keyPrefix               The prefix to be prepended to the key of the sequence file.
--charset (-c) charset                        The name of the character encoding of the input files. The default is UTF-8.
--help (-h)                                   Prints the help menu to the command console.
--tempDir tempDir                             If specified, tells Mahout to use this as a temporary folder.
--startPhase startPhase                       Defines the first phase that needs to be run.
--endPhase endPhase                           Defines the last phase that needs to be run.
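For instance (this variant is only an illustration and is not required by the recipe), you could combine some of the options listed above to overwrite a previous run, force the sequential execution method, and use a smaller chunk size:

./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow -xm sequential -chunk 32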

To examine the outcome, you can use the Hadoop command-line option fs. So, for example, if you would like to see what is in the chunk-0 file, you could type in the following command:

hadoop fs -text $WORK_DIR/20news-seq/chunk-0 | more

In our case, the result is as follows:

/67399
From: xxx
Subject: Re: Imake-TeX: looking for beta testers
Organization: CS Department, Dortmund University, Germany
Lines: 59
Distribution: world
NNTP-Posting-Host: tommy.informatik.uni-dortmund.de

In article <xxxxx>, yyy writes:
|> As I announced at the X Technical Conference in January, I would like
|> to
|> make Imake-TeX, the Imake support for using the TeX typesetting system,
|> publically available. Currently Imake-TeX is in beta test here at the
|> computer science department of Dortmund University, and I am looking
|> for
|> some more beta testers, preferably with different TeX and Imake
|> installations.

The Hadoop command is pretty simple, and the syntax is as follows:

hadoop fs -text <input file>

In the preceding syntax, <input file> is the sequence file whose content you want to see. Our sequence files have been created but, so far, there has been no analysis of the words and the text itself. The Naïve Bayes algorithm does not work directly on the words and the raw text, but on the weighted vectors associated with the original documents. So now, we need to transform the raw text into vectors of weights and frequencies. To do this, we type in the following command:

./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

The following command parameters are described briefly:

  • The -lnorm parameter instructs the vector to use the L_2 norm as a distance
  • The -nv parameter is an optional parameter that outputs the vector as namedVector
  • The -wt parameter instructs which weight function needs to be used

With this step, we end the data-preparation process. We now have the weight vector files created and ready to be used by the Naïve Bayes algorithm. We will say a little more about this last step later on, as tuning it is key to getting better performance from the Naïve Bayes classifier.

How to do it…

Now that we have generated the weight vectors, we need to give them to the training algorithm. But if we train the classifier against the whole set of data, we will not be able to test the accuracy of the classifier.

To avoid this, you need to divide the vector files into two sets, in what is commonly called an 80-20 split. This is good data-mining practice: whenever an algorithm has to be trained on a dataset, you should divide the whole bunch of data into two sets, one for training and one for testing your algorithm.

A good dividing percentage has been shown to be 80 percent and 20 percent, meaning that the training data should be 80 percent of the total while the testing data should be the remaining 20 percent.

To split data, we use the following command:

./mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

As a result of this command, we will have two new folders containing the training and the testing vectors. Now, it is time to train our Naïve Bayes algorithm on the training set of vectors, and the command to use is pretty easy:

./mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow

Once finished, we have our training model ready to be tested against the remaining 20 percent of the initial input vectors. The final console command is as follows:

./mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing

The following screenshot shows the output of the preceding command:

How it works…

We have given certain commands and seen their outcome, but we have done this without an understanding of why we ran them and, above all, why we chose certain parameters. The whole sequence could look meaningless, even to an experienced coder.

Let us now go a little deeper in each step of our algorithm. Apart from downloading the data, we can divide our Naïve Bayes algorithm into three main steps:

  • Data preparation
  • Data training
  • Data testing

In general, these are the three procedures that should be followed when mining data. The data preparation step involves all the operations needed to put the dataset into the format required by the data-mining procedure. In this case, the original format was a bunch of files containing text, and we transformed them into the sequence file format. The main purpose of this is to have a format that can be handled by the MapReduce algorithm. This phase is a general one, as in most cases the input format is not ready to be used as it is. Sometimes we also need to merge data that is split across different sources, or use Sqoop to extract data from other data sources.

Data training is the crucial part; from the original dataset, we extract the information that is relevant to our data-mining task and use part of it to train our model. In our case, we are trying to classify whether a document can be placed in a certain category based on the frequency of some terms in it. This leads to a classifier that, given another document, can state whether that document falls under a previously found category. The output is a function that is able to determine this association.

Next, we need to evaluate this function, because a classifier that performs well in the learning phase may not perform as well on a different document. This three-phase approach is essential in all classification tasks. The main difference lies in the type of classifier used in the training and testing phases. In this case, we use Naïve Bayes, but other classifiers can be used as well; in the Mahout framework, the available classifiers include Naïve Bayes, Decision Forest, and Logistic Regression.

As we have seen, the data preparation basically consists of creating two series of files that will be used for training and testing purposes. The step that transforms the raw text files into the Hadoop sequence format is pretty easy, so we won't spend too long on it. But the next step is the most important one of the data preparation. Let us recall it:

mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

This computational step basically grabs the whole text from the chunk-0 sequence file and starts parsing it to extract information from the words contained in it. The input parameters tell the utility to work in the following ways:

  • The -i parameter is used to declare the input folder where all the sequence files are stored
  • The -o parameter is used to create the output folder where the vector containing the weights is stored
  • The -nv parameter tells Mahout that the output format should be in the namedVector format
  • The -wt parameter tells which frequency function to use for evaluating the weight of every term to a category
  • The -lnorm parameter is a function used to normalize the weights using the L_2 distance
  • The -ow parameter overwrites the previously generated output results
  • The -m parameter gives the minimum log-likelihood ratio

The whole purpose of this computation step is to transform the sequence files containing the documents' raw text into sequence files containing vectors that measure the frequency of each term. Obviously, there are different functions for weighing the frequency of a term within the whole set of documents; in Mahout, the possible values for the -wt parameter are tf and tfidf. The tf value is the simpler one and just counts the term frequency: the frequency of a term inside the set of documents is the ratio between the total occurrences of that word and the total number of words. The tfidf value also weighs every term frequency with a logarithmic function like this one:

$W_i = \mathrm{TF}_i \cdot \log\left(\frac{N}{\mathrm{DF}_i}\right)$

In the preceding formula, $W_i$ is the TF-IDF weight of the word indexed by i, $\mathrm{TF}_i$ is its term frequency, N is the total number of documents, and $\mathrm{DF}_i$ is the number of documents in which the word i appears.
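To make the effect of the logarithm concrete, here is a small worked example (the base of the logarithm varies between implementations; we use the natural logarithm here): a term with a term frequency of 0.02 that appears in 10 out of 1000 documents gets a weight of

$W_i = 0.02 \cdot \ln\left(\frac{1000}{10}\right) = 0.02 \cdot \ln(100) \approx 0.09$

so rare terms are boosted, while a term appearing in nearly every document is pushed towards zero.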

In this preprocessing phase, notice that we index the whole corpus of documents; this ensures that the weights are not affected by how we divide or split the documents in the next phase. A word's frequency is computed over the whole corpus, regardless of whether the document containing it ends up in the training or the testing set.

The reader should grasp the fact that changing this parameter affects the final weight vectors; based on the same text, we could have very different outcomes.

The -lnorm option basically means that, while a raw weight can range from 0 up to an arbitrarily large positive number, the weights are normalized so that 1 is the maximum possible weight for a word inside the frequency range. The following screenshot shows the content of the output folder:

Various folders are created for storing the word count, frequency, and so on. Basically, this is because the Naïve Bayes classifier works by removing all periods and punctuation marks from the text. Then, from every text, it extracts the categories and the words.

The final vector file can be seen in the tfidf-vectors folder, and for dumping vector files to normal text ones, you can use the vectordump command as follows:

mahout vectordump -i ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000 -o ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000dump

The dictionary files and word files are sequence files containing the association between each word and the unique key created by the MapReduce algorithm. By using the following command:

hadoop fs -text $WORK_DIR/20news-vectors/dictionary.file-0 | more

one can see, for example:

adrenal_gland 12912
adrenaline 12913
adrenaline.com 12914

The splitting of the dataset into training and testing sets is done using the split command of Mahout. The interesting parameter in this case is randomSelectionPct, set here to 40; it uses a random selection to decide which vectors belong to the training set and which to the testing set.
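If you want the strict 80-20 split described earlier, and assuming that randomSelectionPct controls the share of vectors sent to the test set, a variant of the same command would be:

./mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential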

Now comes the interesting part: we are ready to train using the Naïve Bayes algorithm. The output of this algorithm is the model folder, which contains the model in the form of a binary file. This file represents the Naïve Bayes model and holds the weight matrix, the feature and label sums, and the weight normalizer vectors generated so far.

Now that we have the model, we test it against the test vectors generated by the split command. The outcome is shown directly on the command line as a confusion matrix, in the format shown in the following screenshot:

We are now going to provide details on how this matrix should be interpreted. As you can see, we have the total classified instances, which tell us how many documents have been analyzed. Above this, we have the correctly/incorrectly classified instances. In our case, this means that on the test set of weighted vectors, nearly 90 percent of the documents are correctly classified, against an error of about 9 percent.
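As a reminder of how these summary numbers are computed (these are standard definitions, not something specific to Mahout's output), the overall accuracy is simply:

$\text{accuracy} = \frac{\text{correctly classified instances}}{\text{total classified instances}}$

and the error rate is the complementary fraction of incorrectly classified instances.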

But if we go through the matrix row by row, we can see at the end the legend that maps each letter to a newsgroup; so, a is equal to alt.atheism, b is equal to comp.graphics, and so on.

So, a first look at the detailed confusion matrix tells us that we did best at classifying the rec.sport.hockey newsgroup, with a value of 418, the highest we have. If we take a look at the corresponding row, we see that we have 403 out of 412, so roughly 97 percent of the documents were correctly placed in the rec.sport.hockey newsgroup. But if we take a look at the comp.os.ms-windows.misc newsgroup, we can see that the overall performance is lower. The documents are not so concentrated around a single newsgroup; many posts about ms-windows are classified into other newsgroups, so we do not have a good classification for it.

This is reasonable, as sports terms like "hockey" are really limited to the hockey world, while posts about Microsoft can be found both in Microsoft-specific newsgroups and in other newsgroups.

We encourage you to give the testing phase another run, this time against the training vectors, to see the resulting confusion matrix; use the following command:

./bin/mahout testnb -i ${WORK_DIR}/20news-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing

As you can see, the input folder is now the same one used for the training phase, and in this case, we have the following confusion matrix:

In this case, we are using the same set for both the training and the testing phases. The first consequence is that the number of correctly classified documents rises by about 10 percent, which is even more significant if you remember that this set of weighted vectors is four times the size of the one used in the testing phase. But probably the most important thing is that the best-classified newsgroup has now moved from rec.sport.hockey to sci.electronics.

There's more…

We have used exactly the same procedure as the Mahout examples contained in the binaries folder that we downloaded. But you should now be aware that, to start the whole process on different data, you only need to change the input files in the initial folder. So, for the willing reader, we suggest downloading another set of raw text files and performing all the steps on it, to see how the results change compared to the initial input text.

We would also suggest that non-native English readers look at the differences obtained by changing the initial input set to one not written in English. Since the whole text is transformed into weight vectors, the outcome does not depend on the differences between languages but only on the probability of finding certain word pairs.

As a final step, using the same input texts, you could try to change the way the algorithm normalizes and counts the words to create the sparse weight vectors. This can easily be done by changing, for example, the -wt tfidf parameter in the seq2sparse command line. So, for example, an alternative run of seq2sparse could be the following one:

mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tf

Finally, note that we are not limited to running the Naïve Bayes classifier on the words of text documents: the algorithm works on vectors of weights, so it would be easy, for example, to create your own weight vectors and feed them to the same procedure.
