
Bayes was a Presbyterian minister whose essay on probability, "An Essay towards solving a Problem in the Doctrine of Chances", was published posthumously in 1763, shortly after his death. The interesting fact is that almost a century passed, until after Boole's work on logic, before Bayes' work came to light in the scientific community.

The core of Bayes' study was conditional probability. Without going too deep into mathematical theory, we can define conditional probability as the probability of an event given that another event has occurred.
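In symbols (this is the standard textbook formulation, not anything specific to Mahout), the conditional probability of an event A given an event B, and Bayes' theorem that links the two directions of conditioning, read as follows:

$P(A \mid B) = \frac{P(A \cap B)}{P(B)}, \qquad P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$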

In this article, we deal with a particular type of algorithm: a classifier. Given a dataset, that is, a set of observations of many variables, a classifier is able to assign a new observation to a particular category. So, for example, consider the following table:

Outlook    Temperature (numeric)   Temperature (nominal)   Humidity (numeric)   Humidity (nominal)   Windy   Play
Overcast   83                      Hot                     86                   High                 FALSE   Yes
Overcast   64                      Cool                    65                   Normal               TRUE    Yes
Overcast   72                      Mild                    90                   High                 TRUE    Yes
Overcast   81                      Hot                     75                   Normal               FALSE   Yes
Rainy      70                      Mild                    96                   High                 FALSE   Yes
Rainy      68                      Cool                    80                   Normal               FALSE   Yes
Rainy      65                      Cool                    70                   Normal               TRUE    No
Rainy      75                      Mild                    80                   Normal               FALSE   Yes
Rainy      71                      Mild                    91                   High                 TRUE    No
Sunny      85                      Hot                     85                   High                 FALSE   No
Sunny      80                      Hot                     90                   High                 TRUE    No
Sunny      72                      Mild                    95                   High                 FALSE   No
Sunny      69                      Cool                    70                   Normal               FALSE   Yes
Sunny      75                      Mild                    70                   Normal               TRUE    Yes

The table is composed of 14 observations described by 7 attributes: outlook, temperature (numeric), temperature (nominal), humidity (numeric), humidity (nominal), windy, and play. The classifier uses some of the observations to train the algorithm and some to test it, so that it can then assign a category to a new observation that is not contained in the original dataset.

There are many types of classifiers that can do this kind of job. Classifier algorithms are part of the supervised learning data-mining tasks that use training data to infer an outcome. The Naïve Bayes classifier relies on the simplifying assumption that, given the category an observation belongs to, its features are independent of one another.
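In formula form (again, the standard statement of the naïve assumption rather than anything Mahout-specific), the probability of a category C given the features x_1, ..., x_n of an observation factorizes as:

$P(C \mid x_1, \ldots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)$

and the classifier simply picks the category C with the highest resulting value.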

Other types of classifiers present in Mahout are the logistic regression, random forests, and boosting. Refer to the page https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms for more information.

This page is updated with the algorithm type, its actual integration status in Mahout, and other useful information. Moving out of this context, we could describe the Naïve Bayes algorithm as a classification algorithm that uses conditional probability to transform an initial set of weights into a weight matrix, whose entries (row by column) detail the probability that one weight is associated with another. In this article's recipes, we will use the same algorithm provided by the Mahout example source code, which uses the Naïve Bayes classifier to find the relation between the words of a set of documents.

Our recipe can be easily extended to any kind of document or set of documents. We will only use the command line so that, once the environment is set up, it will be easy for you to reproduce our recipe. Our dataset is divided into two parts: the training set and the testing set. The training set is used to teach the algorithm the relations it needs to find, while the testing set is used to check the algorithm on input it has not seen during training. Let us now get a first-hand taste of how to use the Naïve Bayes classifier.

Using the Mahout text classifier to demonstrate the basic use case

The Mahout binaries contain ready-to-use scripts for using and understanding the classical Mahout example dataset, and we will use that dataset for our tests. Basically, our code does nothing more than follow the ready-to-use Mahout script, with the correct parameters and path settings in place. This recipe describes how to transform the raw text files into the weight vectors needed by the Naïve Bayes algorithm to create the model.

The steps involved are the following:

  • Converting the raw text file into a sequence file
  • Creating vector files from the sequence files
  • Creating our working vectors

Getting ready

The first step is to download the dataset, which is freely available at the following link: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz.

For classification purposes, other datasets can be found at the following URL: http://sci2s.ugr.es/keel/category.php?cat=clas#sub2.

The dataset contains posts from 20 newsgroups, each dumped into a text file, for machine learning purposes. We could also have used other documents for testing; we will suggest how to do this later in the recipe.

Before proceeding, we need to set up, in the command line, the working folder where we decompress the original archive; this keeps the commands shorter whenever we need to insert the full path of the folder.

In our case, the working folder is /mnt/new, so we set the corresponding command-line variable using the following command:

export WORK_DIR=/mnt/new/

You can create a new folder and change the WORK_DIR bash variable accordingly.

Do not forget that, for these examples to run, you need to execute the various commands as a user that has the HADOOP_HOME and MAHOUT_HOME variables set in its environment.
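As a quick sanity check (assuming a bash-like shell and that Hadoop and Mahout are already installed), you can verify that the variables are set and create the working folder in one go:

echo "$HADOOP_HOME" "$MAHOUT_HOME"   # both should print non-empty paths
mkdir -p ${WORK_DIR}                 # create the working folder if it does not exist yet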

To download the dataset, we only need to open up a terminal console and give the following command:

wget http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

Once your working dataset is downloaded, decompress it using the following command:

tar -xvzf 20news-bydate.tar.gz

You should see the folder structure as shown in the following screenshot:
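You can also check the structure directly from the console (this assumes the archive decompressed into the usual 20news-bydate-train and 20news-bydate-test folders, each with one subfolder per newsgroup):

ls ${WORK_DIR}
ls ${WORK_DIR}/20news-bydate-train | head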

The second step is to transform the whole set of input files into Hadoop sequence files. To do this, we first merge the two folders into a single one. This is only a pedagogical shortcut: if you have multiple folders containing the input texts, you could also parse them separately by invoking the command multiple times. Using the console, we can group them together by giving the following commands in sequence:

rm -rf ${WORK_DIR}/20news-all
mkdir ${WORK_DIR}/20news-all
cp -R ${WORK_DIR}/20news-bydate*/*/* ${WORK_DIR}/20news-all

Now, we should have our input folder, which is the 20news-all folder, ready to be used:

The following screenshot shows a bunch of files, all in the same folder:

By looking at one single file, we should see the underlying structure that we will transform. The structure is as follows:

From: xxx
Subject: yyyyy
Organization: zzzz
X-Newsreader: rusnews v1.02
Lines: 50

jaeger@xxx (xxx) writes:
>In article xxx writes:
>>zzzz "How BCCI adapted the Koran rules of banking". The
>>Times. August 13, 1991.
>
> So, let's see. If some guy writes a piece with a title that implies
> something is the case then it must be so, is that it?

We have obviously removed the e-mail addresses, but you can open the file to see its full content. For each of the 20 newsgroups present in the dataset, we have a number of files, each of them containing a single post to that newsgroup, without any categorization.

Following our initial tasks, we now need to transform all these files into Hadoop sequence files. To do this, just type the following command:

./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq

This command takes every file contained in the 20news-all folder and transforms it into a sequence file. As you can see, the number of generated sequence files does not correspond one to one with the number of input files; in our case, the 15417 original text files produce just one chunk-0 file. It is also possible to declare the number of output files and the mappers involved in this data transformation. We invite the reader to test the different parameters and their uses by invoking the following command:

./mahout seqdirectory --help

The following table describes the various options that can be used with the seqdirectory command:

Parameter                                     Description
--input (-i) input                            The path to the job input directory.
--output (-o) output                          The directory pathname for the output.
--overwrite (-ow)                             If present, overwrite the output directory before running the job.
--method (-xm) method                         The execution method to use: sequential or mapreduce. The default is mapreduce.
--chunkSize (-chunk) chunkSize                The chunk size in megabytes. The default is 64 MB.
--fileFilterClass (-filter) fileFilterClass   The name of the class to use for file parsing. The default is org.apache.mahout.text.PrefixAdditionFilter.
--keyPrefix (-prefix) keyPrefix               The prefix to be prepended to the key of the sequence file.
--charset (-c) charset                        The name of the character encoding of the input files. The default is UTF-8.
--help (-h)                                   Prints the help menu to the command console.
--tempDir tempDir                             If specified, tells Mahout to use this as a temporary folder.
--startPhase startPhase                       Defines the first phase that needs to be run.
--endPhase endPhase                           Defines the last phase that needs to be run.
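For instance (this variant is only an illustration and is not required by the recipe), you could combine some of the options listed above to overwrite a previous run, force the sequential execution method, and use a smaller chunk size:

./mahout seqdirectory -i ${WORK_DIR}/20news-all -o ${WORK_DIR}/20news-seq -ow -xm sequential -chunk 32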

To examine the outcome, you can use the Hadoop command-line option fs. So, for example, if you would like to see what is in the chunk-0 file, you could type in the following command:

hadoop fs -text $WORK_DIR/20news-seq/chunk-0 | more

In our case, the result is as follows:

/67399
From: xxx
Subject: Re: Imake-TeX: looking for beta testers
Organization: CS Department, Dortmund University, Germany
Lines: 59
Distribution: world
NNTP-Posting-Host: tommy.informatik.uni-dortmund.de

In article <xxxxx>, yyy writes:
|> As I announced at the X Technical Conference in January, I would like
|> to
|> make Imake-TeX, the Imake support for using the TeX typesetting system,
|> publically available. Currently Imake-TeX is in beta test here at the
|> computer science department of Dortmund University, and I am looking
|> for
|> some more beta testers, preferably with different TeX and Imake
|> installations.

The Hadoop command is pretty simple, and the syntax is as follows:

hadoop fs -text <input file>

In the preceding syntax, <input file> is the sequence file whose content you want to see. Our sequence files have been created but, so far, there has been no analysis of the words and the text itself. The Naïve Bayes algorithm does not work directly on the words and the raw text, but on the weighted vectors associated with the original documents. So now, we need to transform the raw text into vectors of weights and frequencies. To do this, we type in the following command:

./mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

The following command parameters are described briefly:

  • The -lnorm parameter instructs the vector to use the L_2 norm as a distance
  • The -nv parameter is an optional parameter that outputs the vector as namedVector
  • The -wt parameter instructs which weight function needs to be used

With this step, we end the data-preparation process. We now have the weight vector files created and ready to be used by the Naïve Bayes algorithm. We will say a little more about this last step later on, as tuning it is key to getting better performance from the Naïve Bayes classifier.

How to do it…

Now that we have generated the weight vectors, we need to give them to the training algorithm. But if we train the classifier against the whole set of data, we will not be able to test the accuracy of the classifier.

To avoid this, you need to divide the vector files into two sets, in what is commonly called an 80-20 split. This is good data-mining practice: whenever an algorithm has to be trained on a dataset, you should divide the whole bunch of data into two sets, one for training and one for testing your algorithm.

A good dividing percentage has been shown to be 80 percent and 20 percent, meaning that the training data should be 80 percent of the total while the testing data should be the remaining 20 percent.

To split data, we use the following command:

./mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

As a result of this command, we will have two new folders containing the training and the testing vectors. Now, it is time to train our Naïve Bayes algorithm on the training set of vectors, and the command to use is pretty easy:

./mahout trainnb -i ${WORK_DIR}/20news-train-vectors -el -o ${WORK_DIR}/model -li ${WORK_DIR}/labelindex -ow

Once finished, we have our training model ready to be tested against the remaining 20 percent of the initial input vectors. The final console command is as follows:

./mahout testnb -i ${WORK_DIR}/20news-test-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing

The following screenshot shows the output of the preceding command:

How it works…

We have given certain commands and seen their outcome, but we have done this without an understanding of why we ran them and, above all, why we chose certain parameters. The whole sequence could look meaningless, even to an experienced coder.

Let us now go a little deeper in each step of our algorithm. Apart from downloading the data, we can divide our Naïve Bayes algorithm into three main steps:

  • Data preparation
  • Data training
  • Data testing

In general, these are the three procedures that should be followed when mining data. The data preparation step involves all the operations needed to put the dataset into the format required by the data-mining procedure. In this case, the original format was a bunch of files containing text, and we transformed them into the sequence file format. The main purpose of this is to have a format that can be handled by the MapReduce algorithm. This phase is a general one, as in most cases the input format is not ready to be used as it is. Sometimes we also need to merge data that is split across different sources, or use Sqoop to extract data from other data sources.

Data training is the crucial part; from the original dataset, we extract the information that is relevant to our data-mining task and use part of it to train our model. In our case, we are trying to classify whether a document can be placed in a certain category based on the frequency of some terms in it. This leads to a classifier that, given another document, can state whether that document falls under a previously found category. The output is a function that is able to determine this association.

Next, we need to evaluate this function, because a classifier that performs well in the learning phase may not perform as well on a different document. This three-phase approach is essential in all classification tasks. The main difference lies in the type of classifier used in the training and testing phases. In this case, we use Naïve Bayes, but other classifiers can be used as well; in the Mahout framework, the available classifiers include Naïve Bayes, Decision Forest, and Logistic Regression.

As we have seen, the data preparation basically consists of creating two series of files that will be used for training and testing purposes. The step that transforms the raw text files into the Hadoop sequence format is pretty easy, so we won't spend too long on it. But the next step is the most important one of the data preparation. Let us recall it:

mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tfidf

This computational step basically grabs the whole text from the chunk-0 sequence file and starts parsing it to extract information from the words contained in it. The input parameters tell the utility to work in the following ways:

  • The -i parameter is used to declare the input folder where all the sequence files are stored
  • The -o parameter is used to create the output folder where the vector containing the weights is stored
  • The -nv parameter tells Mahout that the output format should be in the namedVector format
  • The -wt parameter tells which frequency function to use for evaluating the weight of every term to a category
  • The -lnorm parameter is a function used to normalize the weights using the L_2 distance
  • The -ow parameter overwrites the previously generated output results
  • The -m parameter gives the minimum log-likelihood ratio

The whole purpose of this computation step is to transform the sequence files containing the documents' raw text into sequence files containing vectors that measure the frequency of each term. Obviously, there are different functions for weighing the frequency of a term within the whole set of documents; in Mahout, the possible values for the -wt parameter are tf and tfidf. The tf value is the simpler one and just counts the term frequency: the frequency of a term inside the set of documents is the ratio between the total occurrences of that word and the total number of words. The tfidf value also weighs every term frequency with a logarithmic function like this one:

$W_i = \mathrm{TF}_i \cdot \log\left(\frac{N}{\mathrm{DF}_i}\right)$

In the preceding formula, $W_i$ is the TF-IDF weight of the word indexed by i, $\mathrm{TF}_i$ is its term frequency, N is the total number of documents, and $\mathrm{DF}_i$ is the number of documents in which the word i appears.
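To make the effect of the logarithm concrete, here is a small worked example (the base of the logarithm varies between implementations; we use the natural logarithm here): a term with a term frequency of 0.02 that appears in 10 out of 1000 documents gets a weight of

$W_i = 0.02 \cdot \ln\left(\frac{1000}{10}\right) = 0.02 \cdot \ln(100) \approx 0.09$

so rare terms are boosted, while a term appearing in nearly every document is pushed towards zero.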

In this preprocessing phase, notice that we index the whole corpus of documents; this ensures that the weights are not affected by how we divide or split the documents in the next phase. A word's frequency is computed over the whole corpus, regardless of whether the document containing it ends up in the training or the testing set.

The reader should grasp the fact that changing this parameter affects the final weight vectors; based on the same text, we could have very different outcomes.

The -lnorm option basically means that, while a raw weight can range from 0 up to an arbitrarily large positive number, the weights are normalized so that 1 is the maximum possible weight for a word inside the frequency range. The following screenshot shows the content of the output folder:

Various folders are created for storing the word count, frequency, and so on. Basically, this is because the Naïve Bayes classifier works by removing all periods and punctuation marks from the text. Then, from every text, it extracts the categories and the words.

The final vector file can be seen in the tfidf-vectors folder, and for dumping vector files to normal text ones, you can use the vectordump command as follows:

mahout vectordump -i ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000 -o ${WORK_DIR}/20news-vectors/tfidf-vectors/part-r-00000dump

The dictionary files and word files are sequence files containing the association between each word and the unique key created by the MapReduce algorithm. By using the following command:

hadoop fs -text $WORK_DIR/20news-vectors/dictionary.file-0 | more

one can see, for example:

adrenal_gland 12912
adrenaline 12913
adrenaline.com 12914

The splitting of the dataset into training and testing sets is done using the split command of Mahout. The interesting parameter in this case is randomSelectionPct, set here to 40; it uses a random selection to decide which vectors belong to the training set and which to the testing set.
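If you want the strict 80-20 split described earlier, and assuming that randomSelectionPct controls the share of vectors sent to the test set, a variant of the same command would be:

./mahout split -i ${WORK_DIR}/20news-vectors/tfidf-vectors --trainingOutput ${WORK_DIR}/20news-train-vectors --testOutput ${WORK_DIR}/20news-test-vectors --randomSelectionPct 20 --overwrite --sequenceFiles -xm sequential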

Now comes the interesting part: we are ready to train using the Naïve Bayes algorithm. The output of this algorithm is the model folder, which contains the model in the form of a binary file. This file represents the Naïve Bayes model and holds the weight matrix, the feature and label sums, and the weight normalizer vectors generated so far.

Now that we have the model, we test it against the test vectors generated by the split command. The outcome is shown directly on the command line as a confusion matrix, in the format shown in the following screenshot:

We are now going to provide details on how this matrix should be interpreted. As you can see, we have the total classified instances, which tell us how many documents have been analyzed. Above this, we have the correctly/incorrectly classified instances. In our case, this means that on the test set of weighted vectors, nearly 90 percent of the documents are correctly classified, against an error of about 9 percent.
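As a reminder of how these summary numbers are computed (these are standard definitions, not something specific to Mahout's output), the overall accuracy is simply:

$\text{accuracy} = \frac{\text{correctly classified instances}}{\text{total classified instances}}$

and the error rate is the complementary fraction of incorrectly classified instances.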

But if we go through the matrix row by row, we can see at the end the legend that maps each letter to a newsgroup; so, a is equal to alt.atheism, b is equal to comp.graphics, and so on.

So, a first look at the detailed confusion matrix tells us that we did best at classifying the rec.sport.hockey newsgroup, with a value of 418, the highest we have. If we take a look at the corresponding row, we see that we have 403 out of 412, so roughly 97 percent of the documents were correctly placed in the rec.sport.hockey newsgroup. But if we take a look at the comp.os.ms-windows.misc newsgroup, we can see that the overall performance is lower. The documents are not so concentrated around a single newsgroup; many posts about ms-windows are classified into other newsgroups, so we do not have a good classification for it.

This is reasonable, as sports terms like "hockey" are really limited to the hockey world, while posts about Microsoft can be found both in Microsoft-specific newsgroups and in other newsgroups.

We encourage you to give the testing phase another run, this time against the training vectors, to see the resulting confusion matrix; use the following command:

./bin/mahout testnb -i ${WORK_DIR}/20news-train-vectors -m ${WORK_DIR}/model -l ${WORK_DIR}/labelindex -ow -o ${WORK_DIR}/20news-testing

As you can see, the input folder is now the same one used for the training phase, and in this case, we have the following confusion matrix:

In this case, we are using the same set for both the training and the testing phases. The first consequence is that the number of correctly classified documents rises by about 10 percent, which is even more significant if you remember that this set of weighted vectors is four times the size of the one used in the testing phase. But probably the most important thing is that the best-classified newsgroup has now moved from rec.sport.hockey to sci.electronics.

There's more…

We have used exactly the same procedure as the Mahout examples contained in the binaries folder that we downloaded. But you should now be aware that, to start the whole process on different data, you only need to change the input files in the initial folder. So, for the willing reader, we suggest downloading another set of raw text files and performing all the steps on it, to see how the results change compared to the initial input text.

We would also suggest that non-native English readers look at the differences obtained by changing the initial input set to one not written in English. Since the whole text is transformed into weight vectors, the outcome does not depend on the differences between languages but only on the probability of finding certain word pairs.

As a final step, using the same input texts, you could try to change the way the algorithm normalizes and counts the words to create the sparse weight vectors. This can easily be done by changing, for example, the -wt tfidf parameter in the seq2sparse command line. So, for example, an alternative run of seq2sparse could be the following one:

mahout seq2sparse -i ${WORK_DIR}/20news-seq -o ${WORK_DIR}/20news-vectors -lnorm -nv -wt tf

Finally, note that we are not limited to running the Naïve Bayes classifier on the words of text documents: the algorithm works on vectors of weights, so it would be easy, for example, to create your own weight vectors and feed them to the same procedure.
