Supervised learning for classification
Like clustering, classification is about categorizing data instances, but in this case, the categories are known in advance and are termed class labels. The aim is to identify the category that a new data point belongs to, using a dataset with known class labels to find the pattern. Classification is an instance of supervised learning: the learning algorithm takes a known set of input data and the corresponding responses (the class labels) and builds a predictor model that generates reasonable predictions for the class labels of unseen data. To illustrate, let’s imagine that we have gene expression data from cancer patients as well as healthy patients. The gene expression pattern in these samples can define whether the patient has cancer or not. Similarly, if we have a set of samples for which we know the type of tumor, the data can be used to learn a model that identifies the type of tumor. In simple terms, the model is a predictive function used to determine the tumor type. Later, this model can be applied to predict the type of tumor in unknown cases.
There are some do’s and don’ts to keep in mind while learning a classifier. You need to make sure that you have enough data to learn the model. A small dataset does not allow the model to learn the pattern in an unbiased manner, and you will end up with an inaccurate classification. Furthermore, the preprocessing steps (such as normalization) for the training and test data should be the same. It is equally important to keep the training and test data distinct. Learning on the entire dataset and then testing on a part of the same data gives an overly optimistic accuracy estimate and hides a phenomenon called overfitting, where the model fits the training data closely but generalizes poorly to new data. It is always recommended that you take a look at the data manually and understand the question that you need to answer via your classifier.
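The advice on identical preprocessing and distinct sets can be sketched in base R as follows. This is a minimal illustration under stated assumptions: the matrix `x` is simulated toy data, and the names `trainIdx`, `trainX`, `testX`, `mu`, and `sigma` are hypothetical and not part of the recipe. The normalization parameters are estimated on the training rows only and then reused for the test rows.

```r
set.seed(1)  # make the toy example reproducible

# Simulated numeric data standing in for expression values
# (hypothetical; this is not the cancer dataset)
x <- matrix(rnorm(100 * 5, mean = 10, sd = 3), nrow = 100,
            dimnames = list(NULL, paste0("gene", 1:5)))

# Keep the two sets distinct: 70 rows for training, the other 30 for testing
trainIdx <- sample(nrow(x), 70)
trainX <- x[trainIdx, ]
testX  <- x[-trainIdx, ]

# Estimate the normalization parameters on the training data only
mu    <- colMeans(trainX)
sigma <- apply(trainX, 2, sd)

# Apply exactly the same centering and scaling to both sets
trainScaled <- scale(trainX, center = mu, scale = sigma)
testScaled  <- scale(testX,  center = mu, scale = sigma)

# Training columns are centered at zero by construction;
# test columns are only approximately centered
round(colMeans(trainScaled), 10)
round(colMeans(testScaled), 2)
```

Scaling the test set with its own mean and standard deviation would apply a different transformation to each set, which is exactly what the advice above warns against.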
There are several methods of classification. In this recipe, we will talk about some of these methods. We will discuss linear discriminant analysis (LDA), decision tree (DT), and support vector machine (SVM).
To perform the classification task, we need two things: a dataset with known class labels (the training set) and the data that the classifier will be tested on (the test set). Besides this, we will use some R packages, which will be discussed when required. As a dataset, we will use the expression values of approximately 2,300 genes from tumor cells. The data has 83 data points covering four different types of tumors, which will be used as our class labels. We will use 60 of the data points for training and the remaining 23 for testing. To find out more about the dataset, refer to the Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks article by Khan and others (http://research.nhgri.nih.gov/microarray/Supplement/). The set has been precompiled in a format that is readily usable in R and is available on the book’s web page (code files) under the name cancer.rda.
How to do it…
To classify data points based on their features, perform the following steps:
- First, load the MASS library, as it provides some of the classification functions:
> library(MASS)
- Now, you need your data to learn and test the classifiers. Load the data from the code files available on the book’s web page (cancer.rda) as follows:
> load ("path/to/code/directory/cancer.rda") # located in the code file directory for the chapter, assign the path accordingly
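- Randomly sample 60 data points for the training set and keep the remaining 23 for the test set. Make sure that the two sets do not overlap and are not biased towards any specific tumor type (hence the random sampling):
> trainIdx <- sample(seq_len(nrow(mldata)), 60) # mldata is an assumed name for the data frame loaded from cancer.rda; check it with ls()
> train <- mldata[trainIdx, ] # the 60 randomly sampled data points
> test <- mldata[-trainIdx, ] # the remaining data points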
- For the training data, retain the class labels, which are the tumor columns here, and remove this information from the test data. However, store this information for comparison purposes:
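> testClass <- test$tumor # store the true labels for later comparison
> test$tumor <- NULL # remove the labels from the test data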
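- Now, try the linear discriminant analysis classifier, as follows, to get the classifier model:
> myLDA <- lda(tumor ~ ., data = train) # the object name myLDA is an assumption
- Test this classifier to predict the labels on your test set, as follows:
> testRes_lda <- predict(myLDA, test)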
- To check the number of correct and incorrect predictions, simply compare the predicted classes with the testClass object, which was created in step 4, as follows:
> sum(testRes_lda$class == testClass) # correct predictions
[1] 19
> sum(testRes_lda$class != testClass) # incorrect predictions
[1] 4
- Now, try another simple classifier called DT. For this, you need the rpart package:
> library(rpart)
- Create the decision tree based on your training data, as follows:
> myDT <- rpart(tumor ~ ., data = train) # assumed call with default settings
- Plot your tree by typing the following commands, as shown in the next diagram:
> plot(myDT)
> text(myDT, use.n=TRUE)
The following screenshot shows the cut off for each feature (represented by the branches) to differentiate between the classes:
The tree for DT-based learning
- Now, test the decision tree classifier on your test data using the following prediction function:
> testRes_dt <- predict(myDT, newdata = test) # the object name testRes_dt is an assumption
- Take a look at the class that each data instance is assigned to by the classifier, as follows (1 if predicted in the class, else 0):
> classes <- round(testRes_dt) # assumed: round the class probabilities to 0/1
> head(classes)
   BL EW NB RM
4   0  0  0  1
10  0  0  0  1
15  1  0  0  0
16  0  0  1  0
18  0  1  0  0
21  0  1  0  0
- Finally, you’ll work with SVMs. To be able to use them, you need another R package named e1071, which you load as follows:
> library(e1071)
- Create the SVM classifier from the training data as follows:
> mySVM <- svm(tumor ~ ., data = train) # assumed call with default settings
- Then, use the learned model (the mySVM object) to predict the classes for the test data. You will see the predicted labels for each instance as follows:
> testRes_svm <- predict(mySVM, test)
> testRes_svm
How it works…
We started our recipe by loading the input data on tumors. The supervised learning methods we saw in the recipe use two datasets: the training set and the test set. The training set carries the class label information. In the first part of most of the learning methods shown here, the training set is used to identify a pattern and to model it in order to find a distinction between the classes. This model is then applied to the test set, which does not carry the class labels, to predict them. To create the training and test sets, we first randomly sample 60 indexes out of the entire data and use the remaining 23 data points for testing purposes.
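The splitting logic described above can be sketched on R's built-in iris data, which stands in for the tumor data here (an assumption made so that the example runs without cancer.rda). The names `trainIdx`, `train`, and `test` and the 100/50 split are illustrative choices.

```r
set.seed(42)                          # fix the seed so the split is reproducible

n <- nrow(iris)                       # 150 rows, 3 species (stand-in for the tumor types)
trainIdx <- sample(seq_len(n), 100)   # sample 100 row indexes without replacement

train <- iris[trainIdx, ]             # rows at the sampled indexes
test  <- iris[-trainIdx, ]            # negative indexing keeps all remaining rows

# The two sets must not share any row
length(intersect(rownames(train), rownames(test)))

# Check that no class is badly under-represented in either set
table(train$Species)
table(test$Species)
```

If the class counts turn out to be very uneven, one option is to sample the same fraction of rows within each class instead of sampling from the whole dataset at once.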
The supervised learning methods explained in this recipe follow different principles. LDA attempts to model the difference between classes based on linear combinations of the features. These combinations are estimated from the training set and form the model that is used to predict the classes in the test set. Here, the LDA model trained on 60 samples is used to predict the classes of the remaining 23 cases.
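To make the idea of a linear combination concrete, the following minimal sketch fits an LDA model with `lda` from MASS on the built-in iris data, used here as a stand-in for the tumor data (an assumption). The object names `fit` and `pred` are illustrative.

```r
library(MASS)   # provides lda()

set.seed(42)
trainIdx <- sample(seq_len(nrow(iris)), 100)
train <- iris[trainIdx, ]
test  <- iris[-trainIdx, ]

# Fit LDA: Species is the class label, all other columns are features
fit <- lda(Species ~ ., data = train)

# Coefficients of the linear discriminants: one row per feature,
# one column per discriminant (number of classes minus one here)
fit$scaling

# predict() returns a list; $class holds the predicted labels
pred <- predict(fit, test)
sum(pred$class == test$Species)   # number of correct predictions
sum(pred$class != test$Species)   # number of incorrect predictions
```

Each column of `fit$scaling` is one linear combination of the features; projecting the data onto these directions is what separates the classes.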
DT is, however, a different method. It builds a classification tree whose branches form a set of rules (cut-offs on individual features) that distinguish one class from another. The tree learned on a training set is then applied to predict classes in test sets or other similar datasets.
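The rule-based nature of a decision tree can be seen in the following minimal sketch, which uses `rpart` on the built-in iris data as a stand-in for the tumor data (an assumption). The names `tree`, `probs`, and `labels` are illustrative.

```r
library(rpart)   # provides rpart() and its predict method

set.seed(42)
trainIdx <- sample(seq_len(nrow(iris)), 100)
train <- iris[trainIdx, ]
test  <- iris[-trainIdx, ]

# Grow a classification tree; method = "class" makes the tree type explicit
tree <- rpart(Species ~ ., data = train, method = "class")

# Printing the tree lists the splitting rule at each node
print(tree)

# By default, predict() returns a matrix of class probabilities
probs <- predict(tree, newdata = test)
head(round(probs))

# type = "class" returns the predicted label directly
labels <- predict(tree, newdata = test, type = "class")
mean(labels == test$Species)   # fraction of correct predictions
```

Rounding the probability matrix gives a 0/1 indicator per class, while `type = "class"` is usually more convenient when the labels are to be compared with the known classes.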
SVM is a relatively complex classification technique. It aims to find one or more hyperplanes in the feature space that separate the data points of different classes with the largest possible margin. The hyperplanes are learned on a training set and are then used to assign classes to new data points. In general, LDA separates the classes using linear combinations of the features, whereas an SVM can use a kernel function to map the data into a higher-dimensional space in which a separating hyperplane is easier to find. In this recipe, we used the svm function from the e1071 package, which has many other utilities for learning.
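A minimal sketch of an SVM classifier with the `svm` function from e1071 is shown below, again on the built-in iris data as a stand-in for the tumor data (an assumption). It requires the e1071 package to be installed. The linear kernel is chosen here only to keep the link to a separating hyperplane direct; the default kernel of `svm` is radial. The names `fit` and `pred` are illustrative.

```r
library(e1071)   # provides svm(); install.packages("e1071") if it is missing

set.seed(42)
trainIdx <- sample(seq_len(nrow(iris)), 100)
train <- iris[trainIdx, ]
test  <- iris[-trainIdx, ]

# Train an SVM classifier with a linear kernel
fit <- svm(Species ~ ., data = train, kernel = "linear")

# Predict the labels of the held-out rows
pred <- predict(fit, test)

# Cross-tabulate predicted against actual labels
table(predicted = pred, actual = test$Species)
```

With more than two classes, `svm` combines several binary classifiers internally, which is why the description above speaks of one or more hyperplanes.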
We can compare the results obtained by the models we used in this recipe (they can be computed using the provided code on the book’s web page).
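One simple way to make such a comparison is to compute the fraction of correct predictions for each classifier and to cross-tabulate predictions against the known labels. The following base R sketch uses short, made-up label vectors (hypothetical values, not results from the tumor data) and an illustrative helper named `accuracy`.

```r
# Fraction of predictions that match the known labels
accuracy <- function(predicted, actual) {
  mean(as.character(predicted) == as.character(actual))
}

# Hypothetical true labels and predictions from three classifiers
actual <- c("BL", "EW", "NB", "RM", "EW", "NB")
preds <- list(
  lda = c("BL", "EW", "NB", "RM", "EW", "NB"),
  dt  = c("BL", "EW", "NB", "EW", "EW", "RM"),
  svm = c("BL", "EW", "NB", "RM", "EW", "RM")
)

# One accuracy value per classifier
acc <- sapply(preds, accuracy, actual = actual)
acc

# A confusion table shows which classes get mixed up
table(predicted = preds$dt, actual = actual)
```

In the recipe itself, the same helper could be applied to objects such as `testRes_lda$class` and `testRes_svm`, with `testClass` as the known labels.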
There’s more…
One of the most popular classifier tools in the machine learning community is WEKA. It is a Java-based tool that implements many algorithms for classification tasks, such as DT, LDA, Random Forest, and so on. R provides an interface to WEKA through a package named RWeka, which is available on the CRAN repository at http://cran.r-project.org/web/packages/RWeka/ . RWeka relies on a separate package, RWekajars, which provides the Java libraries that implement the different classifiers.
See also
- The Elements of Statistical Learning book by Hastie, Tibshirani, and Friedman at http://statweb.stanford.edu/~tibs/ElemStatLearn/printings/ESLII_print10.pdf, which provides more information on LDA, DT, and SVM