# Using cross-validation


To begin with, cross-validation is a common validation technique that can be used to evaluate machine learning models. Cross-validation essentially measures how well an estimated model will generalize to data other than its training data. This data is called the cross-validation set, or simply the validation set, of our model. Cross-validation of a given model is also called rotation estimation.

If an estimated model performs well during cross-validation, we can assume that the model has captured the relationship between its various independent and dependent variables. The goal of cross-validation is to provide a test to determine whether a formulated model is overfit to the training data. From an implementation perspective, cross-validation is a kind of unit test for a machine learning system.

A single round of cross-validation generally involves partitioning all the available sample data into two subsets and then performing training on one subset and validation and/or testing on the other subset. Several such rounds, or folds, of cross-validation must be performed using different sets of data to reduce the variance of the overall cross-validation error of the given model. Any particular measure of the cross-validation error should be calculated as the average of this error over the different folds in cross-validation.
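To make the averaging concrete, here is a minimal sketch in Clojure (cross-validation-error is an illustrative helper of our own, not part of any library):

```
;; The overall cross-validation error is the mean of the per-fold
;; errors, as described above. `cross-validation-error` is an
;; illustrative helper, not part of any library.
(defn cross-validation-error [fold-errors]
  (/ (reduce + fold-errors)
     (count fold-errors)))
```

For example, fold errors of 1/10, 2/10, and 3/10 average to an overall error of 1/5.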

There are several types of cross-validation we can implement as a diagnostic for a given machine learning model or system. Let’s briefly explore a few of them as follows:

• A common type is k-fold cross-validation, in which we partition the cross-validation data into k equal subsets. The model is then trained on k − 1 of these subsets and cross-validated on the single remaining subset, and this process is repeated so that each subset serves once as the validation set.
• A simple variation of k-fold cross-validation is 2-fold cross-validation, which is also called the holdout method. In 2-fold cross-validation, the training and cross-validation subsets of data are of nearly equal size.
• Repeated random subsampling is another simple variant of cross-validation in which the sample data is first randomized, or shuffled, and then split into training and cross-validation sets. This method has the advantage that the proportions of the training and validation sets do not depend on the number of rounds performed.
• Another form of k-fold cross-validation is leave-one-out cross-validation, in which only a single record from the available sample data is used for validation in each round. Leave-one-out cross-validation is essentially k-fold cross-validation in which k is equal to the number of samples, or observations, in the sample data.
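The variants described above all share the same partition-and-hold-out structure. As a minimal, library-free sketch in Clojure (k-fold-splits is our own illustrative helper, not part of clj-ml):

```
;; Partition the samples into k folds by index modulo k, and return,
;; for each fold, a [training validation] pair in which that fold is
;; held out for validation. `k-fold-splits` is an illustrative
;; helper, not part of clj-ml.
(defn k-fold-splits [k samples]
  (let [folds (for [i (range k)]
                (keep-indexed (fn [j x] (when (= i (mod j k)) x))
                              samples))]
    (for [i (range k)]
      [(apply concat (concat (take i folds) (drop (inc i) folds)))
       (nth folds i)])))
```

With k equal to the number of samples, each validation set contains exactly one sample, which gives leave-one-out cross-validation; with k = 2, this is the holdout method.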

Cross-validation basically treats the estimated model as a black box, that is, it makes no assumptions about the implementation of the model. We can also use cross-validation to select features in a given model by determining the feature set that produces the best-fit model over the given sample data. Of course, there are a couple of limitations of cross-validation, which can be summarized as follows:

• If a given model is needed to perform feature selection internally, we must perform cross-validation for each selected feature set in the given model. This can be computationally expensive depending on the amount of available sample data.
• Cross-validation is not very useful if the sample data comprises samples that are exactly or nearly identical, since the validation set then tells us little that the training set did not.

In summary, it’s a good practice to implement cross-validation for any machine learning system that we build. Also, we can choose an appropriate cross-validation technique depending on the problem we are trying to model as well as the nature of the collected sample data.

For the example that follows, the namespace declaration should look similar to the following declaration:

```
(ns my-namespace
  (:use [clj-ml classifiers data]))
```

We can use the clj-ml library to cross-validate the classifier we built for the fish packaging plant. Essentially, we built a classifier to determine whether a fish is a salmon or a sea bass using the clj-ml library. To recap, a fish is represented as a vector containing the category of the fish and values for the various features of the fish. The attributes of a fish are its length, width, and lightness of skin. We also described a template for a sample fish, which is defined as follows:

```
(def fish-template
  [{:category [:salmon :sea-bass]}
   :length :width :lightness])
```

The fish-template vector defined in the preceding code can be used to train a classifier with some sample data. For now, we will not concern ourselves with which classification algorithm was used to model the given training data. We can simply assume that the classifier was created using the make-classifier function from the clj-ml library. This classifier is stored in the *classifier* variable as follows:

`(def *classifier* (make-classifier ...))`

Suppose the classifier was trained with some sample data. We must now evaluate this trained classification model. To do this, we must first create some sample data to cross-validate. For the sake of simplicity, we will use randomly generated data in this example. We can generate this data using the make-sample-fish function, which simply creates a new vector of random values representing a fish. Of course, we must not forget that the make-sample-fish function has an in-built partiality, so the samples created using this function follow a meaningful pattern rather than being pure noise:

```
(def fish-cv-data
  (for [i (range 3000)]
    (make-sample-fish)))
```
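The make-sample-fish function was defined in an earlier example and is not repeated here. A hypothetical sketch of such a biased sample generator might look like the following; the category probability and feature ranges below are illustrative assumptions of ours, not the book's values:

```
;; A hypothetical sketch in the spirit of make-sample-fish. The
;; category is picked at random, and the feature values (length,
;; width, lightness) are then drawn from ranges that depend on the
;; category -- this built-in partiality is what gives the generated
;; samples a learnable pattern. All ranges here are illustrative
;; assumptions.
(defn make-biased-sample-fish []
  (if (< (rand) 0.7)
    [:salmon   (+ 5.0 (rand 2.0)) (+ 3.0 (rand 1.0)) (+ 4.0 (rand 2.0))]
    [:sea-bass (+ 8.0 (rand 3.0)) (+ 4.0 (rand 1.0)) (+ 1.0 (rand 2.0))]))
```

Each generated vector matches the shape of fish-template: the category first, followed by values for :length, :width, and :lightness.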

We will need to use a dataset from the clj-ml library, and we can create one using the make-dataset function, as shown in the following code:

```
(def fish-cv-dataset
  (make-dataset "fish-cv" fish-template fish-cv-data))
```

To cross-validate the classifier, we must use the classifier-evaluate function from the clj-ml.classifiers namespace. This function essentially performs k-fold cross-validation on the given data. Other than the classifier and the cross-validation dataset, this function requires the number of folds to be performed on the data as its last parameter. Also, we will first need to set the class field of the records in fish-cv-dataset using the dataset-set-class function. We can define a single function to perform these operations as follows:

```
(defn cv-classifier [folds]
  (dataset-set-class fish-cv-dataset 0)
  (classifier-evaluate *classifier* :cross-validation
                       fish-cv-dataset folds))
```

We will use 10 folds of cross-validation on the classifier. Since the classifier-evaluate function returns a map, we bind this return value to a variable for further use, as follows:

```user> (def cv (cv-classifier 10))
#'user/cv```

We can fetch and print the summary of the preceding cross-validation using the :summary keyword as follows:

```user> (print (:summary cv))

Correctly Classified Instances        2986              99.5333 %
Incorrectly Classified Instances        14               0.4667 %
Kappa statistic                          0.9888
Mean absolute error                      0.0093
Root mean squared error                  0.0681
Relative absolute error                  2.2248 %
Root relative squared error             14.9238 %
Total Number of Instances             3000
nil```

As shown in the preceding code, we can view several statistical measures of performance for our trained classifier. Apart from the correctly and incorrectly classified records, this summary also describes the Root Mean Squared Error (RMSE) and several other measures of error in our classifier. For a more detailed view of the correctly and incorrectly classified instances in the classifier, we can print the confusion matrix of the cross-validation using the :confusion-matrix keyword, as shown in the following code:

```user> (print (:confusion-matrix cv))
=== Confusion Matrix ===

    a    b   <-- classified as
 2129    0 |    a = salmon
    9  862 |    b = sea-bass
nil```

As shown in the preceding example, we can use the clj-ml library’s classifier-evaluate function to perform a k-fold cross-validation on any given classifier. Although we are restricted to using classifiers from the clj-ml library when using the classifier-evaluate function, we must strive to implement similar diagnostics in any machine learning system we build.