
In this article by Gavin Hackeling, author of the book Mastering Machine Learning with scikit-learn, Second Edition, we will start with K Nearest Neighbors (KNN), a simple model for regression and classification tasks. It is so simple that its name describes most of its learning algorithm. The titular neighbors are representations of training instances in a metric space. A metric space is a feature space in which the distances between all members of a set are defined.


For classification tasks, a set of tuples of feature vectors and class labels comprises the training set. KNN is capable of binary, multi-class, and multi-label classification; we will focus on binary classification in this article. The simplest KNN classifiers use the mode of the nearest neighbors' labels to classify test instances, but other strategies can be used. k is often set to an odd number to prevent ties. In regression tasks, the feature vectors are each associated with a response variable that takes a real-valued scalar instead of a label. The prediction is the mean or weighted mean of the k nearest neighbors' response variables.
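To make the regression case concrete, here is a minimal sketch, not part of the book's worked example, using scikit-learn's KNeighborsRegressor on a few hypothetical height and weight pairs. With the default uniform weights, the prediction for a query is simply the mean of its k nearest neighbors' response values; passing weights='distance' would produce a weighted mean instead.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical training data: height in cm as the feature, weight in kg as the response
X = np.array([[158], [170], [183], [191], [155], [163]])
y = np.array([64, 86, 84, 80, 49, 59])

# Predict the weight of a 175 cm person as the mean of the 3 nearest neighbors' weights
regressor = KNeighborsRegressor(n_neighbors=3)
regressor.fit(X, y)
print(regressor.predict(np.array([[175]])))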

Lazy learning and non-parametric models

KNN is a lazy learner. Also known as instance-based learners, lazy learners simply store the training data set with little or no processing. In contrast to eager learners, such as simple linear regression, KNN does not estimate the parameters of a model that generalizes the training data during a training phase. Lazy learning has advantages and disadvantages. Training an eager learner is often computationally costly, but prediction with the resulting model is often inexpensive: for simple linear regression, prediction consists only of multiplying the learned coefficient by the feature and adding the learned intercept parameter. A lazy learner can begin predicting almost immediately, since there is no training phase, but each prediction can be costly. In the simplest implementation of KNN, prediction requires calculating the distances between a test instance and all of the training instances.

In contrast to most of the other models we will discuss, KNN is a non-parametric model. A parametric model uses a fixed number of parameters, or coefficients, to define the model that summarizes the data. The number of parameters is independent of the number of training instances. Non-parametric may seem to be a misnomer, as it does not mean that the model has no parameters; rather, non-parametric means that the number of parameters of the model is not fixed, and may grow with the number of training instances.

Non-parametric models can be useful when training data is abundant and you have little prior knowledge about the relationship between the response and explanatory variables. KNN makes only one assumption: instances that are near each other are likely to have similar values of the response variable. The flexibility provided by non-parametric models is not always desirable; a model that makes assumptions about the relationship can be useful if training data is scarce or if you already know about the relationship.

Classification with KNN

The goal of classification tasks is to use one or more features to predict the value of a discrete response variable. Let’s work through a toy classification problem. Assume that you must use a person’s height and weight to predict his or her sex. This problem is called binary classification because the response variable can take one of two labels. The following table records nine training instances.

height    weight    label
158 cm    64 kg     male
170 cm    86 kg     male
183 cm    84 kg     male
191 cm    80 kg     male
155 cm    49 kg     female
163 cm    59 kg     female
180 cm    67 kg     female
158 cm    54 kg     female
170 cm    67 kg     female

We are now using features from two explanatory variables to predict the value of the response variable. KNN is not limited to two features; the algorithm can use an arbitrary number of features, but data with more than three features cannot be visualized directly. Let's visualize the data by creating a scatter plot with matplotlib.

# In[1]:
import numpy as np
import matplotlib.pyplot as plt

X_train = np.array([
 [158, 64],
 [170, 86],
 [183, 84],
 [191, 80],
 [155, 49],
 [163, 59],
 [180, 67],
 [158, 54],
 [170, 67]
])
y_train = ['male', 'male', 'male', 'male', 'female', 'female', 'female', 'female', 'female']

plt.figure()
plt.title('Human Heights and Weights by Sex')
plt.xlabel('Height in cm')
plt.ylabel('Weight in kg')

for i, x in enumerate(X_train):
    # Use 'x' markers for instances that are male and diamond markers for instances that are female
    plt.scatter(x[0], x[1], c='k', marker='x' if y_train[i] == 'male' else 'D')
plt.grid(True)
plt.show()
[Figure: scatter plot of the training data, titled 'Human Heights and Weights by Sex', with 'x' markers for male instances and diamond markers for female instances]

From the plot we can see that men, denoted by the 'x' markers, tend to be taller and weigh more than women. This observation is probably consistent with your experience. Now let's use KNN to predict whether a person with a given height and weight is a man or a woman. Let's assume that we want to predict the sex of a person who is 155 cm tall and weighs 70 kg. First, we must define our distance measure. In this case we will use Euclidean distance, the straight-line distance between points in a Euclidean space. The Euclidean distance between two points p and q in a two-dimensional space is given by the following:

d(p, q) = √((p₁ − q₁)² + (p₂ − q₂)²)

Next we must calculate the distances between the query instance and all of the training instances.

height    weight    label     distance from test instance
158 cm    64 kg     male      6.71
170 cm    86 kg     male      21.93
183 cm    84 kg     male      31.30
191 cm    80 kg     male      37.36
155 cm    49 kg     female    21.00
163 cm    59 kg     female    13.60
180 cm    67 kg     female    25.18
158 cm    54 kg     female    16.28
170 cm    67 kg     female    15.30

We will set k to 3, and select the three nearest training instances. The following script calculates the distances between the test instance and the training instances, and identifies the most common sex of the nearest neighbors.

# In[2]:
x = np.array([[155, 70]])
distances = np.sqrt(np.sum((X_train - x)**2, axis=1))
distances

# Out[2]:
array([ 6.70820393, 21.9317122 , 31.30495168, 37.36308338, 21. , 13.60147051, 25.17935662, 16.2788206 , 15.29705854])

# In[3]:
nearest_neighbor_indices = distances.argsort()[:3]
nearest_neighbor_genders = np.take(y_train, nearest_neighbor_indices)
nearest_neighbor_genders

# Out[3]:
array(['male', 'female', 'female'], 
 dtype='|S6')

# In[4]:
from collections import Counter
b = Counter(np.take(y_train, distances.argsort()[:3]))
b.most_common(1)[0][0]

# Out[4]:
'female'

The following plots the query instance, indicated by the circle, and its three nearest neighbors, indicated by the enlarged markers:

[Figure: scatter plot showing the query instance as a circle and its three nearest neighbors with enlarged markers]

Two of the neighbors are female, and one is male. We therefore predict that the test instance is female. Now let’s implement a KNN classifier using scikit-learn.

# In[5]:
from sklearn.preprocessing import LabelBinarizer
from sklearn.neighbors import KNeighborsClassifier

lb = LabelBinarizer()
y_train_binarized = lb.fit_transform(y_train)
y_train_binarized

# Out[5]:
array([[1],
       [1],
       [1],
       [1],
       [0],
       [0],
       [0],
       [0],
       [0]])

# In[6]:
K = 3
clf = KNeighborsClassifier(n_neighbors=K)
clf.fit(X_train, y_train_binarized.reshape(-1))
prediction_binarized = clf.predict(np.array([155, 70]).reshape(1, -1))[0]
predicted_label = lb.inverse_transform(prediction_binarized)
predicted_label

# Out[6]:
array(['female'], 
      dtype='|S6')

Our labels are strings, so we first use LabelBinarizer to convert them to integers. LabelBinarizer implements the transformer interface, which consists of the methods fit, transform, and fit_transform. fit prepares the transformer; in this case, it creates a mapping from label strings to integers. transform applies the mapping to input labels. fit_transform is a convenience method that calls fit and transform. A transformer should be fit only on the training set. Independently fitting and transforming the training and testing sets could result in inconsistent mappings from labels to integers; in this case, male might be mapped to 1 in the training set and 0 in the testing set. Fitting on the entire dataset should also be avoided because, for some transformers, it will leak information about the testing set into the model. This advantage won't be available when the model is used in production, so performance measures on the test set may be optimistic. We will discuss this pitfall more when we extract features from text.
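If you want to inspect the mapping the fitted LabelBinarizer learned, its classes_ attribute lists the label strings in the order they were assigned. This quick check is not part of the original walkthrough, but it continues the same session:

# classes_ lists the labels in sorted order; here 'female' maps to 0 and 'male' to 1,
# which matches the binarized training labels shown above
lb.classes_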

Next, we initialize a KNeighborsClassifier. Even though KNN is a lazy learner, it still implements the estimator interface; we call fit and predict just as we did with our simple linear regression object. Finally, we use our fitted LabelBinarizer to reverse the transformation and return a string label. Now let's use our classifier to make predictions for a test set, and evaluate its performance.

height    weight    label
168 cm    65 kg     male
180 cm    96 kg     male
160 cm    52 kg     female
169 cm    67 kg     female

# In[7]:
X_test = np.array([
 [168, 65],
 [180, 96],
 [160, 52],
 [169, 67]
])
y_test = ['male', 'male', 'female', 'female']
y_test_binarized = lb.transform(y_test)
print('Binarized labels: %s' % y_test_binarized.T[0])
predictions_binarized = clf.predict(X_test)
print('Binarized predictions: %s' % predictions_binarized)
print('Predicted labels: %s' % lb.inverse_transform(predictions_binarized))

# Out[7]:
Binarized labels: [1 1 0 0]
Binarized predictions: [0 1 0 0]
Predicted labels: ['female' 'male' 'female' 'female']

By comparing our test labels to our classifier’s predictions, we find that it incorrectly predicted that one of the male test instances was female. There are two types of errors in binary classification tasks: false positives and false negatives. There are many performance measures for classifiers; some measures may be more appropriate than others depending on the consequences of the types of errors in your application. We will assess our classifier using several common performance measures, including accuracy, precision, and recall. Accuracy is the proportion of test instances that were classified correctly. Our model classified one of the four instances incorrectly, so the accuracy is 75%.

# In[8]:
from sklearn.metrics import accuracy_score
print('Accuracy: %s' % accuracy_score(y_test_binarized, predictions_binarized))

# Out[8]:
Accuracy: 0.75
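
The two error types can also be tabulated directly. Though not part of the original walkthrough, scikit-learn's confusion_matrix shows how many instances of each true class received each predicted label; continuing the same session:

# Rows are true classes (0=female, 1=male), columns are predicted classes;
# ravel() flattens the binarized column vector to the 1-D shape expected here
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test_binarized.ravel(), predictions_binarized))

The bottom-left cell counts the single false negative: the male instance that was predicted to be female.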

Precision is the proportion of instances predicted to be positive that are truly positive. In this example the positive class is male; the assignment of male and female to the positive and negative classes is arbitrary, and could be reversed. Our classifier predicted that one of the test instances is the positive class. This instance is truly the positive class, so the classifier's precision is 100%.

# In[9]:
from sklearn.metrics import precision_score
print('Precision: %s' % precision_score(y_test_binarized, predictions_binarized))

# Out[9]:
Precision: 1.0

Recall is the proportion of truly positive test instances that were predicted to be positive. Our classifier predicted that one of the two truly positive test instances is positive. Its recall is therefore 50%.

# In[10]:
from sklearn.metrics import recall_score
print('Recall: %s' % recall_score(y_test_binarized, predictions_binarized))

# Out[10]:
Recall: 0.5

Sometimes it is useful to summarize precision and recall with a single statistic, called the F1-score or F1-measure. The F1-score is the harmonic mean of precision and recall.
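Concretely, F1 = 2 · (precision · recall) / (precision + recall). With the precision of 1.0 and recall of 0.5 computed above, F1 = 2 · 0.5 / 1.5 ≈ 0.67, which matches the output below.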

# In[11]:
from sklearn.metrics import f1_score
print('F1 score: %s' % f1_score(y_test_binarized, predictions_binarized))

# Out[11]:
F1 score: 0.666666666667

Note that the arithmetic mean of the precision and recall scores is an upper bound of the F1 score; the F1 score penalizes classifiers more as the difference between their precision and recall scores increases. Finally, the Matthews correlation coefficient (MCC) is an alternative to the F1 score for measuring the performance of binary classifiers. A perfect classifier's MCC is 1, a trivial classifier that predicts randomly will score around 0, and a perfectly wrong classifier will score -1. MCC is useful even when the proportions of the classes in the test set are severely imbalanced.
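In terms of the confusion matrix counts, MCC = (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). For our predictions, TP = 1, TN = 2, FP = 0, and FN = 1, so MCC = 2 / √12 ≈ 0.577, which agrees with the value scikit-learn computes below.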

# In[12]:
from sklearn.metrics import matthews_corrcoef
print('Matthews correlation coefficient: %s' % matthews_corrcoef(y_test_binarized, predictions_binarized))

# Out[12]:
Matthews correlation coefficient: 0.57735026919

scikit-learn also provides a convenience function, classification_report, that reports the precision, recall, and F1 score.

# In[13]:
from sklearn.metrics import classification_report
print(classification_report(y_test_binarized, predictions_binarized, target_names=['male'], labels=[1]))

# Out[13]:
             precision recall f1-score support
       male       1.00   0.50     0.67       2
avg / total       1.00   0.50     0.67       2

Summary

In this article we learned about K Nearest Neighbors, and saw that KNN is a lazy learner and a non-parametric model. We also worked through a binary classification task with KNN and evaluated the classifier's performance with accuracy, precision, recall, the F1 score, and the Matthews correlation coefficient.
