
Note: The following is an excerpt from Dávid Natingga's Data Science Algorithms in a Week.

The nearest neighbor algorithm classifies a data instance based on its neighbors: the class that the k-nearest neighbor (k-NN) algorithm assigns to a data instance is the class with the highest representation among its k closest neighbors.

In this short tutorial, we will cover the basics of the k-NN algorithm, understanding it and walking through its implementation with a simple example: Mary and her temperature preferences. So let's get right to it, shall we?

Mary and her temperature preferences

As an example, if we know that our friend Mary feels cold when it is 10 degrees Celsius, but warm when it is 25 degrees Celsius, then in a room where it is 22 degrees Celsius, the nearest neighbor algorithm would guess that our friend would feel warm, because 22 is closer to 25 than to 10.
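
A minimal sketch of this one-dimensional reasoning in Python (using the room temperature and known answers from the example above) might look like this:

# 1-NN on a single feature: pick the known temperature closest to the room's.
known = {10: 'cold', 25: 'warm'}
room = 22
nearest = min(known, key=lambda t: abs(t - room))
print(known[nearest])  # prints 'warm', since |22 - 25| < |22 - 10|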

Suppose we would like to know when Mary feels warm and when she feels cold, as in the previous example, but in addition, wind speed data was also recorded whenever Mary was asked whether she felt warm or cold:

Temperature in degrees Celsius   Wind speed in km/h   Mary's perception
10                               0                    cold
25                               0                    warm
15                               5                    cold
20                               3                    warm
18                               7                    cold
20                               10                   cold
22                               5                    warm
24                               6                    warm

We could represent this data in a graph, with the temperature on the horizontal axis, the wind speed on the vertical axis, and each point colored by Mary's answer.

Now, suppose we would like to find out how Mary feels at a temperature of 16 degrees Celsius with a wind speed of 3 km/h using the 1-NN algorithm.

For simplicity, we will use the Manhattan metric to measure the distance between neighbors on the grid. The Manhattan distance d_Man between a neighbor N1 = (x1, y1) and a neighbor N2 = (x2, y2) is defined as d_Man = |x1 - x2| + |y1 - y2|.
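
As a quick check in Python (a hypothetical helper, not part of the book's listings), we can compute the Manhattan distances from the point in question at (16, 3):

def manhattan(n1, n2):
    # Manhattan distance: sum of the absolute coordinate differences.
    (x1, y1), (x2, y2) = n1, n2
    return abs(x1 - x2) + abs(y1 - y2)

print(manhattan((16, 3), (15, 5)))  # 3 units to the cold neighbor at (15, 5)
print(manhattan((16, 3), (20, 3)))  # 4 units to the warm neighbor at (20, 3)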

Let us label the grid with distances to see which neighbor with a known class is closest to the point we would like to classify.

We can see that the closest neighbor with a known class is the blue (cold) one at a temperature of 15 degrees Celsius and a wind speed of 5 km/h. Its distance from the point in question is three units and its class is blue (cold). The closest red (warm) neighbor is four units away from the point in question. Since we are using the 1-nearest neighbor algorithm, we look only at the closest neighbor, and therefore the class of the point in question should be blue (cold).

By applying this procedure to every data point, we can complete the graph, assigning a class to every point on the grid.

Note that sometimes a data point can be equally distant from known points of two different classes: for example, the point at 20 degrees Celsius and 6 km/h. In such situations, we could prefer one class over the other or ignore these boundary cases; the actual result depends on the specific implementation of the algorithm.
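
We can verify this particular boundary case directly, reading the neighbors off the data above:

print(abs(20 - 18) + abs(6 - 7))  # 3 units to the cold neighbor at (18, 7)
print(abs(20 - 22) + abs(6 - 5))  # 3 units to the warm neighbor at (22, 5)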

Implementation of the k-nearest neighbors algorithm in Python

We’ll implement the k-NN algorithm in Python to find Mary’s temperature preference:

# source_code/1/mary_and_temperature_preferences/knn_to_data.py
# Applies the knn algorithm to the input data.
# The input text file is expected to contain one line per data entry,
# each consisting of the temperature in degrees Celsius, the wind speed
# in km/h, and the classification cold/warm.

import sys
sys.path.append('..')
sys.path.append('../../common')
import knn # noqa
import common # noqa

# Program start
# E.g. "mary_and_temperature_preferences.data"
input_file = sys.argv[1]
# E.g. "mary_and_temperature_preferences_completed.data"
output_file = sys.argv[2]
k = int(sys.argv[3])       # number of nearest neighbors to consider
x_from = int(sys.argv[4])  # lower bound of the temperature range
x_to = int(sys.argv[5])    # upper bound of the temperature range
y_from = int(sys.argv[6])  # lower bound of the wind speed range
y_to = int(sys.argv[7])    # upper bound of the wind speed range

data = common.load_3row_data_to_dic(input_file)
new_data = knn.knn_to_2d_data(data, x_from, x_to, y_from, y_to, k)
common.save_3row_data_from_dic(output_file, new_data)

# source_code/common/common.py
# ***Library with common routines and functions***
# Increase the count stored in dic under key by one, starting from one
# if the key is not present yet. A key of None (unknown class) is
# counted as well; it is ignored later when the majority class is chosen.
def dic_inc(dic, key):
    if key is None:
        pass
    if dic.get(key, None) is None:
        dic[key] = 1
    else:
        dic[key] = dic[key] + 1
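
The load and save routines from common.py are not reproduced in this excerpt. A minimal sketch consistent with the three-column data format and the coordinate-keyed dictionary used by knn_to_2d_data could look as follows (the exact implementation in the book's code may differ):

def load_3row_data_to_dic(input_file_name):
    # Read "x y class" lines into a dictionary keyed by (x, y).
    data = {}
    with open(input_file_name) as f:
        for line in f:
            if line.strip():
                x, y, group = line.split()
                data[(int(x), int(y))] = group
    return data

def save_3row_data_from_dic(output_file_name, data):
    # Write each (x, y) -> class entry back out as an "x y class" line.
    with open(output_file_name, 'w') as f:
        for (x, y), group in data.items():
            f.write('%d %d %s\n' % (x, y, group))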

# source_code/1/knn.py
# ***Library implementing the knn algorithm***

import common

def info_reset(info):
    info['nbhd_count'] = 0
    info['class_count'] = {}

# Find the class of a neighbor with the coordinates x, y.
# If the class is known, count that neighbor.
def info_add(info, data, x, y):
    group = data.get((x, y), None)
    common.dic_inc(info['class_count'], group)
    info['nbhd_count'] += int(group is not None)

# Apply the knn algorithm to the 2d data using the k nearest neighbors
# with the Manhattan distance.
# The dictionary data comes in the form with keys being 2d coordinates
# and the values being the class.
# x, y are integer coordinates for the 2d data with the range
# [x_from, x_to] x [y_from, y_to].
def knn_to_2d_data(data, x_from, x_to, y_from, y_to, k):
    new_data = {}
    info = {}
    # Go through every point in an integer coordinate system.
    for y in range(y_from, y_to + 1):
        for x in range(x_from, x_to + 1):
            info_reset(info)
            # Count the number of neighbors for each class group for
            # every distance dist starting at 0 until at least k
            # neighbors with known classes are found.
            for dist in range(0, x_to - x_from + y_to - y_from + 1):
                # Count all neighbors that are distanced dist from
                # the point [x, y].
                if dist == 0:
                    info_add(info, data, x, y)
                else:
                    for i in range(0, dist + 1):
                        info_add(info, data, x - i, y + dist - i)
                        info_add(info, data, x + dist - i, y - i)
                    for i in range(1, dist):
                        info_add(info, data, x + i, y + dist - i)
                        info_add(info, data, x - dist + i, y - i)
                # There could be more than k closest neighbors if
                # several of them lie at the same distance from the
                # point [x, y]. But as soon as we have at least k of
                # them, we break out of the loop.
                if info['nbhd_count'] >= k:
                    break
            class_max_count = None
            # Choose the class with the highest count of the neighbors
            # from among the k closest neighbors.
            for group, count in info['class_count'].items():
                if group is not None and (class_max_count is None or
                        count > info['class_count'][class_max_count]):
                    class_max_count = group
            new_data[x, y] = class_max_count
    return new_data
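
As a quick sanity check (a hypothetical session, not from the book, run from the source_code/1 directory so that the common library is importable), we can apply knn_to_2d_data to a tiny two-point dataset:

import sys
sys.path.append('../common')
import knn

data = {(0, 0): 'cold', (2, 2): 'warm'}
print(knn.knn_to_2d_data(data, 0, 2, 0, 2, 1))
# Points near (0, 0) come out 'cold' and points near (2, 2) come out
# 'warm'; points equidistant from both, such as (1, 1), are boundary
# ties of the kind discussed earlier.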

Input:

The program above will use the file below as the source of the input data. The file contains the table with Mary's known temperature preferences:

# source_code/1/mary_and_temperature_preferences/mary_and_temperature_preferences.data
10 0 cold
25 0 warm
15 5 cold
20 3 warm
18 7 cold
20 10 cold
22 5 warm
24 6 warm

Output:

We run the implementation above on the input file mary_and_temperature_preferences.data, using the k-NN algorithm with k=1. The algorithm classifies all the points with integer coordinates in the rectangle with a size of (30-5=25) by (10-0=10), that is, a total of (25+1) * (10+1) = 286 integer points (adding one to count the points on the boundaries). Using the wc command, we find out that the output file contains exactly 286 lines, one data item per point. Using the head command, we display the first 10 lines from the output file. We visualize all the data from the output file in the next section:

$ python knn_to_data.py mary_and_temperature_preferences.data \
    mary_and_temperature_preferences_completed.data 1 5 30 0 10

$ wc -l mary_and_temperature_preferences_completed.data
286 mary_and_temperature_preferences_completed.data

$ head -10 mary_and_temperature_preferences_completed.data

7 3 cold
6 9 cold
12 1 cold
16 6 cold
16 9 cold
14 4 cold
13 4 cold
19 4 warm
18 4 cold
15 1 cold
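
If we want to go beyond the first ten lines, a short Python snippet (an extra check, not part of the book) can summarize how many points ended up in each class:

from collections import Counter

with open('mary_and_temperature_preferences_completed.data') as f:
    print(Counter(line.split()[2] for line in f if line.strip()))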

So, there you have it! The k-nearest neighbors algorithm explained and implemented in Python. I hope you enjoyed this tutorial and found it interesting. If you want more, go ahead and purchase Dávid Natingga's Data Science Algorithms in a Week, from which this tutorial was extracted.

