
In this article by Gavin Hackeling, the author of Mastering Machine Learning with scikit-learn, we will discuss an unsupervised learning task called clustering. Clustering is used to find groups of similar observations within a set of unlabeled data. We will discuss the K-Means clustering algorithm, apply it to an image compression problem, and learn to measure its performance. Finally, we will work through a semi-supervised learning problem that combines clustering with classification.

Clustering, or cluster analysis, is the task of grouping observations such that members of the same group, or cluster, are more similar to each other by some metric than they are to the members of the other clusters. As with supervised learning, we will represent an observation as an n-dimensional vector. For example, assume that your training data consists of the samples plotted in the following figure:

[Figure: scatter plot of the unlabeled training samples]

Clustering might reveal the following two groups, indicated by squares and circles.

[Figure: the samples grouped into two clusters, indicated by squares and circles]

Clustering could also reveal the following four groups:

[Figure: the samples grouped into four clusters]

Clustering is commonly used to explore a data set. Social networks can be clustered to identify communities, and to suggest missing connections between people. In biology, clustering is used to find groups of genes with similar expression patterns. Recommendation systems sometimes employ clustering to identify products or media that might appeal to a user. In marketing, clustering is used to find segments of similar consumers. In the following sections we will work through an example of using the K-Means algorithm to cluster a data set.

Clustering with the K-Means Algorithm

The K-Means algorithm is a clustering method that is popular because of its speed and scalability. K-Means is an iterative process of moving the centers of the clusters, or the centroids, to the mean position of their constituent points, and re-assigning instances to their closest clusters. The titular K is a hyperparameter that specifies the number of clusters that should be created; K-Means automatically assigns observations to clusters but cannot determine the appropriate number of clusters. K must be a positive integer that is less than the number of instances in the training set. Sometimes the number of clusters is specified by the clustering problem’s context. For example, a company that manufactures shoes might know that it is able to support manufacturing three new models. To understand what groups of customers to target with each model, it surveys customers and creates three clusters from the results. That is, the value of K was specified by the problem’s context. Other problems may not require a specific number of clusters, and the optimal number of clusters may be ambiguous. We will discuss a heuristic for estimating the optimal number of clusters called the elbow method later in this article.
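In scikit-learn, the number of clusters is specified through the KMeans estimator's n_clusters parameter. The following is a minimal sketch of how the shoe manufacturer's three-segment problem might be set up; the survey_responses array is an invented stand-in for real survey data:

>>> import numpy as np
>>> from sklearn.cluster import KMeans

>>> # Invented stand-in data: one row per surveyed customer, one column per answer.
>>> survey_responses = np.random.uniform(0, 10, (100, 4))

>>> # K is fixed up front by the problem's context: three shoe models, three segments.
>>> kmeans = KMeans(n_clusters=3)
>>> kmeans.fit(survey_responses)
>>> print(kmeans.cluster_centers_)  # one centroid per segment
>>> print(kmeans.labels_[:10])      # segment assignments for the first ten customers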

The parameters of K-Means are the positions of the clusters' centroids and the observations that are assigned to each cluster. As with generalized linear models and decision trees, the optimal values of K-Means' parameters are found by minimizing a cost function. The cost function for K-Means is given by the following equation:

J = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2

where µk is the centroid for cluster k. The cost function sums the distortions of the clusters. Each cluster’s distortion is equal to the sum of the squared distances between its centroid and its constituent instances. The distortion is small for compact clusters, and large for clusters that contain scattered instances. The parameters that minimize the cost function are learned through an iterative process of assigning observations to clusters and then moving the clusters. First, the clusters’ centroids are initialized to random positions. In practice, setting the centroids’ positions equal to the positions of randomly selected observations yields the best results. During each iteration, K-Means assigns observations to the cluster that they are closest to, and then moves the centroids to their assigned observations’ mean location. Let’s work through an example by hand using the training data shown in the following table.

Instance   X0   X1
1           7    5
2           5    7
3           7    7
4           3    3
5           4    6
6           1    4
7           0    0
8           2    2
9           8    7
10          6    8
11          5    5
12          3    7

There are two explanatory variables; each instance has two features. The instances are plotted in the following figure.

[Figure: scatter plot of the twelve training instances]

Assume that K-Means initializes the centroid for the first cluster to the fifth instance and the centroid for the second cluster to the eleventh instance. For each instance, we will calculate its distance to both centroids, and assign it to the cluster with the closest centroid. The initial assignments are shown in the “Cluster” column of the following table.

Instance   X0   X1   C1 distance   C2 distance   Last Cluster   Cluster   Changed?
1          7    5    3.16228       2             None           C2        Yes
2          5    7    1.41421       2             None           C1        Yes
3          7    7    3.16228       2.82843       None           C2        Yes
4          3    3    3.16228       2.82843       None           C2        Yes
5          4    6    0             1.41421       None           C1        Yes
6          1    4    3.60555       4.12311       None           C1        Yes
7          0    0    7.21110       7.07107       None           C2        Yes
8          2    2    4.47214       4.24264       None           C2        Yes
9          8    7    4.12311       3.60555       None           C2        Yes
10         6    8    2.82843       3.16228       None           C1        Yes
11         5    5    1.41421       0             None           C2        Yes
12         3    7    1.41421       2.82843       None           C1        Yes

C1 centroid: (4, 6)
C2 centroid: (5, 5)

The plotted centroids and the initial cluster assignments are shown in the following graph. Instances assigned to the first cluster are marked with “Xs”, and instances assigned to the second cluster are marked with dots. The markers for the centroids are larger than the markers for the instances.

[Figure: the initial centroids and cluster assignments]
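The distances, assignments, and distortion in the preceding table can be reproduced with a short NumPy sketch. This is only an illustration of the assignment step and the cost function, not scikit-learn's implementation; cluster index 0 corresponds to C1 and index 1 to C2:

>>> import numpy as np

>>> X = np.array([[7, 5], [5, 7], [7, 7], [3, 3], [4, 6], [1, 4],
...               [0, 0], [2, 2], [8, 7], [6, 8], [5, 5], [3, 7]])
>>> centroids = np.array([[4, 6], [5, 5]])  # initialized to instances 5 and 11

>>> # Assignment step: Euclidean distance from every instance to every centroid.
>>> distances = np.linalg.norm(X[:, np.newaxis, :] - centroids, axis=2)
>>> assignments = distances.argmin(axis=1)
>>> print(assignments)

>>> # The cost function J sums the squared distances to the assigned centroids.
>>> print(np.sum(distances.min(axis=1) ** 2))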

Now we will move both centroids to the means of their constituent instances, re-calculate the distances of the training instances to the centroids, and re-assign the instances to the closest centroids.

Instance   X0   X1   C1 distance   C2 distance   Last Cluster   New Cluster   Changed?
1          7    5    3.492850      2.575394      C2             C2            No
2          5    7    1.341641      2.889107      C1             C1            No
3          7    7    3.255764      3.749830      C2             C1            Yes
4          3    3    3.492850      1.943067      C2             C2            No
5          4    6    0.447214      1.943067      C1             C1            No
6          1    4    3.687818      3.574285      C1             C2            Yes
7          0    0    7.443118      6.169378      C2             C2            No
8          2    2    4.753946      3.347250      C2             C2            No
9          8    7    4.242641      4.463000      C2             C1            Yes
10         6    8    2.720294      4.113194      C1             C1            No
11         5    5    1.843909      0.958315      C2             C2            No
12         3    7    1             3.260775      C1             C1            No

C1 centroid: (3.8, 6.4)
C2 centroid: (4.571429, 4.142857)

The new clusters are plotted in the following graph. Note that the centroids are diverging, and several instances have changed their assignments.

[Figure: the centroids and cluster assignments after the first update]

Now we will move the centroids to the means of their constituents’ locations again, and re-assign the instances to their nearest centroids. The centroids continue to diverge, as shown in the following figure.

[Figure: the centroids and cluster assignments after the second update]

None of the instances' centroid assignments will change in the next iteration; K-Means will continue iterating until some stopping criterion is satisfied. Usually, this criterion is either a threshold for the difference between the values of the cost function on subsequent iterations, or a threshold for the change in the positions of the centroids between subsequent iterations. If these thresholds are small enough, K-Means will converge on an optimum, but this optimum will not necessarily be the global optimum.
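For illustration, a from-scratch sketch of this assign-and-update loop with a centroid-movement threshold might look like the following. kmeans_simple is a hypothetical helper, not scikit-learn's implementation, and it assumes no cluster ever becomes empty:

>>> import numpy as np

>>> def kmeans_simple(X, k, tol=1e-4, max_iter=100):
...     rng = np.random.default_rng(0)
...     # Initialize the centroids at randomly selected observations.
...     centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
...     for _ in range(max_iter):
...         # Assignment step: each instance joins the cluster of its nearest centroid.
...         distances = np.linalg.norm(X[:, np.newaxis, :] - centroids, axis=2)
...         assignments = distances.argmin(axis=1)
...         # Update step: move each centroid to the mean of its assigned instances.
...         new_centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
...         # Stop when the centroids' positions change by less than the tolerance.
...         if np.linalg.norm(new_centroids - centroids) < tol:
...             break
...         centroids = new_centroids
...     return centroids, assignments

>>> centroids, assignments = kmeans_simple(X, k=2)  # X from the sketch above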

Local Optima

Recall that K-Means initially sets the positions of the clusters’ centroids to the positions of randomly selected observations. Sometimes the random initialization is unlucky, and the centroids are set to positions that cause K-Means to converge to a local optimum. For example, assume that K-Means randomly initializes two cluster centroids to the following positions:

[Figure: an unlucky random initialization of the two centroids]

K-Means will eventually converge on a local optimum like the one shown in the following figure. These clusters may be informative, but it is more likely that the top and bottom groups of observations form more informative clusters. To avoid local optima, K-Means is often repeated dozens or hundreds of times; for each repetition, the centroids are initialized to different random starting positions, and the initialization that produces the lowest value of the cost function is kept.

[Figure: the local optimum that K-Means converges to from this initialization]
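scikit-learn's KMeans performs these repeated random initializations automatically through its n_init parameter and keeps the run with the lowest value of the cost function, exposed as inertia_. A brief sketch, assuming the X array from the earlier example:

>>> from sklearn.cluster import KMeans

>>> # Repeat the random initialization 50 times and keep the best run.
>>> kmeans = KMeans(n_clusters=2, init='random', n_init=50)
>>> kmeans.fit(X)
>>> print(kmeans.inertia_)  # cost of the best initialization found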

The Elbow Method

If K is not specified by the problem’s context, the optimal number of clusters can be estimated using a technique called the elbow method. The elbow method plots the value of the cost function produced by different values of K. As K increases, the average distortion will decrease; each cluster will have fewer constituent instances, and the instances will be closer to their respective centroids. However, the improvements to the average distortion will decline as K increases. The value of K at which the improvement to the distortion declines the most is called the elbow.

Let’s use the elbow method to choose the number of clusters for a data set. The following scatter plot visualizes a data set with two obvious clusters.

[Figure: scatter plot of a data set with two obvious clusters]

We will calculate and plot the mean distortion of the clusters for each value of K from one to ten with the following:

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from scipy.spatial.distance import cdist
>>> import matplotlib.pyplot as plt

>>> # Generate two well-separated clusters of ten points each.
>>> cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
>>> cluster2 = np.random.uniform(3.5, 4.5, (2, 10))
>>> X = np.hstack((cluster1, cluster2)).T

>>> K = range(1, 11)
>>> meandistortions = []
>>> for k in K:
...     kmeans = KMeans(n_clusters=k)
...     kmeans.fit(X)
...     meandistortions.append(sum(np.min(
...         cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
>>> plt.plot(K, meandistortions, 'bx-')
>>> plt.xlabel('k')
>>> plt.ylabel('Average distortion')
>>> plt.title('Selecting k with the Elbow Method')
>>> plt.show()

[Figure: elbow plot of average distortion against k for the two-cluster data set]

The average distortion improves rapidly as we increase K from one to two. There is little improvement for values of K greater than two. Now let’s use the elbow method on the following data set with three clusters:

[Figure: scatter plot of a data set with three clusters]
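The code that generated this data set is not shown; one plausible way to produce a similar three-cluster data set, continuing the session above and re-running the same loop, would be:

>>> cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
>>> cluster2 = np.random.uniform(3.5, 4.5, (2, 10))
>>> cluster3 = np.random.uniform(6.5, 7.5, (2, 10))
>>> X = np.hstack((cluster1, cluster2, cluster3)).T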

The following is the elbow plot for the data set. From this we can see that the rate of improvement to the average distortion declines the most when adding a fourth cluster. That is, the elbow method confirms that K should be set to three for this data set.

[Figure: elbow plot for the three-cluster data set]

Summary

In this article, we explained what clustering is and worked through the K-Means algorithm, both by hand and with scikit-learn. We also saw how unlucky random initializations can lead K-Means to a local optimum, and how the elbow method can be used to estimate an appropriate number of clusters.
