
In this article by Gavin Hackeling, the author of Mastering Machine Learning with scikit-learn, we will discuss an unsupervised learning task called clustering. Clustering is used to find groups of similar observations within a set of unlabeled data. We will discuss the K-Means clustering algorithm, apply it to an image compression problem, and learn to measure its performance. Finally, we will work through a semi-supervised learning problem that combines clustering with classification.

Clustering, or cluster analysis, is the task of grouping observations such that members of the same group, or cluster, are more similar to each other by some metric than they are to the members of the other clusters. As with supervised learning, we will represent an observation as an n-dimensional vector. For example, assume that your training data consists of the samples plotted in the following figure:

[Figure: scatter plot of the unlabeled training samples]

Clustering might reveal the following two groups, indicated by squares and circles.

[Figure: the samples grouped into two clusters, indicated by squares and circles]

Clustering could also reveal the following four groups:

[Figure: the samples grouped into four clusters]

Clustering is commonly used to explore a data set. Social networks can be clustered to identify communities, and to suggest missing connections between people. In biology, clustering is used to find groups of genes with similar expression patterns. Recommendation systems sometimes employ clustering to identify products or media that might appeal to a user. In marketing, clustering is used to find segments of similar consumers. In the following sections we will work through an example of using the K-Means algorithm to cluster a data set.

Clustering with the K-Means Algorithm

The K-Means algorithm is a clustering method that is popular because of its speed and scalability. K-Means is an iterative process of moving the centers of the clusters, or the centroids, to the mean position of their constituent points, and re-assigning instances to their closest clusters. The titular K is a hyperparameter that specifies the number of clusters that should be created; K-Means automatically assigns observations to clusters but cannot determine the appropriate number of clusters. K must be a positive integer that is less than the number of instances in the training set. Sometimes the number of clusters is specified by the clustering problem’s context. For example, a company that manufactures shoes might know that it is able to support manufacturing three new models. To understand what groups of customers to target with each model, it surveys customers and creates three clusters from the results. That is, the value of K was specified by the problem’s context. Other problems may not require a specific number of clusters, and the optimal number of clusters may be ambiguous. We will discuss a heuristic for estimating the optimal number of clusters called the elbow method later in this article.
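In scikit-learn, the number of clusters is specified through the KMeans estimator's n_clusters parameter. The following is a minimal sketch of how the shoe manufacturer's three-segment problem might be set up; the survey_responses array is an invented stand-in for real survey data:

>>> import numpy as np
>>> from sklearn.cluster import KMeans

>>> # Invented stand-in data: one row per surveyed customer, one column per answer.
>>> survey_responses = np.random.uniform(0, 10, (100, 4))

>>> # K is fixed up front by the problem's context: three shoe models, three segments.
>>> kmeans = KMeans(n_clusters=3)
>>> kmeans.fit(survey_responses)
>>> print(kmeans.cluster_centers_)  # one centroid per segment
>>> print(kmeans.labels_[:10])      # segment assignments for the first ten customers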

The parameters of K-Means are the positions of the clusters' centroids and the observations that are assigned to each cluster. As with generalized linear models and decision trees, the optimal values of K-Means' parameters are found by minimizing a cost function. The cost function for K-Means is given by the following equation:

J = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \mu_k \rVert^2

where µk is the centroid for cluster k. The cost function sums the distortions of the clusters. Each cluster’s distortion is equal to the sum of the squared distances between its centroid and its constituent instances. The distortion is small for compact clusters, and large for clusters that contain scattered instances. The parameters that minimize the cost function are learned through an iterative process of assigning observations to clusters and then moving the clusters. First, the clusters’ centroids are initialized to random positions. In practice, setting the centroids’ positions equal to the positions of randomly selected observations yields the best results. During each iteration, K-Means assigns observations to the cluster that they are closest to, and then moves the centroids to their assigned observations’ mean location. Let’s work through an example by hand using the training data shown in the following table.

Instance   X0   X1
1           7    5
2           5    7
3           7    7
4           3    3
5           4    6
6           1    4
7           0    0
8           2    2
9           8    7
10          6    8
11          5    5
12          3    7

There are two explanatory variables; each instance has two features. The instances are plotted in the following figure.

[Figure: scatter plot of the twelve training instances]

Assume that K-Means initializes the centroid for the first cluster to the fifth instance and the centroid for the second cluster to the eleventh instance. For each instance, we will calculate its distance to both centroids, and assign it to the cluster with the closest centroid. The initial assignments are shown in the “Cluster” column of the following table.

Instance   X0   X1   C1 distance   C2 distance   Last Cluster   Cluster   Changed?
1          7    5    3.16228       2             None           C2        Yes
2          5    7    1.41421       2             None           C1        Yes
3          7    7    3.16228       2.82843       None           C2        Yes
4          3    3    3.16228       2.82843       None           C2        Yes
5          4    6    0             1.41421       None           C1        Yes
6          1    4    3.60555       4.12311       None           C1        Yes
7          0    0    7.21110       7.07107       None           C2        Yes
8          2    2    4.47214       4.24264       None           C2        Yes
9          8    7    4.12311       3.60555       None           C2        Yes
10         6    8    2.82843       3.16228       None           C1        Yes
11         5    5    1.41421       0             None           C2        Yes
12         3    7    1.41421       2.82843       None           C1        Yes

C1 centroid: (4, 6)
C2 centroid: (5, 5)

The plotted centroids and the initial cluster assignments are shown in the following graph. Instances assigned to the first cluster are marked with “Xs”, and instances assigned to the second cluster are marked with dots. The markers for the centroids are larger than the markers for the instances.

[Figure: the initial centroids and cluster assignments]
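The distances, assignments, and distortion in the preceding table can be reproduced with a short NumPy sketch. This is only an illustration of the assignment step and the cost function, not scikit-learn's implementation; cluster index 0 corresponds to C1 and index 1 to C2:

>>> import numpy as np

>>> X = np.array([[7, 5], [5, 7], [7, 7], [3, 3], [4, 6], [1, 4],
...               [0, 0], [2, 2], [8, 7], [6, 8], [5, 5], [3, 7]])
>>> centroids = np.array([[4, 6], [5, 5]])  # initialized to instances 5 and 11

>>> # Assignment step: Euclidean distance from every instance to every centroid.
>>> distances = np.linalg.norm(X[:, np.newaxis, :] - centroids, axis=2)
>>> assignments = distances.argmin(axis=1)
>>> print(assignments)

>>> # The cost function J sums the squared distances to the assigned centroids.
>>> print(np.sum(distances.min(axis=1) ** 2))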

Now we will move both centroids to the means of their constituent instances, re-calculate the distances of the training instances to the centroids, and re-assign the instances to the closest centroids.

Instance   X0   X1   C1 distance   C2 distance   Last Cluster   New Cluster   Changed?
1          7    5    3.492850      2.575394      C2             C2            No
2          5    7    1.341641      2.889107      C1             C1            No
3          7    7    3.255764      3.749830      C2             C1            Yes
4          3    3    3.492850      1.943067      C2             C2            No
5          4    6    0.447214      1.943067      C1             C1            No
6          1    4    3.687818      3.574285      C1             C2            Yes
7          0    0    7.443118      6.169378      C2             C2            No
8          2    2    4.753946      3.347250      C2             C2            No
9          8    7    4.242641      4.463000      C2             C1            Yes
10         6    8    2.720294      4.113194      C1             C1            No
11         5    5    1.843909      0.958315      C2             C2            No
12         3    7    1             3.260775      C1             C1            No

C1 centroid: (3.8, 6.4)
C2 centroid: (4.571429, 4.142857)

The new clusters are plotted in the following graph. Note that the centroids are diverging, and several instances have changed their assignments.

[Figure: the centroids and cluster assignments after the first update]

Now we will move the centroids to the means of their constituents’ locations again, and re-assign the instances to their nearest centroids. The centroids continue to diverge, as shown in the following figure.

[Figure: the centroids and cluster assignments after the second update]

None of the instances' centroid assignments will change in the next iteration; K-Means will continue iterating until some stopping criterion is satisfied. Usually, this criterion is either a threshold for the difference between the values of the cost function on subsequent iterations, or a threshold for the change in the positions of the centroids between subsequent iterations. If these thresholds are small enough, K-Means will converge on an optimum, but this optimum will not necessarily be the global optimum.
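For illustration, a from-scratch sketch of this assign-and-update loop with a centroid-movement threshold might look like the following. kmeans_simple is a hypothetical helper, not scikit-learn's implementation, and it assumes no cluster ever becomes empty:

>>> import numpy as np

>>> def kmeans_simple(X, k, tol=1e-4, max_iter=100):
...     rng = np.random.default_rng(0)
...     # Initialize the centroids at randomly selected observations.
...     centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
...     for _ in range(max_iter):
...         # Assignment step: each instance joins the cluster of its nearest centroid.
...         distances = np.linalg.norm(X[:, np.newaxis, :] - centroids, axis=2)
...         assignments = distances.argmin(axis=1)
...         # Update step: move each centroid to the mean of its assigned instances.
...         new_centroids = np.array([X[assignments == j].mean(axis=0) for j in range(k)])
...         # Stop when the centroids' positions change by less than the tolerance.
...         if np.linalg.norm(new_centroids - centroids) < tol:
...             break
...         centroids = new_centroids
...     return centroids, assignments

>>> centroids, assignments = kmeans_simple(X, k=2)  # X from the sketch above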

Local Optima

Recall that K-Means initially sets the positions of the clusters’ centroids to the positions of randomly selected observations. Sometimes the random initialization is unlucky, and the centroids are set to positions that cause K-Means to converge to a local optimum. For example, assume that K-Means randomly initializes two cluster centroids to the following positions:

[Figure: an unlucky random initialization of the two centroids]

K-Means will eventually converge on a local optimum like the one shown in the following figure. These clusters may be informative, but it is more likely that the top and bottom groups of observations form more informative clusters. To avoid local optima, K-Means is often repeated dozens or hundreds of times; for each repetition, the centroids are initialized to different random starting positions, and the initialization that produces the lowest value of the cost function is kept.

[Figure: the local optimum that K-Means converges to from this initialization]
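scikit-learn's KMeans performs these repeated random initializations automatically through its n_init parameter and keeps the run with the lowest value of the cost function, exposed as inertia_. A brief sketch, assuming the X array from the earlier example:

>>> from sklearn.cluster import KMeans

>>> # Repeat the random initialization 50 times and keep the best run.
>>> kmeans = KMeans(n_clusters=2, init='random', n_init=50)
>>> kmeans.fit(X)
>>> print(kmeans.inertia_)  # cost of the best initialization found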

The Elbow Method

If K is not specified by the problem’s context, the optimal number of clusters can be estimated using a technique called the elbow method. The elbow method plots the value of the cost function produced by different values of K. As K increases, the average distortion will decrease; each cluster will have fewer constituent instances, and the instances will be closer to their respective centroids. However, the improvements to the average distortion will decline as K increases. The value of K at which the improvement to the distortion declines the most is called the elbow.

Let’s use the elbow method to choose the number of clusters for a data set. The following scatter plot visualizes a data set with two obvious clusters.

[Figure: scatter plot of a data set with two obvious clusters]

We will calculate and plot the mean distortion of the clusters for each value of K from one to ten with the following:

>>> import numpy as np
>>> from sklearn.cluster import KMeans
>>> from scipy.spatial.distance import cdist
>>> import matplotlib.pyplot as plt

>>> # Generate two well-separated clusters of ten points each.
>>> cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
>>> cluster2 = np.random.uniform(3.5, 4.5, (2, 10))
>>> X = np.hstack((cluster1, cluster2)).T

>>> K = range(1, 11)
>>> meandistortions = []
>>> for k in K:
...     kmeans = KMeans(n_clusters=k)
...     kmeans.fit(X)
...     meandistortions.append(sum(np.min(
...         cdist(X, kmeans.cluster_centers_, 'euclidean'), axis=1)) / X.shape[0])
>>> plt.plot(K, meandistortions, 'bx-')
>>> plt.xlabel('k')
>>> plt.ylabel('Average distortion')
>>> plt.title('Selecting k with the Elbow Method')
>>> plt.show()

[Figure: elbow plot of average distortion against k for the two-cluster data set]

The average distortion improves rapidly as we increase K from one to two. There is little improvement for values of K greater than two. Now let’s use the elbow method on the following data set with three clusters:

[Figure: scatter plot of a data set with three clusters]
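The code that generated this data set is not shown; one plausible way to produce a similar three-cluster data set, continuing the session above and re-running the same loop, would be:

>>> cluster1 = np.random.uniform(0.5, 1.5, (2, 10))
>>> cluster2 = np.random.uniform(3.5, 4.5, (2, 10))
>>> cluster3 = np.random.uniform(6.5, 7.5, (2, 10))
>>> X = np.hstack((cluster1, cluster2, cluster3)).T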

The following is the elbow plot for the data set. From this we can see that the rate of improvement to the average distortion declines the most when adding a fourth cluster. That is, the elbow method confirms that K should be set to three for this data set.

[Figure: elbow plot for the three-cluster data set]

Summary

In this article, we explained what clustering is and worked through the K-Means algorithm, both by hand and with scikit-learn. We also saw how unlucky random initializations can lead K-Means to a local optimum, and how the elbow method can be used to estimate an appropriate number of clusters.
