*[box type=”note” align=”” class=”” width=””]This is an excerpt from a book by John R. Hubbard titled **Java Data Analysis.** In this article, we see the four popular clustering algorithms: hierarchical clustering, k-means clustering, k-medoids clustering, and the affinity propagation algorithms along with their pseudo-codes.[/box]*

A clustering algorithm is one that identifies groups of data points according to their proximity to each other. These algorithms are similar to classification algorithms in that they also partition a dataset into subsets of similar points. But, in classification, we already have data whose classes have been identified. such as sweet fruit. In clustering, we seek to discover the unknown groups themselves.

## Hierarchical clustering

Of the several clustering algorithms that we will examine in this article, hierarchical clustering is probably the simplest. The trade-off is that it works well only with small datasets in Euclidean space.

The general setup is that we have a dataset *S* of *m* points in Rn which we want to partition into a given number *k* of clusters *C**1** , C**2** ,…, C**k* , where within each cluster the points are relatively close together.

Here is the algorithm:

- Create a singleton cluster for each of the
*m*data points. - Repeat
*m – k*times:

- Find the two clusters whose centroids are closest
- Replace those two clusters with a new cluster that contains their points

The centroid of a cluster is the point whose coordinates are the averages of the corresponding coordinates of the cluster points. For example, the centroid of the cluster C = {(2, 4), (3, 5), (6, 6), (9, 1)} is the point **(5, 4)**, because (2 + 3 + 6 + 9)/4 = 5 and (4 + 5 + 6 + 1)/4 = 4. This is illustrated in the figure below :

## K-means clustering

A popular alternative to hierarchical clustering is the K-means algorithm. It is related to the **K-Nearest Neighbor (KNN)** classification algorithm. As with hierarchical clustering, the K-means clustering algorithm requires the number of clusters, *k*, as input. (This version is also called the **K-Means++** algorithm)

Here is the algorithm:

- Select
*k*points from the dataset. - Create
*k*clusters, each with one of the initial points as its centroid. - For each dataset point
*x*that is not already a centroid:

- Find the centroid
*y*that is closest to*x* - Add
*x*to that centroid’s cluster - Re-compute the centroid for that cluster

It also requires *k* points, one for each cluster, to initialize the algorithm. These initial points can be selected at random, or by some a priori method. One approach is to run hierarchical clustering on a small sample taken from the given dataset and then pick the centroids of those resulting clusters.

## K-medoids clustering

The k-medoids clustering algorithm is similar to the k-means algorithm, except that each cluster center, called its medoid, is one of the data points instead of being the mean of its points. The idea is to minimize the average distances from the medoids to points in their clusters. The Manhattan metric is usually used for these distances. Since those averages will be minimal if and only if the distances are, the algorithm is reduced to minimizing the sum of all distances from the points to their medoids. This sum is called the **cost of the configuration**.

Here is the algorithm:

- Select
*k*points from the dataset to be medoids. - Assign each data point to its closest medoid. This defines the
*k*clusters. - For each cluster
*C**j*:

- Compute the sum
*s*= ∑ j*s*j , where each*s*j = ∑{*d*(**x**,**y****j**) :**x**∈*C**j*} , and change the medoid*y**j*to whatever point in the cluster Cj that minimizes s - If the medoid
*y**j*was changed, re-assign each*x*to the cluster whose medoid is closest

- Repeat step 3 until
*s*is minimal.

This is illustrated by the simple example in Figure 8.16. It shows 10 data points in 2 clusters. The two medoids are shown as filled points. In the initial configuration it is:

*C**1* = {(1,1),(2,1),(3,2) (4,2),(2,3)}, with **y****1** = **x****1** = (1,1)

*C**2* = {(4,3),(5,3),(2,4) (4,4),(3,5)}, with **y****2** = **x****10** = (3,5)

The sums are

*s*1 = *d *(**x****2****,y****1**) + *d *(**x****3****,y****1**) + *d *(**x****4****,y****1**) + *d *(**x****5****,y****1**) = 1 + 3 + 4 + 3 = 11

*s*2 = *d *(**x****6****,y****1**) + *d *(**x****7****,y****1**) + *d *(**x****8****,y****1**) + *d *(**x****9****,y****1**) = 3 + 4 + 2 + 2 = 11

*s = s*1 + *s*2 * *= 11 + 11 = 22

The algorithm at *step 3* first part changes the medoid for *C**1* to* y**1** = x**3** = (3,2)*. This causes the clusters to change, at *step 3* second part, to:

*C**1* = {(1,1),(2,1),(3,2) (4,2),(2,3),(4,3),(5,3)}, with **y****1** = **x****3** = (3,2)

*C**2* = {(2,4),(4,4),(3,5)}, with **y****2** = **x****10** = (3,5)

This makes the sums:

*s*1 = 3 + 2 + 1 + 2 + 2 + 3 = 13

*s*2 = 2 + 2 = 4

*s = s*1 + *s*2 * *= 13 + 4 = 17

The resulting configuration is shown in the second panel of the figure below:

At *step 3* of the algorithm, the process repeats for cluster *C*2. The resulting configuration is shown in the third panel of the above figure. The computations are:

*C**1* = {(1,1),(2,1),(3,2) (4,2),(4,3),(5,3)}, with **y****1** = **x****3** = (3,2)

*C**2* = {(2,3),(2,4),(4,4),(3,5)}, with **y****2** = **x****8** = (2,4)

*s = s*1 + *s*2 * *= (3 + 2 + 1 + 2 + 3) + (1 + 2 + 2) = 11 + 5 = 16

The algorithm continues with two more changes, finally converging to the minimal configuration shown in the fifth panel of the above figure. This version of k-medoid clustering is also called **partitioning around medoids (PAM)**.

## Affinity propagation clustering

One disadvantage of each of the clustering algorithms previously presented (hierarchical, k-means, k-medoids) is the requirement that the number of clusters k be determined in advance. The affinity propagation clustering algorithm does not have that requirement. Developed in 2007 by Brendan J. Frey and Delbert Dueck at the University of Toronto, it has become one of the most widely-used clustering methods.

Like k-medoid clustering, affinity propagation selects cluster center points, called **exemplars**, from the dataset to represent the clusters. This is done by **message-passing** between the data points.

The algorithm works with three two-dimensional arrays:

*s**ij* = the similarity between **x****i** and **x****j**

*r**ik* = responsibility: message from **x****i** to **x****k** on how well-suited **x****k** is as an exemplar for** x****i**

*a**ik* = availability: message from **x****k** to **x****i** on how well-suited **x****k** is as an exemplar for **x****i**

Here is the complete algorithm:

- Initialize the similarities:

*sij = –d(***x****i**,**x****j**)2 , for i ≠ j;- s
*ii*= the average of those other s*ij*values

2. Repeat until convergence:

- Update the responsibilities:

*r**ik* = s*ik* − max {*a**ij* + s *ij * : j ≠ k}

- Update the availabilities:

*a**ik* = min {0, *r**kk* + ∑j { max {0, *r**jk* } : *j ≠ i* ∧* j ≠ k* }}, for* i ≠ k*;

*a**kk* = ∑j { max {0, *r**jk* } : *j ≠ k* }

A point **x****k** will be an exemplar for a point **x****i** if *a**ik* + *r**ik* = maxj {*a**ij* + *r**ij*}.

*If you enjoyed this excerpt from the book **Java Data Analysis** by John R. Hubbard, check out the book to learn how to implement various machine learning algorithms, data visualization and more in Java.*