
Clustering is one of the most important data mining and machine learning techniques: a procedure for discovering groups of closely related elements in a dataset. We often want to cluster data into meaningful categories, such as grouping similar users, modeling user behavior, identifying species of irises, categorizing news items, classifying textual documents, and more. One of the most common clustering methods is K-Means, a simple iterative method that partitions the data into K clusters.

Algorithm

Before we can apply K-Means, the data must be expressed as vectors. In most cases the data comes as a matrix of shape [nSamples, nAttributes], which can be viewed as nSamples vectors, each of dimension nAttributes.

In some cases a bit of work is needed to translate the data into this linear-algebraic form, for example:

A corpus of textual documents – we can represent each document as a term-frequency vector whose dimension is the size of the corpus vocabulary; the coefficient for each dimension is the number of times the corresponding word appears in the document.

document1 = (freq(word_1), freq(word_2), ..., freq(word_n))

There are other ways of creating vectors from text documents, such as TF-IDF vectors, binary vectors, and more.
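As a rough illustration, here is a minimal sketch in base R that builds term-frequency vectors for a tiny made-up corpus (the documents and words below are invented for this example):

           # Toy corpus: the documents here are made up for illustration
           corpus <- c(doc1 = "red wine and white wine",
                       doc2 = "white grapes make white wine")

           # Vocabulary of the corpus
           vocab <- unique(unlist(strsplit(corpus, " ")))

           # One row per document, one column per word; entries are word frequencies
           tf <- t(sapply(strsplit(corpus, " "),
                          function(words) table(factor(words, levels = vocab))))
           tf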

If I’m trying to cluster my Twitter friends, I can represent each friend as a vector:

number of followers, number of friends, number of tweets, count of favorite tweets
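For instance, one friend could be encoded as a short numeric vector (the numbers below are hypothetical):

           # A single friend as a feature vector (made-up numbers)
           friend <- c(followers = 1500, friends = 320, tweets = 4200, favorites = 87)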

After we have vectors representing data points, we will cluster these data vectors into K clusters using the following algorithm.

  • Initialize the procedure by randomly selecting K vectors as cluster centroids.
  • For each vector compute its Euclidean distance with each of the centroids and assign the vector to its closest centroid.
  • When all of the objects have been assigned, recalculate the centroids as the mean (average) of all members of the cluster.
  • Repeat the previous two steps until convergence – when the clusters no longer change.

You can also choose other distance measures, such as cosine similarity, Pearson correlation, or Manhattan distance.
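To make the steps above concrete, here is a bare-bones sketch of the algorithm in base R. It is for illustration only (for instance, it ignores the edge case of a cluster becoming empty); later in the post we use R's built-in kmeans() function instead.

           simple_kmeans <- function(X, k, max_iter = 100) {
               X <- as.matrix(X)
               # 1. Initialize: randomly pick k data points as the initial centroids
               centroids <- X[sample(nrow(X), k), , drop = FALSE]
               assignment <- integer(nrow(X))
               for (iter in 1:max_iter) {
                   # 2. Squared Euclidean distance from every point to every centroid
                   dists <- sapply(1:k, function(j) colSums((t(X) - centroids[j, ])^2))
                   # Assign each point to its nearest centroid
                   new_assignment <- max.col(-dists)
                   # 4. Converged when the assignments no longer change
                   if (all(new_assignment == assignment)) break
                   assignment <- new_assignment
                   # 3. Recompute each centroid as the mean of its members
                   for (j in 1:k)
                       centroids[j, ] <- colMeans(X[assignment == j, , drop = FALSE])
               }
               list(cluster = assignment, centers = centroids)
           }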

Example

We’ll do a cluster analysis of the wine dataset. This data contains 13 chemical measurements on 178 Italian wine samples and comes from the UCI Machine Learning Repository. I’ll use R for this analysis, but it can very easily be done in other programming languages such as Python.

We’ll use the R package rattle, which is a GUI for data mining in R. Here we use rattle only to access the wine data from the UCI Machine Learning Repository.

           # install rattle
           install.packages("rattle")
           library(rattle)
           
           # load data
           data(wine)
           
           # what does the data look like
           head(wine)
             Type Alcohol Malic  Ash Alcalinity Magnesium Phenols Flavanoids Nonflavanoids Proanthocyanins Color  Hue Dilution Proline
           1    1   14.23  1.71 2.43       15.6       127    2.80       3.06          0.28            2.29  5.64 1.04     3.92    1065
           2    1   13.20  1.78 2.14       11.2       100    2.65       2.76          0.26            1.28  4.38 1.05     3.40    1050
           3    1   13.16  2.36 2.67       18.6       101    2.80       3.24          0.30            2.81  5.68 1.03     3.17    1185
           4    1   14.37  1.95 2.50       16.8       113    3.85       3.49          0.24            2.18  7.80 0.86     3.45    1480
           5    1   13.24  2.59 2.87       21.0       118    2.80       2.69          0.39            1.82  4.32 1.04     2.93     735
           6    1   14.20  1.76 2.45       15.2       112    3.27       3.39          0.34            1.97  6.75 1.05     2.85    1450

The first column contains the type of each wine. We will remove it before clustering, and later use it to check how well K-Means recovers the types:

           # Remove the first column (Type) and standardize the remaining columns
           input <- scale(wine[-1])

A drawback of K-Means clustering is that we have to decide the number of clusters in advance. So first we’ll define a function that helps choose the number of clusters by plotting the within-groups sum of squares for different numbers of clusters.

           wssplot <- function(data, nc=15, seed=1234){
               # Within-groups sum of squares for a single cluster (k = 1)
               wss <- (nrow(data)-1)*sum(apply(data, 2, var))
               # Within-groups sum of squares for k = 2, ..., nc clusters
               for (i in 2:nc){
                   set.seed(seed)
                   wss[i] <- sum(kmeans(data, centers=i)$withinss)
               }
               plot(1:nc, wss, type="b", xlab="Number of Clusters",
                    ylab="Within groups sum of squares")
           }

Now we’ll use the wssplot function defined above to choose the number of clusters:

pdf("Number of Clusters.pdf")
           wssplot(input)
           dev.off()

[Figure: within-groups sum of squares vs. number of clusters]

The plot shows the within-groups sum of squares against the number of clusters extracted. The sharp decrease from 1 to 3 clusters (with only small decreases afterwards) suggests a 3-cluster solution, so we take the number of clusters to be 3.

Now we will cluster the data into 3 clusters using the kmeans() function in R.

           set.seed(1234)
           # Cluster the data into 3 clusters, using 25 random starts
           fit <- kmeans(input, 3, nstart=25)

Let’s plot the clusters:

           # Plot the clusters and mark the cluster centers
           require(graphics)
           plot(input, col=fit$cluster)
           points(fit$centers, col=1:3, pch=8, cex=2)

[Figure: wine samples colored by cluster assignment, with cluster centers marked]

We can also visualize the clusters with the R package ggplot2, plotting Alcohol against Malic and marking the cluster centers:

           # ggplot2 visual: Alcohol vs Malic, colored by cluster, with cluster centers marked
           df <- data.frame(input)
           df$cluster <- factor(fit$cluster)
           centers <- as.data.frame(fit$centers)
           require(ggplot2)
           ggplot(data=df, aes(x=Alcohol, y=Malic, color=cluster)) +
               geom_point() +
               geom_point(data=centers, aes(x=Alcohol, y=Malic),
                          color="black", shape=8, size=4, inherit.aes=FALSE)

The sizes of the 3 clusters can be computed as:

           # Size of the clusters
           size <- fit$size
           size
           [1] 62 65 51

Thus we have three clusters of wines of size 62, 65 and 51.

The means of the columns (chemicals) for each cluster can be computed using the aggregate function.

           # Means of the columns (chemicals) for each cluster
           mean_columns <- aggregate(input, by=list(fit$cluster), FUN=mean)
           mean_columns

Let’s now measure how good this clustering is. K-Means can be used as a predictive model that assigns new data points to one of the 3 clusters, so first we should check how well the cluster assignments match the known types on the training set.

A useful tool for this evaluation is cross tabulation: a table comparing the cluster assigned to each sample with its original type.

           # Measuring how good the clustering is
           # Cross tabulation: compare the cluster assignments with the original types

           cross <- table(wine$Type, fit$cluster)
           cross

                1  2  3
             1 59  0  0
             2  3 65  3
             3  0  0 48

This shows that the clustering gives a pretty good prediction: only 6 of the 178 samples (all of type 2) land outside the cluster dominated by their type.
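If you want a single number for this agreement, one option is the adjusted Rand index, which compares two partitions and is 1 for perfect agreement and close to 0 for random labelings. A minimal sketch, assuming the flexclust package is installed:

           # Adjusted Rand index between the wine types and the cluster assignments
           # install.packages("flexclust")
           library(flexclust)
           randIndex(cross)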

We can also extract the fitted cluster center for each observation and compute the residuals:

           # Fitted values: the cluster center assigned to each observation
           fit_centers <- fitted(fit)
           # Residuals: each observation minus its cluster center
           residue <- input - fitted(fit)
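As a quick sanity check, the squared residuals should add up (within floating-point error) to the total within-cluster sum of squares reported by kmeans():

           # Total squared residuals vs. total within-cluster sum of squares
           sum(residue^2)
           fit$tot.withinss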


Finally, assign to each observation its cluster and write the result to a CSV file:

           # Attach the cluster assignment to each observation and save as CSV
           mydata <- data.frame(input, fit$cluster)
           write.table(mydata, file="clustered_observations.csv", sep=",",
                       row.names=FALSE, col.names=TRUE, quote=FALSE)

The full code is available here.

Further Reading

Want more Machine Learning tutorials and content? Visit our dedicated Machine Learning page here.

About the Author

Janu Verma is a Quantitative Researcher at the Buckler Lab, Cornell University, where he works on problems in bioinformatics and genomics. His background is in mathematics and machine learning and he leverages tools from these areas to answer questions in biology.
