
In this article by Atul Tripathi, author of the book Machine Learning Cookbook, we will cover hierarchical clustering with a World Bank sample dataset.


Introduction

Hierarchical clustering is one of the most important methods in unsupervised learning. For a given set of data points, hierarchical clustering produces its output in the form of a binary tree (dendrogram), in which the leaves represent the data points and the internal nodes represent nested clusters of various sizes. In the agglomerative procedure, each object is first assigned to its own cluster, and a pairwise distance matrix is constructed between all clusters. The pair of clusters with the shortest distance is identified, removed from the matrix, and merged together. The distance between the merged cluster and each of the remaining clusters is then evaluated, and the distance matrix is updated. The process repeats until the distance matrix is reduced to a single element.
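The merge loop described above can be sketched in a few lines of code. The following is a minimal illustration in Python using scipy (an assumption of this sketch; the recipe itself uses R): `linkage()` builds exactly this kind of binary merge tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five 2-D points; each starts out as its own cluster.
points = np.array([[0.0, 0.0], [0.1, 0.2],
                   [4.0, 4.1], [4.2, 4.0],
                   [9.0, 9.0]])

# linkage() repeatedly merges the two closest clusters until one remains.
# Each row of the result records one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster]
tree = linkage(points, method="single", metric="euclidean")

print(tree)
print(tree.shape)  # n points produce n - 1 merges
```

The last row of `tree` describes the final merge, whose new cluster contains all five points.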

Hierarchical clustering – World Bank sample dataset

One of the main goals behind establishing the World Bank has been to fight and eliminate poverty. Continuously evolving and fine-tuning its policies in an ever-changing world has helped the institution pursue this goal. Success in eliminating poverty is measured by improvement in health, education, sanitation, infrastructure, and the other services needed to improve the lives of the poor. These development gains must be pursued in an environmentally, socially, and economically sustainable manner.

Getting ready

In order to perform hierarchical clustering, we shall use a sample dataset collected from the World Bank.

Step 1 – collecting and describing data

The dataset titled WBClust2013 will be used. It is available in CSV format as WBClust2013.csv and is in standard format, with 80 rows of data and 14 variables. The numeric variables are:

  • new.forest
  • Rural
  • log.CO2
  • log.GNI
  • log.Energy.2011
  • LifeExp
  • Fertility
  • InfMort
  • log.Exports
  • log.Imports
  • CellPhone
  • RuralWater
  • Pop

The non-numeric variables are:

  • Country

How to do it

Step 2 – exploring data

Version info: Code for this page was tested in R version 3.2.3 (2015-12-10)

Let’s explore the data and understand the relationships among the variables. We’ll begin by importing the CSV file named WBClust2013.csv. We will be saving the data to the wbclust data frame:

> wbclust=read.csv("d:/WBClust2013.csv",header=T)

Next, we shall print the first few rows of the wbclust data frame. The head() function returns the first part of an object; the wbclust data frame is passed as an input parameter:

> head(wbclust)

The results are as follows:


Step 3 – transforming data

Centering variables and creating z-scores are two common ways to standardize data. We need to create z-scores for the numeric variables listed above. The scale() function is a generic function whose default method centers and/or scales the columns of a numeric matrix. The wbclust data frame is passed to the scale() function, and only the numeric fields are considered. The result is then stored in another data frame, wbnorm.

> wbnorm<-scale(wbclust[,2:13])
> wbnorm

The results are as follows:

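What scale() does here, subtracting each column's mean and dividing by its standard deviation, can be sketched in Python with NumPy (shown only to illustrate the arithmetic; the recipe itself stays in R):

```python
import numpy as np

# A tiny numeric matrix standing in for the numeric columns of the dataset.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# z-score each column, like R's scale(): (x - mean) / sd.
# R uses the sample standard deviation, hence ddof=1.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

print(z.mean(axis=0))           # ~0 per column after centering
print(z.std(axis=0, ddof=1))    # 1 per column after scaling
```

After standardization every column has mean 0 and standard deviation 1, so no single variable dominates the distance calculations that follow.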

All data frames have a row names attribute. The rownames() function is used to retrieve or set the row or column names of a matrix-like object. The first column of the wbclust data frame (the country names) is passed to the rownames() function to set the row names of wbnorm:

> rownames(wbnorm)=wbclust[,1]
> rownames(wbnorm)

The call to rownames(wbnorm) displays the values from the first column. The results are as follows:


Step 4 – training and evaluating the model performance

The next step is to train the model. First, the distance matrix is calculated using the dist() function, which computes the distances between the rows of a data matrix using a specified distance measure: Euclidean, maximum, Manhattan, Canberra, binary, or Minkowski. Here the Euclidean measure is used, which calculates the distance between two vectors as sqrt(sum((x_i – y_i)^2)). The result is then stored in a new object, dist1.

> dist1<-dist(wbnorm, method="euclidean")
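The Euclidean formula quoted above can be checked on a toy pair of vectors (a Python sketch for illustration only; the recipe itself uses R's dist()):

```python
import math

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

# Euclidean distance: sqrt(sum((x_i - y_i)^2))
d = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(d)  # sqrt(9 + 16 + 0) = 5.0
```

dist() applies this formula to every pair of rows in wbnorm, producing the lower triangle of the 80-by-80 distance matrix.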

The next step is to perform clustering using Ward's method. The hclust() function performs cluster analysis on a set of dissimilarities between the n objects being clustered. Initially, each object is assigned to its own cluster; the algorithm then iterates, at each stage joining the two most similar clusters, and continues until a single cluster remains. The hclust() function requires the data in the form of a distance matrix, so dist1 is passed. By default, the complete linkage method is used, but multiple agglomeration methods are available, including ward.D, ward.D2, single, complete, and average; here we use ward.D.

> clust1<-hclust(dist1,method="ward.D")
> clust1

Calling clust1 displays the agglomeration method used, the manner in which the distance was calculated, and the number of objects. The results are as follows:


Step 5 – plotting the model

The plot() function is a generic function for plotting R objects. Here, it is used to draw the dendrogram:

> plot(clust1,labels= wbclust$Country, cex=0.7, xlab="",ylab="Distance",main="Clustering for 80 Most Populous Countries")

The result is as follows:


The rect.hclust() function highlights the clusters by drawing rectangles around the branches of the dendrogram. The dendrogram is first cut at a certain level, and a rectangle is then drawn around each of the selected branches.

The object clust1 is passed to the function along with the number of clusters to be formed:

> rect.hclust(clust1,k=5)

The result is as follows:


The cutree() function cuts the tree into multiple groups on the basis of either the desired number of groups or the cut height. Here, clust1 is passed to the function along with the desired number of groups:

> cuts=cutree(clust1,k=5)
> cuts

The result is as follows:

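The behaviour of cutree(), slicing the merge tree into a fixed number of groups, can be mimicked with scipy's fcluster (again just an illustrative Python sketch, not part of the R recipe):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 1-D observations forming two tight pairs and one outlier.
points = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])

# Build the tree with Ward's method, as in the recipe.
tree = linkage(points, method="ward")

# Cut the tree so that exactly 3 groups remain,
# analogous to cutree(clust1, k=3) in R.
labels = fcluster(tree, t=3, criterion="maxclust")

print(labels)
```

Each observation receives a group label; the two tight pairs land in their own groups and the outlier forms a third, which is exactly the grouping cutree() reports for the countries.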

Finally, we get the list of countries in each group.

The result is as follows:


Summary

In this article, we covered hierarchical clustering by collecting the data, exploring its contents, and transforming it. We trained and evaluated the model using a distance matrix, and finally plotted the results as a dendrogram.
