
In this article by Atul Tripathi, author of the book Machine Learning Cookbook, we will cover hierarchical clustering with a World Bank sample dataset.


Introduction

Hierarchical clustering is one of the most important methods in unsupervised learning. For a given set of data points, hierarchical clustering produces its output in the form of a binary tree (dendrogram), in which the leaves represent the data points and the internal nodes represent nested clusters of various sizes. In the agglomerative procedure, each object is first assigned to its own cluster, and a pairwise distance matrix is constructed between all clusters. The pair of clusters with the shortest distance is identified, removed from the matrix, and merged together. The distance between the merged cluster and each of the remaining clusters is then evaluated, and the distance matrix is updated. The process repeats until the distance matrix is reduced to a single element.
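The merge loop described above can be sketched in a few lines of code. The following is a minimal illustration in Python using scipy (an assumption of this sketch; the recipe itself uses R): `linkage()` builds exactly this kind of binary merge tree.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Five 2-D points; each starts out as its own cluster.
points = np.array([[0.0, 0.0], [0.1, 0.2],
                   [4.0, 4.1], [4.2, 4.0],
                   [9.0, 9.0]])

# linkage() repeatedly merges the two closest clusters until one remains.
# Each row of the result records one merge:
# [cluster_a, cluster_b, merge_distance, size_of_new_cluster]
tree = linkage(points, method="single", metric="euclidean")

print(tree)
print(tree.shape)  # n points produce n - 1 merges
```

The last row of `tree` describes the final merge, whose new cluster contains all five points.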

Hierarchical clustering – World Bank sample dataset

One of the main goals behind establishing the World Bank has been to fight and eliminate poverty. Continuously evolving and fine-tuning its policies in an ever-changing world has helped the institution pursue this goal. Success in eliminating poverty is measured by improvement in health, education, sanitation, infrastructure, and the other services needed to improve the lives of the poor. These development gains must be pursued in an environmentally, socially, and economically sustainable manner.

Getting ready

In order to perform hierarchical clustering, we shall use a sample dataset collected from the World Bank.

Step 1 – collecting and describing data

The dataset titled WBClust2013 will be used. It is available in CSV format as WBClust2013.csv and is in standard format, with 80 rows of data and 14 variables. The numeric variables are:

  • new.forest
  • Rural
  • log.CO2
  • log.GNI
  • log.Energy.2011
  • LifeExp
  • Fertility
  • InfMort
  • log.Exports
  • log.Imports
  • CellPhone
  • RuralWater
  • Pop

The non-numeric variables are:

  • Country

How to do it

Step 2 – exploring data

Version info: Code for this page was tested in R version 3.2.3 (2015-12-10)

Let’s explore the data and understand the relationships among the variables. We’ll begin by importing the CSV file named WBClust2013.csv. We will be saving the data to the wbclust data frame:

> wbclust=read.csv("d:/WBClust2013.csv",header=T)

Next, we shall print the first few rows of the wbclust data frame. The head() function returns the first part of an object; the wbclust data frame is passed as an input parameter:

> head(wbclust)

The results are as follows:


Step 3 – transforming data

Centering variables and creating z-scores are two common ways to standardize data. We need to create z-scores for the numeric variables listed above. The scale() function is a generic function whose default method centers and/or scales the columns of a numeric matrix. The wbclust data frame is passed to the scale() function, and only the numeric fields are considered. The result is then stored in another data frame, wbnorm.

> wbnorm<-scale(wbclust[,2:13])
> wbnorm

The results are as follows:

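What scale() does here, subtracting each column's mean and dividing by its standard deviation, can be sketched in Python with NumPy (shown only to illustrate the arithmetic; the recipe itself stays in R):

```python
import numpy as np

# A tiny numeric matrix standing in for the numeric columns of the dataset.
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

# z-score each column, like R's scale(): (x - mean) / sd.
# R uses the sample standard deviation, hence ddof=1.
z = (data - data.mean(axis=0)) / data.std(axis=0, ddof=1)

print(z.mean(axis=0))           # ~0 per column after centering
print(z.std(axis=0, ddof=1))    # 1 per column after scaling
```

After standardization every column has mean 0 and standard deviation 1, so no single variable dominates the distance calculations that follow.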

All data frames have a row names attribute. The rownames() function is used to retrieve or set the row or column names of a matrix-like object. The first column of the wbclust data frame (the country names) is passed to the rownames() function to set the row names of wbnorm:

> rownames(wbnorm)=wbclust[,1]
> rownames(wbnorm)

The call to rownames(wbnorm) displays the values from the first column. The results are as follows:


Step 4 – training and evaluating the model performance

The next step is to train the model. First, the distance matrix is calculated using the dist() function, which computes the distances between the rows of a data matrix using a specified distance measure: Euclidean, maximum, Manhattan, Canberra, binary, or Minkowski. Here the Euclidean measure is used, which calculates the distance between two vectors as sqrt(sum((x_i – y_i)^2)). The result is then stored in a new object, dist1.

> dist1<-dist(wbnorm, method="euclidean")
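The Euclidean formula quoted above can be checked on a toy pair of vectors (a Python sketch for illustration only; the recipe itself uses R's dist()):

```python
import math

x = [1.0, 2.0, 3.0]
y = [4.0, 6.0, 3.0]

# Euclidean distance: sqrt(sum((x_i - y_i)^2))
d = math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(d)  # sqrt(9 + 16 + 0) = 5.0
```

dist() applies this formula to every pair of rows in wbnorm, producing the lower triangle of the 80-by-80 distance matrix.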

The next step is to perform clustering using Ward's method. The hclust() function performs cluster analysis on a set of dissimilarities between the n objects being clustered. Initially, each object is assigned to its own cluster; the algorithm then iterates, at each stage joining the two most similar clusters, and continues until a single cluster remains. The hclust() function requires the data in the form of a distance matrix, so dist1 is passed. By default, the complete linkage method is used, but multiple agglomeration methods are available, including ward.D, ward.D2, single, complete, and average; here we use ward.D.

> clust1<-hclust(dist1,method="ward.D")
> clust1

Calling clust1 displays the agglomeration method used, the manner in which the distance was calculated, and the number of objects. The results are as follows:


Step 5 – plotting the model

The plot() function is a generic function for plotting R objects. Here, it is used to draw the dendrogram:

> plot(clust1,labels= wbclust$Country, cex=0.7, xlab="",ylab="Distance",main="Clustering for 80 Most Populous Countries")

The result is as follows:


The rect.hclust() function highlights the clusters by drawing rectangles around the branches of the dendrogram. The dendrogram is first cut at a certain level, and a rectangle is then drawn around each of the selected branches.

The object clust1 is passed to the function along with the number of clusters to be formed:

> rect.hclust(clust1,k=5)

The result is as follows:


The cutree() function cuts the tree into multiple groups on the basis of either the desired number of groups or the cut height. Here, clust1 is passed to the function along with the desired number of groups:

> cuts=cutree(clust1,k=5)
> cuts

The result is as follows:

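The behaviour of cutree(), slicing the merge tree into a fixed number of groups, can be mimicked with scipy's fcluster (again just an illustrative Python sketch, not part of the R recipe):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Five 1-D observations forming two tight pairs and one outlier.
points = np.array([[0.0], [0.1], [5.0], [5.1], [10.0]])

# Build the tree with Ward's method, as in the recipe.
tree = linkage(points, method="ward")

# Cut the tree so that exactly 3 groups remain,
# analogous to cutree(clust1, k=3) in R.
labels = fcluster(tree, t=3, criterion="maxclust")

print(labels)
```

Each observation receives a group label; the two tight pairs land in their own groups and the outlier forms a third, which is exactly the grouping cutree() reports for the countries.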

Finally, we get the list of countries in each group.

The result is as follows:


Summary

In this article, we covered hierarchical clustering by collecting the data, exploring its contents, and transforming it. We trained and evaluated the model using a distance matrix, and finally plotted the results as a dendrogram.
