7 min read

 In this article by Jayani Withanawasam, author of the book Apache Mahout Essentials, we will see the clustering technique in machine learning and its implementation using Apache Mahout.

The K-Means clustering algorithm is explained in detail with both Java and command-line examples (sequential and parallel executions), and other important clustering algorithms, such as Fuzzy K-Means, canopy clustering, and spectral K-Means are also explored.

In this article, we will cover the following topics:

  • Unsupervised learning and clustering
  • Applications of clustering
  • Types of clustering
  • K-Means clustering
  • K-Means clustering with MapReduce

(For more resources related to this topic, see here.)

Unsupervised learning and clustering

Information is a key driver for any type of organization. However, with the rapid growth in the volume of data, valuable information may be hidden and go unnoticed due to the lack of effective data processing and analyzing mechanisms.

Clustering is an unsupervised learning mechanism that can find the hidden patterns and structures in data by finding data points that are similar to each other. No prelabeling is required. So, you can organize data using clustering with little or no human intervention.

For example, let’s say you are given a collection of balls of different sizes without any category labels, such as big and small, attached to them; you should be able to categorize them using clustering by considering their attributes, such as radius and weight, for similarity.

We will learn how to use Apache Mahout to perform clustering using different algorithms.

Applications of clustering

Clustering has many applications in different domains, such as biology, business, and information retrieval.

Computer vision and image processing

Clustering techniques are widely used in the computer vision and image processing domain. Clustering is used for image segmentation in medical image processing for computer aided disease (CAD) diagnosis. One specific area is breast cancer detection.

In breast cancer detection, a mammogram is clustered into several parts for further analysis, as shown in the following image. The regions of interest for signs of breast cancer in the mammogram can be identified using the K-Means algorithm.

Image features such as pixels, colors, intensity, and texture are used during clustering:

Types of clustering

Clustering can be divided into different categories based on different criteria.

Hard clustering versus soft clustering

Clustering techniques can be divided into hard clustering and soft clustering based on the cluster’s membership.

In hard clustering, a given data point in n-dimensional space only belongs to one cluster. This is also known as exclusive clustering. The K-Means clustering mechanism is an example of hard clustering.

A given data point can belong to more than one cluster in soft clustering. This is also known as overlapping clustering. The Fuzzy K-Means algorithm is a good example of soft clustering. A visual representation of the difference between hard clustering and soft clustering is given in the following figure:

Flat clustering versus hierarchical clustering

In hierarchical clustering, a hierarchy of clusters is built using the top-down (divisive) or bottom-up (agglomerative) approach. This is more informative and accurate than flat clustering, which is a simple technique where no hierarchy is present. However, this comes at the cost of performance, as flat clustering is faster and more efficient than hierarchical clustering.

For example, let’s assume that you need to figure out T-shirt sizes for people of different sizes. Using hierarchal clustering, you can come up with sizes for small (s), medium (m), and large (l) first by analyzing a sample of the people in the population. Then, we can further categorize this as extra small (xs), small (s), medium, large (l), and extra large (xl) sizes.

Model-based clustering

In model-based clustering, data is modeled using a standard statistical model to work with different distributions. The idea is to find a model that best fits the data. The best-fit model is achieved by tuning up parameters to minimize loss on errors. Once the parameter values are set, probability membership can be calculated for new data points using the model. Model-based clustering gives a probability distribution over clusters.

K-Means clustering

K-Means clustering is a simple and fast clustering algorithm that has been widely adopted in many problem domains. We will give a detailed explanation of the K-Means algorithm, as it will provide the base for other algorithms. K-Means clustering assigns data points to k number of clusters (cluster centroids) by minimizing the distance from the data points to the cluster centroids.

Let’s consider a simple scenario where we need to cluster people based on their size (height and weight are the selected attributes) and different colors (clusters):

We can plot this problem in two-dimensional space, as shown in the following figure and solve it using the K-Means algorithm:

Getting your hands dirty!

Let’s move on to a real implementation of the K-Means algorithm using Apache Mahout. The following are the different ways in which you can run algorithms in Apache Mahout:

  • Sequential
  • MapReduce

You can execute the algorithms using a command line (by calling the correct bin/mahout subcommand) or using Java programming (calling the correct driver’s run method).

Running K-Means using Java programming

This example continues with the people-clustering scenario mentioned earlier.

The size (weight and height) distribution for this example has been plotted in two-dimensional space, as shown in the following image:

Data preparation

First, we need to represent the problem domain as numerical vectors.

The following table shows the size distribution of people mentioned in the previous scenario:

Weight (kg)

Height (cm)

22

80

25

75

28

85

55

150

50

145

53

153

Save the following content in a file named KmeansTest.data:

22 80
25 75
28 85
55 150
50 145
53 153

Understanding important parameters

Let’s take a look at the significance of some important parameters:

  • org.apache.hadoop.fs.Path: This denotes the path to a file or directory in the filesystem.
  • org.apache.hadoop.conf.Configuration: This provides access to Hadoop-related configuration parameters.
  • org.apache.mahout.common.distance.DistanceMeasure: This determines the distance between two points.
  • K: This denotes the number of clusters.
  • convergenceDelta: This is a double value that is used to determine whether the algorithm has converged.
  • maxIterations: This denotes the maximum number of iterations to run.
  • runClustering: If this is true, the clustering step is to be executed after the clusters have been determined.
  • runSequential: If this is true, the K-Means sequential implementation is to be used in order to process the input data.

The following code snippet shows the source code:

private static final String DIRECTORY_CONTAINING_CONVERTED_INPUT =
"Kmeansdata";
public static void main(String[] args) throws Exception {
// Path to output folder
Path output = new Path("Kmeansoutput");
// Hadoop configuration details
Configuration conf = new Configuration();
HadoopUtil.delete(conf, output);
run(conf, new Path("KmeansTest"), output, new
EuclideanDistanceMeasure(), 2, 0.5, 10);
}
public static void run(Configuration conf, Path input, Path
output, DistanceMeasure measure, int k,
double convergenceDelta, int maxIterations) throws Exception {
// Input should be given as sequence file format
Path directoryContainingConvertedInput = new Path(output,
DIRECTORY_CONTAINING_CONVERTED_INPUT);
InputDriver.runJob(input, directoryContainingConvertedInput,
"org.apache.mahout.math.RandomAccessSparseVector");
// Get initial clusters randomly
Path clusters = new Path(output, "random-seeds");
clusters = RandomSeedGenerator.buildRandom(conf,
directoryContainingConvertedInput, clusters, k, measure);
// Run K-Means with a given K
KMeansDriver.run(conf, directoryContainingConvertedInput,
clusters, output, convergenceDelta,
maxIterations, true, 0.0, false);
// run ClusterDumper to display result
Path outGlob = new Path(output, "clusters-*-final");
Path clusteredPoints = new Path(output,"clusteredPoints");
ClusterDumper clusterDumper = new ClusterDumper(outGlob,
clusteredPoints);
clusterDumper.printClusters(null);
}

Use the following code example in order to get a better (readable) outcome to analyze the data points and the centroids they are assigned to:

Reader reader = new SequenceFile.Reader(fs,new Path(output,
Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000"), conf);
IntWritable key = new IntWritable();
WeightedPropertyVectorWritable value = new
WeightedPropertyVectorWritable();
while (reader.next(key, value)) {
System.out.println("key: " + key.toString()+ " value: "+
value.toString());
}
reader.close();

After you run the algorithm, you will see the clustering output generated for each iteration and the final result in the filesystem (in the output directory you have specified; in this case, Kmeansoutput).

Summary

Clustering is an unsupervised learning mechanism that requires minimal human effort. Clustering has many applications in different areas, such as medical image processing, market segmentation, and information retrieval.

Clustering mechanisms can be divided into different types, such as hard, soft, flat, hierarchical, and model-based clustering based on different criteria.

Apache Mahout implements different clustering algorithms, which can be accessed sequentially or in parallel (using MapReduce).

The K-Means algorithm is a simple and fast algorithm that is widely applied. However, there are situations that the K-Means algorithm will not be able to cater to. For such scenarios, Apache Mahout has implemented other algorithms, such as canopy, Fuzzy K-Means, streaming, and spectral clustering.

Resources for Article:


Further resources on this subject:


LEAVE A REPLY

Please enter your comment!
Please enter your name here