In this article by Magnus Vilhelm Persson, author of the book Mastering Python Data Analysis, we look at data that comprises several separate distributions and ask how we can find and characterize them. We will cover some ways to identify clusters in data. Groups of points with similar characteristics form clusters. There are many different algorithms and methods to achieve this, each with strengths and weaknesses. We want to detect multiple separate distributions in the data; for each point, we determine the degree of association (or similarity) with another point or cluster. The degree of association needs to be high if they belong in a cluster together and low if they do not. This can, just as before, be a one-dimensional or a multidimensional problem. One of the inherent difficulties of cluster finding is determining how many clusters there are in the data. Various approaches exist: in some, the user inputs the number of clusters and the algorithm decides which points belong to which cluster; in others, the starting assumption is that every point is a cluster, and nearby clusters are then combined iteratively, on a trial basis, to see whether they belong together.

In this article, we will cover the following topics:

  • A short introduction to cluster finding, reminding you of the general problem and an algorithm to solve it
  • Analysis of a dataset in the context of cluster finding: the cholera outbreak in central London in 1854
  • A simple zeroth-order analysis, calculating the centroid of the whole dataset
  • Finding the closest water pump for each recorded cholera-related death
  • Applying the K-means algorithm for cluster finding to the data and identifying two separate distributions

The algorithms and methods covered here are focused on those available in SciPy.

Start a new Notebook and put in the default imports. You may want to switch to interactive Notebook plotting to experiment a bit more. For this article, we add the following specific imports. The ones related to clustering are from SciPy, while later on we will need some packages to transform astronomical coordinates. These packages are all preinstalled in the Anaconda Python 3 distribution and have been tested there:

import scipy.cluster.hierarchy as hac
import scipy.cluster.vq as vq
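
The "default imports" mentioned above are not repeated in this excerpt; a minimal set that covers everything the later code assumes (the pd, np, and plt aliases match the usage throughout) is:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt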

Introduction to cluster finding

There are many different algorithms for cluster identification. Many of them try to solve a specific problem in the best way. Therefore, the specific algorithm that you want to use might depend on the problem you are trying to solve and also on what algorithms are available in the specific package that you are using.

Some of the first clustering algorithms simply found the centroid positions that minimize the distances to all the points in each cluster; the points in each cluster are closer to that centroid than to any other cluster's centroid. As might be obvious at this point, the hardest part is figuring out how many clusters there are. Once this is determined, it is fairly straightforward to try various ways of moving the cluster centroids around, calculate the distance from each centroid to each point, and then figure out where the cluster centroids should be. There are also obvious situations where this might not be the best solution, for example, if you have two very elongated clusters next to each other.

Commonly, the distance is the Euclidean distance:

$$d(p, \mu_i) = \lVert p - \mu_i \rVert = \sqrt{\sum_{j} \left( p_j - \mu_{i,j} \right)^2}$$

Here, p is one of the points {p1, p2, ..., pN-1, pN} belonging to cluster Ci, that is p ∈ Ci, and the distance is calculated from the cluster centroid, μi. We have to find the cluster centroids that minimize the sum of the absolute distances to the points:

$$\underset{\mu_1, \ldots, \mu_k}{\operatorname{arg\,min}} \; \sum_{i=1}^{k} \sum_{p \in C_i} \lVert p - \mu_i \rVert$$

In this first example, we shall first work with fixed cluster centroids.
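
To make the notation concrete, here is how the quantity being minimized can be computed for fixed centroids, sketched in a few lines of NumPy with made-up toy data (not part of the original example):

points = np.array([[1.0, 2.0], [1.5, 1.8], [8.0, 8.0], [9.0, 11.0]])
centroids = np.array([[1.0, 2.0], [8.5, 9.5]])
# distance from every point to every centroid, shape (n_points, n_centroids)
dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)                # index of the closest centroid
total = dists[np.arange(len(points)), labels].sum()   # the sum we minimize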

Starting out simple – John Snow on Cholera

In 1854, there was an outbreak of cholera in the Soho district of central London, in the neighborhood around Broad Street. The leading theories at the time claimed that cholera spread the way the plague was believed to spread: through foul, bad air. John Snow, a physician, hypothesized that cholera spread through drinking water. During the outbreak, he tracked the deaths and drew them on a map of the area. Through his analysis, he concluded that most of the cases were centered on the Broad Street water pump. Rumor has it that he then removed the handle of the water pump, thus stopping the epidemic. Today, we know that cholera is usually transmitted through contaminated food or water, confirming John Snow's hypothesis. We will do a short but instructive reanalysis of his data.

The data comes from the public data archives of The National Center for Geographic Information and Analysis (http://www.ncgia.ucsb.edu/ and http://www.ncgia.ucsb.edu/pubs/data.php). A cleaned-up map and copy of the data files, along with an example of geospatial analysis of the data, can also be found at https://www.udel.edu/johnmack/frec682/cholera/cholera2.html. A wealth of information about physician and scientist John Snow's life and works can be found at http://johnsnow.matrix.msu.edu.

To start the analysis, we read the data into a Pandas DataFrame; the data is already formatted in CSV files readable by Pandas:

deaths = pd.read_csv('data/cholera_deaths.txt')
pumps = pd.read_csv('data/cholera_pumps.txt')

Each file contains two columns, one for X coordinates and one for Y coordinates. Let’s check what it looks like:

deaths.head()

[Output: the first few rows of the deaths DataFrame, with X and Y coordinate columns]

pumps.head()

[Output: the first few rows of the pumps DataFrame, with X and Y coordinate columns]

With this information, we can now plot all the pumps and deaths to visualize the data:

plt.figure(figsize=(4,3.5)) 
plt.plot(deaths['X'], deaths['Y'], 
         marker='o', lw=0, mew=1, mec='0.9', ms=6)
plt.plot(pumps['X'],pumps['Y'], 
         marker='s', lw=0, mew=1, mec='0.9', color='k', ms=6)
plt.axis('equal')
plt.xlim((4.0,22.0));
plt.xlabel('X-coordinate')
plt.ylabel('Y-coordinate')
plt.title("John Snow's Cholera")

[Figure: scatter plot of the cholera deaths (circles) and water pumps (squares)]

It is fairly easy to see that the pump in the middle is important. As a first data exploration, we will simply calculate the mean centroid of the distribution and plot it in the figure, together with an ellipse showing the spread. We use the mean along the x and y axes as the centroid position and the standard deviations as the ellipse's width and height:

fig = plt.figure(figsize=(4,3.5)) 
ax = fig.add_subplot(111)
plt.plot(deaths['X'], deaths['Y'], 
         marker='o', lw=0, mew=1, mec='0.9', ms=6)
plt.plot(pumps['X'],pumps['Y'], 
         marker='s', lw=0, mew=1, mec='0.9', color='k', ms=6)

from matplotlib.patches import Ellipse
ellipse = Ellipse(xy=(deaths['X'].mean(), deaths['Y'].mean()), 
                  width=deaths['X'].std(), 
                  height=deaths['Y'].std(),
                  zorder=32, fc='None', ec='IndianRed', lw=2)
ax.add_artist(ellipse)
plt.plot(deaths['X'].mean(), deaths['Y'].mean(), 
         '.', ms=10, mec='IndianRed', zorder=32)
for i in pumps.index:
    plt.annotate(s='{0}'.format(i), xy=(pumps[['X','Y']].loc[i]), 
                 xytext=(-15,6), textcoords='offset points')
plt.axis('equal')
plt.xlim((4.0,22.5))
plt.xlabel('X-coordinate')
plt.ylabel('Y-coordinate')
plt.title("John Snow's Cholera")

[Figure: deaths and pumps with the mean centroid of the deaths marked and an ellipse indicating the standard deviation along each axis]

Here, we also plotted the pump index, which we can get from the DataFrame through pumps.index. The next step in the analysis is to see which pump is closest to each point. We do this by calculating the distances from all pumps to all points and then figuring out, for each point, which pump is the closest.

We save the closest pump for each point in a separate column of the deaths DataFrame. With this dataset, the for loop runs fairly quickly. However, the DataFrame subtract method chained with the sum() and idxmin() methods takes a few seconds; I strongly encourage you to play around with various ways to speed this up. We also use the .apply() method of the DataFrame to square and square-root the values. A simple brute-force first attempt at this took over a minute to run; the built-in functions and methods help a lot:

deaths_tmp = deaths[['X','Y']].as_matrix()
idx_arr = np.array([], dtype='int')
for i in range(len(deaths)):
    idx_arr = np.append(
        idx_arr,
        (pumps.subtract(deaths_tmp[i])
              .apply(lambda x: x**2)
              .sum(axis=1)
              .apply(lambda x: x**0.5)
              .idxmin()))
deaths['C'] = idx_arr
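
One possible speedup is to replace the loop with SciPy's cdist, which computes all pairwise distances in one call (a sketch; it assumes the pump index is the default 0, 1, 2, ... so that the positional argmin matches idxmin above):

from scipy.spatial.distance import cdist
# one row per death, one column per pump
dist = cdist(deaths[['X', 'Y']].values, pumps[['X', 'Y']].values)
deaths['C'] = dist.argmin(axis=1)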

Quickly check whether everything seems fine by printing out the top rows of the table:

deaths.head()

[Output: the first few rows of the deaths DataFrame, now with the closest-pump column C]

Now we want to visualize what we have. With colors, we can show which water pump each death is associated with. To do this, we use a colormap, in this case the jet colormap. Calling the colormap with a value between 0 and 1 returns a color; thus we give it the pump indexes divided by the total number of pumps, 12 in our case:

fig = plt.figure(figsize=(4,3.5))
ax = fig.add_subplot(111)
np.unique(deaths['C'].values)
plt.scatter(deaths['X'].as_matrix(), deaths['Y'].as_matrix(),
            color=plt.cm.jet(deaths['C']/12.), 
            marker='o', lw=0.5, edgecolors='0.5', s=20)
plt.plot(pumps['X'],pumps['Y'], 
         marker='s', lw=0, mew=1, mec='0.9', color='0.3', ms=6)
for i in pumps.index:
   plt.annotate(s='{0}'.format(i), xy=(pumps[['X','Y']].loc[i]), 
                xytext=(-15,6), textcoords='offset points', 
                ha='right')
ellipse = Ellipse(xy=(deaths['X'].mean(), deaths['Y'].mean()), 
                  width=deaths['X'].std(), 
                  height=deaths['Y'].std(),
                  zorder=32, fc='None', ec='IndianRed', lw=2)
ax.add_artist(ellipse)
plt.axis('equal')
plt.xlim((4.0,22.5))
plt.xlabel('X-coordinate')
plt.ylabel('Y-coordinate')
plt.title("John Snow's Cholera")

[Figure: deaths colored by their closest pump, with the pumps and their indexes annotated and the centroid ellipse overlaid]

The majority of deaths are closest to the pump in the center, which is located on Broad Street.

Now, remember that we have used fixed positions for the cluster centroids. In this case, we are basically working on the assumption that the water pumps are related to the cholera cases. Furthermore, the Euclidean distance is not really the real-life distance; people walk along roads to get water, and the roads are not necessarily straight. Thus, one would have to map out the streets and calculate the distance to each pump along them. Even so, already at this level it is clear that there is something about the center pump related to the cholera cases. How would you account for the difference in distance? To calculate it, you would do what is called cost analysis (compare what happens when you ask your sat-nav for directions to a place). There are many different ways of doing cost analysis, and it also relates to the problem of finding the correct way through a maze.

In addition, we do not have any data in the time domain; the outbreak might have started at the Broad Street pump and then spread to other nearby pumps over time. Without time data, it is difficult to figure out what happened.

This is the general approach to cluster finding. The coordinates might instead be attributes, the length and weight of dogs for example, and the cluster centroid locations something that we iteratively move around until we find the best positions.

K-means clustering

The K-means algorithm is also referred to as vector quantization. What the algorithm does is find the cluster (centroid) positions that minimize the distances to all points in the cluster. This is done iteratively; the problem with the algorithm is that it can be a bit greedy, meaning that it will quickly settle into the nearest local minimum. This is generally addressed with some kind of basin-hopping approach, where the minimum found is randomly perturbed and the algorithm restarted. Because of this, the algorithm depends on good initial guesses as input.
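
To make the iteration concrete, the core loop can be sketched in a few lines of NumPy (this is an illustration of the idea, not SciPy's implementation; the function name and the fixed iteration count are ours):

def kmeans_sketch(points, centroids, n_iter=100):
    centroids = centroids.astype(float).copy()
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :],
                               axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its members
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = points[labels == k].mean(axis=0)
    return centroids, labels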

Suicide rate versus GDP versus absolute latitude

We will analyze the data of suicide rate versus GDP versus absolute latitude, or Degrees From Equator (DFE), for clusters. Our hypothesis from the visual inspection was that there were at least two distinct clusters, one with higher suicide rate, GDP, and absolute latitude and one with lower. Here, the Hierarchical Data Format (HDF) file is read in as a DataFrame. This time, we want to discard all rows where one or more column entries are NaN or empty, so we use the appropriate DataFrame method for this:

TABLE_FILE = 'data/data_ch4.h5'
d2 = pd.read_hdf(TABLE_FILE)
d2 = d2.dropna()

Next, while the DataFrame is a very handy format, which we will utilize later on, the clustering algorithms in SciPy do not handle Pandas data types natively. Thus, we transfer the data to a NumPy array:

rates = d2[['DFE','GDP_CD','Both']].as_matrix().astype('float')

Next, to recap, we visualize the data with one histogram of the GDP and one scatter plot of all the data. We do this to aid us in the initial guesses of the cluster centroid positions:

plt.subplots(1, 2, figsize=(8,3.5))
plt.subplot(121)
plt.hist(rates.T[1], bins=20,color='SteelBlue')
plt.xticks(rotation=45, ha='right')
plt.yscale('log')
plt.xlabel('GDP')
plt.ylabel('Counts')
plt.subplot(122)
plt.scatter(rates.T[0], rates.T[2], 
            s=2e5*rates.T[1]/rates.T[1].max(),
            color='SteelBlue', edgecolors='0.3');
plt.xlabel("Absolute Latitude (Degrees; 'DFE')")
plt.ylabel("Suicide Rate (per 100 000)")
plt.subplots_adjust(wspace=0.25);

[Figure: histogram of GDP (left) and scatter plot of suicide rate versus absolute latitude with marker size proportional to GDP (right)]

The scatter plot shows the GDP as marker size. The function that runs the K-means clustering takes a special kind of normalized input: the data arrays (columns) have to be normalized by their standard deviation. Although this is straightforward to do yourself, there is a function included in the module, called whiten, that scales the data by the standard deviation:

w = vq.whiten(rates)

To show what it does to the data, we plot the same plots as we did previously, but with the output from the whiten function:

plt.subplots(1, 2, figsize=(8,3.5))
plt.subplot(121)
plt.hist(w[:,1], bins=20, color='SteelBlue')
plt.yscale('log')
plt.subplot(122)
plt.scatter(w.T[0], w.T[2], s=2e5*w.T[1]/w.T[1].max(), 
            color='SteelBlue', edgecolors='0.3')
plt.xticks(rotation=45, ha='right');

[Figure: the same two panels plotted using the whitened (standard-deviation scaled) data]

As you can see, all the data is scaled compared to the previous figure. However, as mentioned, the scaling is just the standard deviation, so let's calculate it and save it to the sc variable:

sc = rates.std(axis=0)
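
A quick check confirms that dividing each column by this scaling reproduces the whitened data (the comparison should print True, up to floating-point precision):

print(np.allclose(w, rates / sc))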

Now we are ready to estimate the initial guesses for the cluster centroids. Reading off the first plot of the data, we guess the first centroid to be at 20 DFE, 20,000 GDP, and a suicide rate of 10, and the second at 45 DFE, 100,000 GDP, and a rate of 15. We put this in an array and scale it with our scale parameter to the same scale as the output of the whiten function. This is then sent to the kmeans2 function of SciPy:

init_guess = np.array([[20,20E3,10],[45,100E3,15]])
init_guess /= sc
z2_cb, z2_lbl = vq.kmeans2(w, init_guess, minit='matrix', 
                           iter=500)

There is another function, kmeans (without the 2), which is a less complex version that does not stop iterating when it reaches a local minimum; instead, it stops when the change between two iterations falls below some level. Thus, the standard K-means algorithm is represented in SciPy by the kmeans2 function. The function outputs the centroids' scaled positions (here, z2_cb) and a lookup table (z2_lbl) telling us which row belongs to which centroid. To get the centroid positions in units we understand, we simply multiply by our scaling value:

z2_cb_sc = z2_cb * sc
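
Before plotting, it is worth a quick sanity check of the result, for example, counting how many rows ended up in each cluster and printing the centroids in physical units (the exact numbers depend on your data):

print(np.bincount(z2_lbl))   # number of rows assigned to each cluster
print(z2_cb_sc)              # centroid positions in DFE, GDP, rate units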

At this point, we can plot the results. The following section is rather long and contains many different parts so we will go through them section by section. However, the code should be run in one cell of the Notebook:

# K-means clustering figure START
plt.figure(figsize=(6,4))
plt.scatter(z2_cb_sc[0,0], z2_cb_sc[0,2], 
            s=5e2*z2_cb_sc[0,1]/rates.T[1].max(), 
            marker='+', color='k', 
            edgecolors='k', lw=2, zorder=10, alpha=0.7);
plt.scatter(z2_cb_sc[1,0], z2_cb_sc[1,2], 
            s=5e2*z2_cb_sc[1,1]/rates.T[1].max(), 
            marker='+', color='k', edgecolors='k', lw=3, 
            zorder=10, alpha=0.7);

The first steps are quite simple: we set up the figure size and plot the points of the cluster centroids. We hypothesized two clusters; thus, we plot them with two different calls to plt.scatter. Here, z2_cb_sc[1,0] gets the second cluster's x coordinate (DFE); changing the second index from 0 to 2 gives us the y coordinate (rate). We set the marker size, the s parameter, to scale with the value of the third data axis, the GDP. We also do this further down for the data, just as in previous plots, so that it is easier to compare and differentiate the clusters. The zorder keyword gives the depth ordering of the plotted elements; a high zorder will put them on top of everything else and a negative zorder will send them to the back.

s0 = (z2_lbl == 0)
s1 = (z2_lbl == 1)
pattern1 = 5*'x'
pattern2 = 4*'/'
plt.scatter(w.T[0][s0]*sc[0], 
            w.T[2][s0]*sc[2], 
            s=5e2*rates.T[1][s0]/rates.T[1].max(),
            lw=1,
            hatch=pattern1,
            edgecolors='0.3',
            color=plt.cm.Blues_r(
                rates.T[1][s0]/rates.T[1].max()));
plt.scatter(rates.T[0][s1],
            rates.T[2][s1], 
            s=5e2*rates.T[1][s1]/rates.T[1].max(),
            lw=1,
            hatch=pattern2,
            edgecolors='0.4',
            marker='s',
            color=plt.cm.Reds_r(
                rates.T[1][s1]/rates.T[1].max()+0.4))

In this section, we plot the points of the clusters. First, we get the selection (Boolean) arrays. They are simply found by setting all indexes that refer to cluster 0 to True and all others to False; this gives us the Boolean array for cluster 0 (the first cluster). The second Boolean array matches the second cluster (cluster 1). Next, we define the hatch patterns for the scatter plot markers, which we later give as input to the plotting function; the multiplier for the hatch pattern sets the density of the pattern. The scatter plots for the points are created in a similar fashion to the centroids, except that the markers are a bit more complex. They are color-coded, like in the previous example with the cholera deaths, but in a gradient instead of the exact same color for all points. The gradient is defined by the GDP, which also defines the size of the points. The x and y data sent to the plot differ between the clusters, but they access the same data in the end because we multiply by our scaling factor.

p1 = plt.scatter([],[], hatch='None', 
                 s=20E3*5e2/rates.T[1].max(), 
                 color='k', edgecolors='None',)
p2 = plt.scatter([],[], hatch='None',
                 s=40E3*5e2/rates.T[1].max(),  
                 color='k', edgecolors='None',)
p3 = plt.scatter([],[], hatch='None',
                 s=60E3*5e2/rates.T[1].max(), 
                 color='k', edgecolors='None',)
p4 = plt.scatter([],[], hatch='None',
                 s=80E3*5e2/rates.T[1].max(), 
                 color='k', edgecolors='None',)
labels = ["20'", "40'", "60'", ">80'"]
plt.legend([p1, p2, p3, p4], labels, ncol=1, 
           frameon=True, #fontsize=12,
           handlelength=1, loc=1, 
           borderpad=0.75,labelspacing=0.75,
           handletextpad=0.75, title='GDP', scatterpoints=1)
plt.ylim((-4,40))
plt.xlim((-4,80))
plt.title('K-means clustering')
plt.xlabel("Absolute Latitude (Degrees; 'DFE')")
plt.ylabel('Suicide Rate (per 100 000)');

The last tweak to the plot is the custom legend. We want to show different sizes of the points and what GDP they correspond to. As there is a continuous gradient from low to high, we cannot use the plotted points. Thus, we create our own, but leave the x and y input coordinates as empty lists. This will not show anything in the plot, but we can use the handles to register entries in the legend. The various keyword arguments to the legend function control different aspects of the legend layout; I encourage you to experiment with them to see what happens:

[Figure: K-means clustering of suicide rate versus absolute latitude, with GDP as marker size and the two cluster centroids marked with plus signs]

As for the final analysis, two different clusters are identified. In line with our earlier hypothesis, there is a cluster with a clear linear trend and relatively higher GDP, which is also located at higher absolute latitude. Although the separation is rather weak, it is clear that the two groups are distinct: countries with low GDP cluster closer to the equator. What happens when you add more clusters? Try to add a cluster for the low-DFE, high-rate countries, visualize it, and think about what this could mean for the conclusion(s).
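
As a starting point for that exercise, a three-cluster run only needs one more row in the initial-guess array; the third guess below (10 DFE, 20,000 GDP, and a rate of 25) is purely illustrative and not read off the data:

init_guess3 = np.array([[20, 20E3, 10], [45, 100E3, 15], [10, 20E3, 25]])
init_guess3 /= sc
z3_cb, z3_lbl = vq.kmeans2(w, init_guess3, minit='matrix', iter=500)
z3_cb_sc = z3_cb * sc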

Summary

In this article, we identified clusters using methods such as finding the centroid positions and K-means clustering.

For more information about Python Data Analysis, refer to the following books by Packt Publishing:

  • Python Data Analysis (https://www.packtpub.com/big-data-and-business-intelligence/python-data-analysis)
  • Getting Started with Python Data Analysis (https://www.packtpub.com/big-data-and-business-intelligence/getting-started-python-data-analysis)  
