[box type="note" align="" class="" width=""]*This article is an excerpt from a book authored by Osvaldo Martin titled **Bayesian Analysis with Python.** This book will help you implement Bayesian analysis in your application and will guide you through building complex statistical problems using Python.*[/box]

Our article teaches you how to build an end-to-end Gaussian mixture model with a practical example.

The general idea when building a finite mixture model is that we have a certain number of subpopulations, each one represented by some distribution, and we have data points that belong to those distributions but we do not know to which distribution each point belongs. Thus we need to assign the points properly. We can do that by building a hierarchical model. At the top level of the model we have a random variable, often referred to as a latent variable, which is a variable that is not directly observable. The function of this latent variable is to specify which component distribution a particular observation is assigned to. That is, the latent variable decides which component distribution we are going to use to model a given data point. In the literature, people often use the letter z to indicate latent variables.
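To make the role of z concrete, here is a minimal generative sketch in NumPy (the mixture weights here are hypothetical values chosen for illustration, not from the book): z first picks a component for each point, and the observation is then drawn from the Gaussian that z selected.

```python
import numpy as np

rng = np.random.default_rng(0)

weights = np.array([0.3, 0.3, 0.4])   # hypothetical mixture weights
mus = np.array([9.0, 21.0, 35.0])     # component means
sigma = 2.0                           # shared standard deviation

# The latent variable z selects a component for each data point...
z = rng.choice(len(weights), size=1000, p=weights)
# ...and the observed value is drawn from the Gaussian that z selected.
x = rng.normal(mus[z], sigma)
```

If we only see `x`, recovering which component generated each point is exactly the assignment problem the rest of the article solves.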

Let us start building mixture models with a very simple example. We have a dataset that we want to describe as being composed of three Gaussians.

```
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

clusters = 3
n_cluster = [90, 50, 75]
n_total = sum(n_cluster)
means = [9, 21, 35]
std_devs = [2, 2, 2]
# Each data point is drawn from one of the three Gaussians
mix = np.random.normal(np.repeat(means, n_cluster),
                       np.repeat(std_devs, n_cluster))
sns.kdeplot(np.array(mix))
plt.xlabel('$x$', fontsize=14)
```

In many real situations, when we wish to build models, it is often easier, more effective, and more productive to begin with simpler models and then add complexity, even if we know from the beginning that we will need something more complex. This approach has several advantages, such as getting familiar with the data and the problem, developing intuition, and avoiding getting bogged down in complex models and code that are difficult to debug.

So, we are going to begin by supposing that we know that our data can be described using three Gaussians (or, in general, k Gaussians), maybe because we have enough previous experimental or theoretical knowledge to reasonably assume this, or maybe we came to that conclusion by eyeballing the data. We are also going to assume we know the mean and standard deviation of each Gaussian.

Given these assumptions, the problem is reduced to assigning each point to one of the three possible known Gaussians. There are many methods to solve this task. We, of course, are going to take the Bayesian track and build a probabilistic model.
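To appreciate how constrained this version of the problem is, consider a simple non-Bayesian baseline (not from the book): when all components share the same standard deviation, the maximum-likelihood assignment for each point is simply the Gaussian with the nearest mean.

```python
import numpy as np

rng = np.random.default_rng(1)

known_means = np.array([9.0, 21.0, 35.0])
n_cluster = [90, 50, 75]

# Synthetic data generated as in the example above
mix = rng.normal(np.repeat(known_means, n_cluster), 2)

# With known means and a shared sd, the most likely component for a
# point is the one whose mean is closest to it.
assignments = np.argmin(np.abs(mix[:, None] - known_means[None, :]), axis=1)
counts = np.bincount(assignments, minlength=3)  # roughly [90, 50, 75]
```

The Bayesian model below goes further: instead of a single hard assignment, it gives us a full posterior over the assignments and the mixture weights.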

To develop our model, we can get ideas from the coin-flipping problem. Remember that we had two possible outcomes and we used the Bernoulli distribution to describe them. Since we did not know the probability of getting heads or tails, we used a beta distribution as a prior. Our current problem with the Gaussian mixtures is similar, except that we now have k possible outcomes.
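As a quick illustration of this generalization (using plain NumPy rather than PyMC3), a categorical variable is just a die with k faces, in the same way a Bernoulli variable is a coin:

```python
import numpy as np

rng = np.random.default_rng(2)

# Bernoulli: two outcomes, with probabilities [1 - p, p]
coin = rng.choice(2, size=10000, p=[0.3, 0.7])

# Categorical: the k-outcome generalization (here k = 3)
die = rng.choice(3, size=10000, p=[0.2, 0.3, 0.5])

freqs = np.bincount(die) / die.size  # roughly [0.2, 0.3, 0.5]
```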

The generalization of the Bernoulli distribution to k outcomes is the categorical distribution, and the generalization of the beta distribution is the Dirichlet distribution. This distribution may look a little bit weird at first because it lives in the simplex, which is like an n-dimensional triangle; a 1-simplex is a line, a 2-simplex is a triangle, a 3-simplex a tetrahedron, and so on. Why a simplex? Intuitively, because the output of this distribution is a k-length vector whose elements are restricted to be positive and sum up to one. To understand how the Dirichlet generalizes the beta, let us first refresh a couple of features of the beta distribution. We use the beta for two-outcome problems, one with probability p and the other 1-p. In this sense we can think that the beta returns a two-element vector, [p, 1-p]. Of course, in practice, we omit 1-p because it is fully determined by p. Another feature of the beta distribution is that it is parameterized using two scalars, α and β.

How do these features compare to the Dirichlet distribution? Let us think of the simplest Dirichlet distribution, one we could use to model a three-outcome problem. We get a Dirichlet distribution that returns a three-element vector [p, q, r], where r = 1 - (p + q). We could use three scalars to parameterize such a Dirichlet, and we may call them α, β, and γ; however, this does not scale well to higher dimensions, so we just use a vector named α with length k, where k is the number of outcomes. Note that we can think of the beta and Dirichlet as distributions over probabilities. To get an idea about this distribution, pay attention to the following figure and try to relate each triangular subplot to a beta distribution with similar parameters.
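Both features can be checked numerically with NumPy's `numpy.random.dirichlet`: every Dirichlet draw is a probability vector living on the simplex, and with k = 2 the Dirichlet reduces to the beta.

```python
import numpy as np

rng = np.random.default_rng(3)

# Each Dirichlet draw lives on the simplex:
# nonnegative entries that sum to one.
sample = rng.dirichlet(alpha=[2.0, 3.0, 5.0])

# With k = 2, Dirichlet([a, b]) returns [p, 1 - p] with p ~ Beta(a, b)
two_outcome = rng.dirichlet(alpha=[2.0, 5.0], size=10000)
beta_draws = rng.beta(2.0, 5.0, size=10000)
# Both first components have mean a / (a + b) = 2/7 ≈ 0.286
```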

The preceding figure is the output of the code written by Thomas Boggs with just a few minor tweaks. You can find the code in the accompanying text; also check the Keep reading sections for details.

Now that we have a better grasp of the Dirichlet distribution, we have all the elements to build our mixture model. One way to visualize it is as a k-sided coin-flip model on top of a Gaussian estimation model; of course, instead of k-sided coins we use categorical variables.

The rounded-corner box in the model's diagram indicates that we have k Gaussian likelihoods (with their corresponding priors), and the categorical variables decide which of them we use to describe a given data point.

Remember, we are assuming we know the means and standard deviations of the Gaussians; we just need to assign each data point to one Gaussian. One detail of the following model is that we use two samplers, Metropolis and ElemwiseCategorical, the latter being specially designed to sample discrete variables:

```
import pymc3 as pm

with pm.Model() as model_kg:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    category = pm.Categorical('category', p=p, shape=n_total)
    means = pm.math.constant([10, 20, 35])
    y = pm.Normal('y', mu=means[category], sd=2, observed=mix)
    step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters))
    step2 = pm.Metropolis(vars=[p])
    trace_kg = pm.sample(10000, step=[step1, step2])

chain_kg = trace_kg[1000:]
varnames_kg = ['p']
pm.traceplot(chain_kg, varnames_kg)
```

Now that we know the skeleton of a Gaussian mixture model, we are going to add a complexity layer and we are going to estimate the parameters of the Gaussians. We are going to assume three different means and a single shared standard deviation.

As usual, the model translates easily to the PyMC3 syntax.

```
with pm.Model() as model_ug:
    p = pm.Dirichlet('p', a=np.ones(clusters))
    category = pm.Categorical('category', p=p, shape=n_total)
    means = pm.Normal('means', mu=[10, 20, 35], sd=2, shape=clusters)
    sd = pm.HalfCauchy('sd', 5)
    y = pm.Normal('y', mu=means[category], sd=sd, observed=mix)
    step1 = pm.ElemwiseCategorical(vars=[category], values=range(clusters))
    step2 = pm.Metropolis(vars=[means, sd, p])
    trace_ug = pm.sample(10000, step=[step1, step2])
```

Now we explore the trace we got:

```
chain = trace_ug[1000:]
varnames = ['means', 'sd', 'p']
pm.traceplot(chain, varnames)
```

And a tabulated summary of the inference:

`pm.df_summary(chain, varnames)`

| | mean | sd | mc_error | hpd_2.5 | hpd_97.5 |
| --- | --- | --- | --- | --- | --- |
| means__0 | 21.053935 | 0.310447 | 0.012280 | 20.495889 | 21.735211 |
| means__1 | 35.291631 | 0.246817 | 0.008159 | 34.831048 | 35.781825 |
| means__2 | 8.956950 | 0.235121 | 0.005993 | 8.516094 | 9.429345 |
| sd | 2.156459 | 0.107277 | 0.002710 | 1.948067 | 2.368482 |
| p__0 | 0.235553 | 0.030201 | 0.000793 | 0.179247 | 0.297747 |
| p__1 | 0.349896 | 0.033905 | 0.000957 | 0.281977 | 0.412592 |
| p__2 | 0.347436 | 0.032414 | 0.000942 | 0.286669 | 0.410189 |

Now we are going to do a posterior predictive check to see what our model learned from the data:

```
ppc = pm.sample_ppc(chain, 50, model_ug)
for i in ppc['y']:
    sns.kdeplot(i, alpha=0.1, color='b')
sns.kdeplot(np.array(mix), lw=2, color='k')
plt.xlabel('$x$', fontsize=14)
```

Notice how the uncertainty, represented by the lighter blue lines, is smaller for the smaller and larger values of x and is higher around the central Gaussian. This makes intuitive sense, since the regions of higher uncertainty correspond to the regions where the Gaussians overlap and hence it is harder to tell whether a point belongs to one or the other Gaussian. I agree that this is a very simple problem and not much of a challenge, but it is a problem that contributes to our intuition and a model that can easily be applied or extended to more complex problems.

We saw how to build a Gaussian mixture model starting from a very basic example, an approach that can be extended to solve more complex problems.

*If you enjoyed this excerpt, check out the book **Bayesian Analysis with Python** to understand the Bayesian framework and solve complex statistical problems using Python.*