In this article, David Bellot, author of the book Learning Probabilistic Graphical Models in R, explains that among all the predictions that were made about the 21st century, we may not have expected that we would collect such a formidable amount of data about everything, everyday, and everywhere in the world. The past years have seen an incredible explosion of data collection about our world and lives, and technology is the main driver of what we can certainly call a revolution. We live in the age of information. However, collecting data is worth nothing if we don't exploit it and try to extract knowledge from it.
At the beginning of the 20th century, with the birth of statistics, the world was all about collecting data and computing statistics. Back then, the only reliable tools were pencil and paper and, of course, the eyes and ears of the observers. Scientific observation was still in its infancy despite the prodigious developments of the 19th century.
More than a hundred years later, we have computers, electronic sensors, and massive data storage, and we are able to store huge amounts of data continuously, not only about our physical world but also about our lives, mainly through the use of social networks, the Internet, and mobile phones. Moreover, the density of our storage technology has increased so much that we can nowadays store months, if not years, of data in a very small volume that fits in the palm of our hand.
Among all the tools and theories that have been developed to analyze, understand, and manipulate this data, probability and statistics became among the most used. In this field, we are interested in a special, versatile, and powerful class of models called probabilistic graphical models (PGM, for short).
A probabilistic graphical model is a tool to represent beliefs and uncertain knowledge about facts and events using probabilities. It is also one of the most advanced machine learning techniques nowadays, with many industrial success stories. Probabilistic graphical models can deal with our imperfect knowledge about the world, because our knowledge is always limited: we can't observe everything, and we can't represent the entire universe in a computer. We are intrinsically limited as human beings, and so are our computers. With probabilistic graphical models, we can build simple learning algorithms or complex expert systems. With new data, we can improve these models and refine them as much as we can, and we can also infer new information or make predictions about unseen situations and events. Seen from the point of view of mathematics, a probabilistic graphical model is a way to represent a probability distribution over several variables, which is called a joint probability distribution.
In a PGM, such knowledge between variables can be represented with a graph, that is, nodes connected by edges, each with a specific meaning. Let's consider an example from the medical world: how to diagnose a cold. This is only an example and by no means medical advice; it is oversimplified for the sake of clarity. We define several random variables, such as the following:
- Se: This means the season of the year
- N: This means that the nose is blocked
- H: This means that the patient has a headache
- S: This means that the patient regularly sneezes
- C: This means that the patient coughs
- Cold: This means that the patient has a cold
Because each of the symptoms can exist at different degrees, it is natural to represent these variables as random variables. For example, if the patient's nose is a bit blocked, we will assign a probability of, say, 60% to this variable, that is, P(N=blocked)=0.6 and P(N=not blocked)=0.4.
In this example, the probability distribution P(Se,N,H,S,C,Cold) will require 4 × 2^5 = 128 values in total (4 values for the season and 2 values for each of the other random variables). It's quite a lot, and honestly, it's quite difficult to determine things such as the probability that the nose is not blocked, the patient has a headache, the patient sneezes, and so on. However, we can say that a headache is not directly related to a cough or a blocked nose, except when the patient has a cold. Indeed, the patient can have a headache for many other reasons. Moreover, we can say that the season has quite a direct effect on sneezing, a blocked nose, or a cough, but less or no direct effect on a headache. In a probabilistic graphical model, we will represent these dependency relationships with a graph, as follows, where each random variable is a node in the graph and each relationship is an arrow between two nodes:
In such a graph, there is a direct correspondence between each node and each variable of the probabilistic graphical model, and also between the arrows and the way we can simplify the joint probability distribution in order to make it tractable.
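The counting argument can be sanity-checked in one line of R; the figures below assume, as in the text, 4 states for the season and 2 for each of the other variables:

```r
# Size of the full joint table P(Se, N, H, S, C, Cold):
# Se has 4 states (the seasons); the other 5 variables are binary.
cards <- c(Se = 4, N = 2, H = 2, S = 2, C = 2, Cold = 2)
full_joint_size <- prod(cards)   # 4 * 2^5
print(full_joint_size)           # 128
```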
Using a graph as a model to simplify a complex (and sometimes complicated) distribution presents numerous benefits:
- As we observed in the previous example, and in general when we model a problem, the random variables interact directly with only a small subset of the other random variables. Therefore, this promotes more compact and tractable models
- The knowledge and dependencies represented in a graph are easy to understand and communicate
- The graph induces a compact representation of the joint probability distribution and it is easy to make computations with
- Algorithms for inference and learning can use graph theory and its associated algorithms to improve and facilitate all the inference and learning procedures. Compared to working with the raw joint probability distribution, using a PGM will speed up computations by several orders of magnitude
The junction tree algorithm
The junction tree algorithm is one of the main algorithms for performing inference on PGMs. Its name arises from the fact that, before doing the numerical computations, we transform the graph of the PGM into a tree with a set of properties that allow for the efficient computation of posterior probabilities.
One of the main aspects is that this algorithm will not only compute the posterior distribution of the variables in the query, but also the posterior distribution of all other variables that are not observed. Therefore, for the same computational price, one can have any posterior distribution.
Implementing a junction tree algorithm is a complex task, but fortunately, several R packages contain a full implementation, for example, gRain.
Let's say we have several variables: A, B, C, D, E, and F. For the sake of simplicity, we will consider that each variable is binary so that we won't have too many values to deal with. We will assume the following factorization:

P(A,B,C,D,E,F) = P(F) P(C|F) P(E|F) P(A|C) P(D|E) P(B|A,D)
This is represented by the following graph:
We first start by loading the gRain package into R:

library(gRain)
Then, we create our set of random variables from A to F:
val = c("true", "false")
F = cptable(~F, values=c(10,90), levels=val)
C = cptable(~C|F, values=c(10,90,20,80), levels=val)
E = cptable(~E|F, values=c(50,50,30,70), levels=val)
A = cptable(~A|C, values=c(50,50,70,30), levels=val)
D = cptable(~D|E, values=c(60,40,70,30), levels=val)
B = cptable(~B|A:D, values=c(60,40,70,30,20,80,10,90), levels=val)
The cptable function creates a conditional probability table, which is a factor for discrete variables. The probabilities associated to each variable are purely subjective and only serve the purpose of the example. The next step is to compute the junction tree. In most packages, computing the junction tree is done by calling one function because the algorithm just does everything at once:
plist = compileCPT(list(F,E,C,A,D,B))
plist
Then, we check whether the list of variables is correctly compiled into a probabilistic graphical model; from the previous code, we obtain:
CPTspec with probabilities:
 P( F )
 P( E | F )
 P( C | F )
 P( A | C )
 P( D | E )
 P( B | A D )
This is indeed the factorization of our distribution, as stated earlier. If we want to check further, we can look at the conditional probability table of a few variables:
print(plist$F)
print(plist$B)

F
 true false
  0.1   0.9

, , D = true

       A
B       true false
  true   0.6   0.7
  false  0.4   0.3

, , D = false

       A
B       true false
  true   0.2   0.1
  false  0.8   0.9
The second output is a bit more complex, but if you look carefully, you will see that there are two distributions, P(B|A,D=true) and P(B|A,D=false), which is a more readable presentation of P(B|A,D).
We finally create the graph and run the junction tree algorithm by calling this:
jtree = grain(plist)
Again, when we check the result, we obtain:
jtree

Independence network: Compiled: FALSE Propagated: FALSE
  Nodes: chr [1:6] "F" "E" "C" "A" "D" "B"
We only need to compute the junction tree once. Then, all queries can be computed with the same junction tree. Of course, if you change the graph, then you need to recompute the junction tree.
Let’s perform a few queries:
querygrain(jtree, nodes=c("F"), type="marginal")

$F
F
 true false
  0.1   0.9
Of course, if you ask for the marginal distribution of F, you will obtain the initial conditional probability table because F has no parents.
querygrain(jtree, nodes=c("C"), type="marginal")

$C
C
 true false
 0.19  0.81
This is more interesting: it computes the marginal of C while we only specified the conditional distribution of C given F. We didn't need an algorithm as complex as the junction tree to compute such a small marginal; the variable elimination algorithm we saw earlier would have been enough too.
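As a quick cross-check, this marginal can be reproduced by hand in base R from the numbers passed to cptable, using P(C) = Σ_F P(C|F) P(F):

```r
# Marginalize F out of P(C|F) P(F), using the cptable numbers above.
p_F <- c(true = 0.1, false = 0.9)            # P(F)
p_C_given_F <- rbind(true  = c(0.1, 0.2),    # P(C = true  | F = true/false)
                     false = c(0.9, 0.8))    # P(C = false | F = true/false)
p_C <- p_C_given_F %*% p_F                   # P(C) = sum_F P(C|F) P(F)
print(p_C)                                   # true 0.19, false 0.81
```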
But if you ask for the marginal of B, then variable elimination will not work because of the loop in the graph. However, the junction tree will give the following:
querygrain(jtree, nodes=c("B"), type="marginal")

$B
B
    true    false
0.478564 0.521436
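This value can also be verified by brute force: enumerate the 2^6 configurations of the full joint distribution built from the same conditional tables and sum out everything except B. A minimal base-R sketch (no gRain needed):

```r
# Brute-force marginal of B from the factorization
# P(F) P(C|F) P(E|F) P(A|C) P(D|E) P(B|A,D); state 1 = true, state 2 = false.
pF <- c(0.1, 0.9)
pC <- matrix(c(0.1, 0.9, 0.2, 0.8), 2)   # P(C|F), columns F = true, false
pE <- matrix(c(0.5, 0.5, 0.3, 0.7), 2)   # P(E|F)
pA <- matrix(c(0.5, 0.5, 0.7, 0.3), 2)   # P(A|C)
pD <- matrix(c(0.6, 0.4, 0.7, 0.3), 2)   # P(D|E)
pB <- array(c(0.6, 0.4, 0.7, 0.3, 0.2, 0.8, 0.1, 0.9), c(2, 2, 2))  # P(B|A,D)
p_B <- c(true = 0, false = 0)
for (f in 1:2) for (cc in 1:2) for (e in 1:2)
  for (a in 1:2) for (d in 1:2) for (b in 1:2) {
    pr <- pF[f] * pC[cc, f] * pE[e, f] * pA[a, cc] * pD[d, e] * pB[b, a, d]
    p_B[b] <- p_B[b] + pr
  }
print(p_B)   # true 0.478564, false 0.521436 -- the same as the junction tree
```

Of course, this only works because the example is tiny; the whole point of the junction tree algorithm is to avoid this exponential enumeration.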
And we can ask for more complex distributions, such as the joint distribution of B and A:
querygrain(jtree, nodes=c("A","B"), type="joint")

       B
A           true    false
  true  0.309272 0.352728
  false 0.169292 0.168708

In fact, any combination can be given, like A, B, and C:

querygrain(jtree, nodes=c("A","B","C"), type="joint")

, , B = true

       A
C           true    false
  true  0.044420 0.047630
  false 0.264852 0.121662

, , B = false

       A
C           true    false
  true  0.050580 0.047370
  false 0.302148 0.121338
Now, we want to observe a variable and compute the posterior distribution. Let’s say F=true and we want to propagate this information down to the rest of the network:
jtree2 = setEvidence(jtree, evidence=list(F="true"))
We can ask for any joint or marginal now:
querygrain(jtree, nodes=c("A"), type="marginal")

$A
A
 true false
0.662 0.338

querygrain(jtree2, nodes=c("A"), type="marginal")

$A
A
 true false
 0.68  0.32
Here, we see that knowing that F=true changed the marginal distribution of A from its previous marginal (the second query uses jtree2, the tree with evidence). And we can query any other variable:
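The new value of 0.68 is easy to verify by hand: once F is fixed to true, A depends on F only through C, so P(A=true|F=true) = Σ_C P(A=true|C) P(C|F=true). With the cptable numbers:

```r
# Posterior check: P(A = true | F = true), summing C out.
p_C_given_Ftrue <- c(true = 0.1, false = 0.9)   # P(C | F = true)
p_Atrue_given_C <- c(true = 0.5, false = 0.7)   # P(A = true | C)
p_Atrue_given_Ftrue <- sum(p_Atrue_given_C * p_C_given_Ftrue)
print(p_Atrue_given_Ftrue)                      # 0.68
```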
querygrain(jtree, nodes=c("B"), type="marginal")

$B
B
    true    false
0.478564 0.521436

querygrain(jtree2, nodes=c("B"), type="marginal")

$B
B
  true  false
0.4696 0.5304
Building a probabilistic graphical model generally requires three steps: defining the random variables, which are also the nodes of the graph; defining the structure of the graph; and finally, defining the numerical parameters of each local distribution. So far, the last step has been done manually, and we gave numerical values to each local probability distribution by hand. In many cases, we have access to a wealth of data, and we can find the numerical values of those parameters with a method called parameter learning. In other fields, it is also called parameter fitting or model calibration.
Learning parameters can be done with several approaches, and there is no ultimate solution to the problem because it depends on the goal the model's user wants to reach. Nevertheless, it is common to use the notion of the maximum likelihood of a model and also the maximum a posteriori. As you are now used to the notions of the prior and posterior of a distribution, you can already guess what a maximum a posteriori estimate can do.
Many algorithms are used, among which we can cite the Expectation Maximization algorithm (EM), which computes the maximum likelihood of a model even when data is missing or variables are not observed at all. It is a very important algorithm, especially for mixture models.
A graphical model of a linear model
A PGM can be used to represent standard statistical models and then extend them. One famous example is the linear regression model. We can visualize the structure of a linear model and better understand the relationships between the variables. The linear model captures the relationship between observable variables x and a target variable y. This relation is modeled by a set of parameters, θ. Remember the distribution of y for each data point indexed by i:

P(yi | Xi, θ) = N(Xiβ, σ²)
Here, Xi is a row vector whose first element is always one, to capture the intercept of the linear model. The parameter θ in the following graph is itself composed of the intercept, the coefficient β for each component of X, and the variance σ² of the distribution of yi.
The PGM for an observation of a linear model can be represented as follows:
So, this decomposition leads us to a second version of the graphical model in which we explicitly separate the components of θ:
In a PGM, when a rectangle is drawn around a set of nodes with a number of variables in a corner (N, for example), it means that the same graph is repeated that many times. The likelihood function of a linear model is the product over the N data points of P(yi | Xi, θ), and it can be represented as a PGM. And the vector β can also be decomposed into its univariate components:
In this last iteration of the graphical model, we see that the parameter β could have a prior probability on it instead of being fixed. In fact, a parameter can also be considered as a random variable. For the time being, we will keep it fixed.
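To connect this graphical view back to everyday R, the fixed-θ version of the model is just ordinary least squares: fitting it by maximum likelihood is what lm already does. A small simulation sketch, where the true parameter values are made up for illustration:

```r
set.seed(42)
N <- 1000
x <- runif(N)                                   # the observable variable
beta0 <- 1.5; beta1 <- -2.0; sigma <- 0.5       # the components of theta
y <- beta0 + beta1 * x + rnorm(N, sd = sigma)   # y_i ~ N(X_i beta, sigma^2)
fit <- lm(y ~ x)                                # maximum likelihood fit of beta
print(coef(fit))                                # close to (1.5, -2.0)
```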
Latent Dirichlet Allocation
The last model we want to show in this article is called Latent Dirichlet Allocation (LDA). It is a generative model that can be represented as a graphical model. It's based on the same idea as the mixture model, with one notable exception: in this model, we assume that the data points might be generated by a combination of clusters and not just one cluster at a time, as was the case before.
The LDA model is primarily used in text analysis and classification. Let's consider that a text document is composed of words making sentences and paragraphs. To simplify the problem, we can say that each sentence or paragraph is about one specific topic, such as science, animals, sports, and so on. Topics can also be more specific, such as a cat topic or a European soccer topic. Therefore, some words are more likely to come from specific topics. For example, the word cat is likely to come from the cat topic. The word stadium is likely to come from the European soccer topic. However, the word ball, although it should come with a higher probability from the European soccer topic, is not unlikely to come from the cat topic, because cats like to play with balls too. So, it seems the word ball might belong to two topics at the same time, with different degrees of certainty. Other words, such as table, will certainly belong equally to both topics, and presumably to others; they are very generic, except, of course, if we introduce another topic such as furniture.
A document is a collection of words, so a document can have complex relationships with a set of topics. But in the end, it is more likely to see words coming from the same topic or topics within a paragraph and, to some extent, within the document.
In general, we model a document with a bag of words model, that is, we consider a document to be a randomly generated set of words, using a specific distribution over the words. If this distribution is uniform over all the words, then the document will be purely random without a specific meaning. However, if this distribution has a specific form, with more probability mass to related words, then the collection of words generated by this model will have a meaning. Of course, generating documents is not really the application we have in mind for such a model. What we are interested in is the analysis of documents, their classification, and automatic understanding.
Let's say we have a categorical variable (in other words, a histogram) representing the probability of appearance of all the words from a dictionary. Usually, in this kind of model, we restrict ourselves to long words only and remove the small words, like and, to, but, the, a, and so on. These words are usually called stop words. Let w_j be the jth word in a document. The following three graphs show the progression from representing a document (the left-most graph) to representing a collection of documents (the third graph):
Let θ be a distribution over topics; then, in the second graph from the left, we extend this model by choosing the topic that will be selected at any time and then generating a word out of it. Therefore, the variable zi now becomes the variable zij, that is, the topic selected for the jth word of document i. We can go even further and decide that we want to model a collection of documents, which seems natural if we consider that we have a big data set. Assuming that documents are i.i.d., the next step (the third graph) is a PGM that represents the generative model for M documents. And because the distribution θ is categorical, we want to be Bayesian about it, mainly because it will help the model not to overfit and because we consider the selection of topics for a document to be a random process. Moreover, we want to apply the same treatment to the word variable by having a Dirichlet prior. This prior is used to avoid non-observed words having zero probability; it smooths the distribution of words per topic. A uniform Dirichlet prior will induce a uniform prior distribution on all the words. Therefore, the final graph on the right represents the complete model.
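To make the Dirichlet prior concrete, here is a tiny base-R sketch that samples a topic distribution θ from a uniform Dirichlet using the standard normalized-Gamma construction (the helper name is invented here):

```r
# One draw from Dirichlet(alpha) via normalized Gamma variates.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha)
  g / sum(g)
}
set.seed(1)
theta <- rdirichlet1(c(1, 1, 1))   # uniform Dirichlet over 3 topics
print(theta)                       # a random point on the probability simplex
print(sum(theta))                  # always 1
```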
This is quite a complex graphical model but techniques have been developed to fit the parameters and use this model. If we follow this graphical model carefully, we have a process that generates documents based on a certain set of topics:
- α chooses the distribution over topics, θ, for a document
- From θ, we generate a topic zij
- From this topic, we generate a word wij
In this model, only the words are observable. All the other variables will have to be determined without observation, exactly like in the other mixture models. So, documents are represented as random mixtures over latent topics, in which each topic is represented as a distribution over words.
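The generative story above can be mimicked with a toy simulation; the vocabulary, the topic-word tables, and the mixture θ below are invented purely for illustration:

```r
set.seed(7)
vocab <- c("cat", "ball", "stadium", "goal", "table")
word_probs <- rbind(cats   = c(0.5, 0.3, 0.0, 0.0, 0.2),   # P(word | topic)
                    soccer = c(0.0, 0.3, 0.4, 0.3, 0.0))
theta <- c(cats = 0.7, soccer = 0.3)   # topic mixture of this document
words <- replicate(8, {
  z <- sample(1:2, 1, prob = theta)          # draw a topic z for this word
  sample(vocab, 1, prob = word_probs[z, ])   # draw the word from that topic
})
print(words)   # an 8-word "document" mixing the two topics
```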
The distribution of a topic mixture based on this graphical model can be written as follows, where β denotes the topic-word distributions:

p(θ, z, w | α, β) = p(θ | α) ∏_{j=1}^{N} p(z_j | θ) p(w_j | z_j, β)
You can see in this formula that for each word, we select a topic, hence the product from 1 to N.
Integrating over θ and summing over z, the marginal distribution of a document is as follows:

p(w | α, β) = ∫ p(θ | α) ( ∏_{j=1}^{N} Σ_{z_j} p(z_j | θ) p(w_j | z_j, β) ) dθ
The final distribution can be obtained by taking the product of the marginal distributions of single documents, so as to get the distribution over a collection of documents (assuming that documents are independent and identically distributed). Here, D is the collection of M documents:

p(D | α, β) = ∏_{d=1}^{M} p(w_d | α, β)
The main problem to be solved now is how to compute the posterior distribution over θ and z given a document. By applying the Bayes formula, we know the following:

p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
Unfortunately, this is intractable because of the normalization factor in the denominator. The original paper on LDA therefore refers to a technique called variational inference, which aims at transforming a complex Bayesian inference problem into a simpler approximation that can be solved as a (convex) optimization problem. This technique is the third approach to Bayesian inference and has been used on many other problems.
The probabilistic graphical model framework offers a powerful and versatile way to develop and extend many probabilistic models using an elegant graph-based formalism. It has many applications, for example, in biology, genomics, medicine, finance, robotics, computer vision, automation, engineering, law, and games. Many packages in R exist to deal with all sorts of models and data, among which gRain and Rstan are very popular.