7 min read
In this tutorial, we will look at a useful framework for text mining, called topic models. We will apply the framework to the State of the Union addresses.
In building topic models, the number of topics must be determined before running the algorithm (k-dimensions). If no prior reason for the number of topics exists, then you can build several and apply judgment and knowledge to the final selection. There are different methods that come under Topic Modeling. We’ll look at LDA with Gibbs sampling. This method is quite complicated mathematically, but my intent is to provide an introduction so that you are at least able to describe how the algorithm learns to assign a document to a topic in layperson terms. If you are interested in mastering the math associated with the method, block out a couple of hours on your calendar and have a go at it. Excellent background material can be found here.
This tutorial is an excerpt taken from the book ‘Mastering Machine Learning with R – Third Edition’ written by Cory Lesmeister. The book explores expert techniques for solving data analytics and covers machine learning challenges that can help you gain insights from complex projects and power up your applications.
Talking about LDA or Latent Dirichlet Allocation in topic modeling, it is a generative process, and works in the following manner to iterate to a steady state:
- For each document (j), there are 1 to j documents. We will randomly assign a multinomial distribution (Dirichlet distribution) to the topics (k) with 1 to k topics, for example, document A is 25 percent topic one, 25 percent topic two, and 50 percent topic three.
- Probabilistically, for each word (i), there are 1 to i words to a topic (k); for example, the word mean has a probability of 0.25 for the topic statistics.
- For each word (i) in document (j) and topic (k), calculate the proportion of words in that document assigned to that topic; note it as the probability of topic (k) with document (j), p(k|j), and the proportion of word (i) in topic (k) from all the documents containing the word. Note it as the probability of word (i) with topic (k), p(i|k).
- Resample, that is, assign w a new t based on the probability that t contains w, which is based on p(k|j) times p(i|k).
- Rinse and repeat; over numerous iterations, the algorithm finally converges and a document is assigned a topic based on the proportion of words assigned to a topic in that document.
The LDA assumes that the order of words and documents does not matter. There has been work done to relax these assumptions in order to build models of language generation and sequence models over time (known as dynamic topic modeling or DTM).
Applying Topic models in State of the Union addresses
We will leave behind the 19th century and look at these recent times of trial and tribulation (1965 through 2016). On looking at this data, I found something interesting and troubling. Let’s take a look at the 1970s:
> sotu_meta[185:191, 1:4] # A tibble: 7 x 4 president year years_active party
1 Richard M. Nixon 1970 1969-1973 Republican 2 Richard M. Nixon 1971 1969-1973 Republican 3 Richard M. Nixon 1972 1969-1973 Republican 4 Richard M. Nixon 1972 1969-1973 Republican 5 Richard M. Nixon 1974 1973-1974 Republican 6 Richard M. Nixon 1974 1973-1974 Republican 7 Gerald R. Ford 1975 1974-1977 Republican
We see there are two 1972 and two 1974 addresses, but none for 1973. What? I went to the Nixon Foundation website, spent about 10 minutes trying to deconflict this, and finally threw my hands in the air and decided on implementing a quick fix. Be advised that there are a number of these conflicts to put in order:
> sotu_meta[188, 2] sotu_meta[190, 2] sotu_meta[157, 2] sotu_meta[166, 2] sotu_meta[170, 2] sotu_meta[176, 2] sotu_meta[195, 2] sotu_meta[197, 2] sotu_meta[199, 2] sotu_meta[201, 2]
An email to the author of this package is in order. I won’t bother with that, but feel free to solve the issue yourself.
With this tragedy behind us, we’ll go through tokenizing and removing stop words again for our relevant time frame:
> sotu_meta_recent % dplyr::filter(year > 1964) > sotu_meta_recent %>% tidytext::unnest_tokens(word, text) -> sotu_unnest_recent > sotu_recent % dplyr::anti_join(stop_words, by = "word")
As discussed previously, we need to put the data into a DTM before building a model. This is done by creating a word count grouped by year, then passing that to the cast_dtm() function:
> sotu_recent %>% dplyr::group_by(year) %>% dplyr::count(word) -> lda_words > sotu_dtm
Let’s get our model built. I’m going to create six different topics using the Gibbs method, and I specified verbose. It should run 2,000 iterations:
> sotu_lda sotu_lda A LDA_Gibbs topic model with 6 topics.
The algorithm gives each topic a number. We can see what year is mapped to what topic. I abbreviate the output since 2002:
> topicmodels::topics(sotu_lda) 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2 2 2 2 2 2 2 4 4 4 4 4 4 4 4
We see a clear transition between Bush and Obama from topic 2 to topic 4. Here is a table of the count of topics:
> table(topicmodels::topics(sotu_lda)) 1 2 3 4 5 6 8 7 5 18 14 5
Topic 4 is the most prevalent, which is associated with Clinton’s term also. This output gives us the top five words associated with each topic:
> topicmodels::terms(sotu_lda, 5) Topic 1 Topic 2 Topic 3 [1,] "future" "america" "administration" [2,] "tax" "security" "congress" [3,] "spending" "country" "economic" [4,] "government" "world" "legislation" [5,] "economic" "iraq" "energy" Topic 4 Topic 5 Topic 6 [1,] "people" "world" "federal" [2,] "american" "people" "programs" [3,] "jobs" "american" "government" [4,] "america" "congress" "program" [5,] "children" "peace" "act"
This all makes good sense, and topic 2 is spot on for the time. If you drill down further to, say, 10, 15, or 20 words, it is even more revealing, but I won’t bore you further. What about an application in the tidy ecosystem and a visualization? Certainly! We’ll turn the model object into a data frame first and in the process capture the per-topic-per-word probabilities called beta:
> lda_topics ap_top_terms % dplyr::group_by(topic) %>% dplyr::top_n(10, beta) %>% dplyr::ungroup() %>% dplyr::arrange(topic, -beta)
We can explore that data further or just plot it as follows:
> ap_top_terms %>% dplyr::mutate(term = reorder(term, beta)) %>% ggplot2::ggplot(ggplot2::aes(term, beta, fill = factor(topic))) + ggplot2::geom_col(show.legend = FALSE) + ggplot2::facet_wrap(~ topic, scales = "free") + ggplot2::coord_flip() + ggthemes::theme_economist_white()
The output of the preceding code is as follows:
This is the top 10 words per topic based on the beta probability. Another thing we can do is look at the probability an address is related to a topic. This is referred to as gamma in the model and we can pull those in just like the beta:
We now have the probabilities of an address per topic. Let’s look at the 1981 Ronald Reagan values:
> dplyr::filter(ap_documents, document == "1981") # A tibble: 6 x 3 document topic gamma
1 1981 1 0.286 2 1981 2 0.0163 3 1981 3 0.0923 4 1981 4 0.118 5 1981 5 0.0777 6 1981 6 0.411
Topic 1 is a close second in the topic race. If you think about it, this means that more than six topics would help to create better separation in the probabilities. However, I like just six topics for this tutorial for the purpose of demonstration.
In this tutorial, we looked at topic models in R. We applied the framework to the State of the Union addresses. If you want to stay updated with expert techniques for solving data analytics and explore other machine learning challenges in R, be sure to check out the book ‘Mastering Machine Learning with R – Third Edition’.