Home Tutorials How to build topic models in R

Tutorials

How to build topic models in R [Tutorial]

April 22, 2019 - 4:00 am

7545

7 min read

Topic models are a powerful method to group documents by their main topics. Topic models allow probabilistic modeling of term frequency occurrence in documents. The fitted model can be used to estimate the similarity between documents, as well as between a set of specified keywords using an additional layer of latent variables, which are referred to as topics (Grun and Hornik, 2011). In essence, a document is assigned to a topic based on the distribution of the words in that document, and the other documents in that topic will have roughly the same frequency of words.

In this tutorial, we will look at a useful framework for text mining, called topic models. We will apply the framework to the State of the Union addresses.

In building topic models, the number of topics must be determined before running the algorithm (k-dimensions). If no prior reason for the number of topics exists, then you can build several and apply judgment and knowledge to the final selection. There are different methods that come under Topic Modeling. We’ll look at LDA with Gibbs sampling. This method is quite complicated mathematically, but my intent is to provide an introduction so that you are at least able to describe how the algorithm learns to assign a document to a topic in layperson terms. If you are interested in mastering the math associated with the method, block out a couple of hours on your calendar and have a go at it. Excellent background material can be found here.

This tutorial is an excerpt taken from the book ‘Mastering Machine Learning with R – Third Edition’ written by Cory Lesmeister. The book explores expert techniques for solving data analytics and covers machine learning challenges that can help you gain insights from complex projects and power up your applications.

Talking about LDA or Latent Dirichlet Allocation in topic modeling, it is a generative process, and works in the following manner to iterate to a steady state:

For each document (j), there are 1 to j documents. We will randomly assign a multinomial distribution (Dirichlet distribution) to the topics (k) with 1 to k topics, for example, document A is 25 percent topic one, 25 percent topic two, and 50 percent topic three.
Probabilistically, for each word (i), there are 1 to i words to a topic (k); for example, the word mean has a probability of 0.25 for the topic statistics.
For each word (i) in document (j) and topic (k), calculate the proportion of words in that document assigned to that topic; note it as the probability of topic (k) with document (j), p(k|j), and the proportion of word (i) in topic (k) from all the documents containing the word. Note it as the probability of word (i) with topic (k), p(i|k).
Resample, that is, assign w a new t based on the probability that t contains w, which is based on p(k|j) times p(i|k).
Rinse and repeat; over numerous iterations, the algorithm finally converges and a document is assigned a topic based on the proportion of words assigned to a topic in that document.

The LDA assumes that the order of words and documents does not matter. There has been work done to relax these assumptions in order to build models of language generation and sequence models over time (known as dynamic topic modeling or DTM).

Applying Topic models in State of the Union addresses

We will leave behind the 19th century and look at these recent times of trial and tribulation (1965 through 2016). On looking at this data, I found something interesting and troubling. Let’s take a look at the 1970s:

> sotu_meta[185:191, 1:4]
# A tibble: 7 x 4
  president        year  years_active party 
  <chr>            <int> <chr>        <chr> 
1 Richard M. Nixon 1970  1969-1973    Republican
2 Richard M. Nixon 1971  1969-1973    Republican
3 Richard M. Nixon 1972  1969-1973    Republican
4 Richard M. Nixon 1972  1969-1973    Republican
5 Richard M. Nixon 1974  1973-1974    Republican
6 Richard M. Nixon 1974  1973-1974    Republican
7 Gerald R.   Ford 1975  1974-1977    Republican

We see there are two 1972 and two 1974 addresses, but none for 1973. What? I went to the Nixon Foundation website, spent about 10 minutes trying to deconflict this, and finally threw my hands in the air and decided on implementing a quick fix. Be advised that there are a number of these conflicts to put in order:

> sotu_meta[188, 2] <- "1972_2"

> sotu_meta[190, 2] <- "1974_2"

> sotu_meta[157, 2] <- "1945_2"

> sotu_meta[166, 2] <- "1953_2"

> sotu_meta[170, 2] <- "1956_2"

> sotu_meta[176, 2] <- "1961_2"

> sotu_meta[195, 2] <- "1978_2"

> sotu_meta[197, 2] <- "1979_2"

> sotu_meta[199, 2] <- "1980_2"

> sotu_meta[201, 2] <- "1981_2"

An email to the author of this package is in order. I won’t bother with that, but feel free to solve the issue yourself.

With this tragedy behind us, we’ll go through tokenizing and removing stop words again for our relevant time frame:

> sotu_meta_recent <- sotu_meta %>%
    dplyr::filter(year > 1964)

> sotu_meta_recent %>%
    tidytext::unnest_tokens(word, text) -> sotu_unnest_recent

> sotu_recent <- sotu_unnest_recent %>%
    dplyr::anti_join(stop_words, by = "word")

As discussed previously, we need to put the data into a DTM before building a model. This is done by creating a word count grouped by year, then passing that to the cast_dtm() function:

> sotu_recent %>%
    dplyr::group_by(year) %>%
    dplyr::count(word) -> lda_words

> sotu_dtm <- tidytext::cast_dtm(lda_words, year, word, n)

Let’s get our model built. I’m going to create six different topics using the Gibbs method, and I specified verbose. It should run 2,000 iterations:

> sotu_lda <-
 topicmodels::LDA(
 sotu_dtm,
 k = 6,
 method = "Gibbs",
 control = list(seed = 1965, verbose = 1)
 )

> sotu_lda
A LDA_Gibbs topic model with 6 topics.

The algorithm gives each topic a number. We can see what year is mapped to what topic. I abbreviate the output since 2002:

> topicmodels::topics(sotu_lda)
2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
   2    2    2    2    2    2    2    4    4    4    4    4    4    4    4

We see a clear transition between Bush and Obama from topic 2 to topic 4. Here is a table of the count of topics:

> table(topicmodels::topics(sotu_lda))

 1 2 3  4  5 6 
 8 7 5 18 14 5

Topic 4 is the most prevalent, which is associated with Clinton’s term also. This output gives us the top five words associated with each topic:

> topicmodels::terms(sotu_lda, 5)
     Topic 1      Topic 2    Topic 3 
[1,] "future"     "america"  "administration"
[2,] "tax"        "security" "congress"
[3,] "spending"   "country"  "economic"
[4,] "government" "world"    "legislation"
[5,] "economic"   "iraq"     "energy"

      Topic 4     Topic 5    Topic 6 
[1,] "people"     "world"    "federal"
[2,] "american"   "people"   "programs"
[3,] "jobs"       "american" "government"
[4,] "america"    "congress" "program"
[5,] "children"   "peace"    "act"

This all makes good sense, and topic 2 is spot on for the time. If you drill down further to, say, 10, 15, or 20 words, it is even more revealing, but I won’t bore you further. What about an application in the tidy ecosystem and a visualization? Certainly! We’ll turn the model object into a data frame first and in the process capture the per-topic-per-word probabilities called beta:

> lda_topics <- tidytext::tidy(sotu_lda, matrix = "beta")

> ap_top_terms <- lda_topics %>%
    dplyr::group_by(topic) %>%
    dplyr::top_n(10, beta) %>%
    dplyr::ungroup() %>%
    dplyr::arrange(topic, -beta)

We can explore that data further or just plot it as follows:

> ap_top_terms %>%
    dplyr::mutate(term = reorder(term, beta)) %>%
    ggplot2::ggplot(ggplot2::aes(term, beta, fill = factor(topic))) +
    ggplot2::geom_col(show.legend = FALSE) +
    ggplot2::facet_wrap(~ topic, scales = "free") +
    ggplot2::coord_flip() +
    ggthemes::theme_economist_white()

The output of the preceding code is as follows:

This is the top 10 words per topic based on the beta probability. Another thing we can do is look at the probability an address is related to a topic. This is referred to as gamma in the model and we can pull those in just like the beta:

> ap_documents <- tidytext::tidy(sotu_lda, matrix = "gamma")

We now have the probabilities of an address per topic. Let’s look at the 1981 Ronald Reagan values:

> dplyr::filter(ap_documents, document == "1981")
# A tibble: 6 x 3
  document topic gamma
  <chr>    <int> <dbl>
1 1981      1    0.286 
2 1981      2    0.0163
3 1981      3    0.0923
4 1981      4    0.118 
5 1981      5    0.0777
6 1981      6    0.411

Topic 1 is a close second in the topic race. If you think about it, this means that more than six topics would help to create better separation in the probabilities. However, I like just six topics for this tutorial for the purpose of demonstration.

In this tutorial, we looked at topic models in R. We applied the framework to the State of the Union addresses. If you want to stay updated with expert techniques for solving data analytics and explore other machine learning challenges in R, be sure to check out the book ‘Mastering Machine Learning with R – Third Edition’.

Top 6 Cybersecurity Books from Packt to Accelerate Your Career

Your Quick Introduction to Extended Events in Analysis Services from Blog…

Logging the history of my past SQL Saturday presentations from Blog…

Storage savings with Table Compression from Blog Posts – SQLServerCentral

Daily Coping 31 Dec 2020 from Blog Posts – SQLServerCentral

Learning Essential Linux Commands for Navigating the Shell Effectively

Exploring the Strategy Behavioral Design Pattern in Node.js

How to integrate a Medium editor in Angular 8

Implementing memory management with Golang’s garbage collector

How to create sales analysis app in Qlik Sense using DAR…

How to build topic models in R [Tutorial]

Applying Topic models in State of the Union addresses

Read Next

Interviews

Learning Essential Linux Commands for Navigating the Shell Effectively

Exploring Forms in Angular – types, benefits and differences

Gain Practical Expertise with the Latest Edition of Software Architecture with C# 9...

Exploring the Strategy Behavioral Design Pattern in Node.js

Giving material.angular.io a refresh from Angular Blog – Medium

Popular on Packt Hub

How to use arrays, lists, and dictionaries in Unity for 3D...

Customizing Elgg Themes

Using Python Automation to interact with network devices [Tutorial]

Basics of Jupyter Notebook and Python

OpenCV: Detecting Edges, Lines, and Shapes

MobilePro

datapro

Programming

Subscribe to our newsletter