How to predict viral content using random forest regression in Python [Tutorial]


Understanding sharing behavior is big business. As consumers become blind to traditional advertising, the push is to go beyond simple pitches to tell engaging stories. In this article we will build a predictive content scoring model that uses random forest regression to estimate how many shares a piece of content will receive.

This article is an excerpt from a book written by Alexander T. Combs titled Python Machine Learning Blueprints: Intuitive data projects you can relate to.

You can download the code and other relevant files used in this article from this GitHub link.

What does research tell us about content virality?

Increasingly, the success of these endeavors is measured in social shares. Why go to so much trouble? Because as a brand, every share that I receive represents another consumer that I’ve reached—all without spending an additional cent.

Due to this value, several researchers have examined sharing behavior in the hopes of understanding what motivates it.

Among the reasons researchers have found:

  • To provide practical value to others (an altruistic motive)
  • To associate ourselves with certain ideas and concepts (an identity motive)
  • To bond with others around a common emotion (a communal motive)

With regard to the last motive, one particularly well-designed study looked at 7,000 pieces of content from the New York Times to examine the effect of emotion on sharing. The researchers found that simple emotional sentiment was not enough to explain sharing behavior, but when combined with emotional arousal, the explanatory power was greater. For example, while sadness has a strong negative valence, it is considered to be a low arousal state. Anger, on the other hand, has a negative valence paired with a high arousal state. As such, stories that sadden the reader tend to generate far fewer shares than anger-inducing stories:

[Figure: the effect of emotion on sharing]

Source: “What Makes Online Content Viral?” by Jonah Berger and Katherine L. Milkman

Building a predictive content scoring model

Let’s create a model that can estimate the share counts for a given piece of content. Ideally, we would have a much larger sample of content, especially content that had more typical share counts. However, we’ll make do with what we have here.

We’re going to use an algorithm called random forest regression to predict the share counts directly. We could bucket the share counts into ranges and treat this as a classification problem, but regression is preferable when the target is a continuous variable.

To begin, we’ll create a bare-bones model. We’ll use the number of images, the site, and the word count. We’ll train our model on the number of Facebook likes.

We’ll first import the random forest regressor from the scikit-learn library, then we’ll prepare our data by removing the rows with nulls, resetting our index, and finally splitting the frame into our training and testing sets:

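import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# dfc is the article DataFrame, assumed to be already loaded (not shown in this excerpt)
all_data = dfc.dropna(subset=['img_count', 'word_count'])
all_data.reset_index(inplace=True, drop=True)

train_index = []
test_index = []
for i in all_data.index:
    # draws 0 with probability .65 (train) or 1 with probability .35 (test)
    result = np.random.choice(2, p=[.65, .35])
    if result == 1:
        test_index.append(i)
    else:
        train_index.append(i)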

We used a random number generator with a probability set for approximately 2/3 and 1/3 to determine which row items (based on their index) would be placed in each set. Setting the probabilities this way ensures that we get approximately twice the number of rows in our training set as compared to the test set. We see this, as follows:

print('test length:', len(test_index), '\ntrain length:', len(train_index))

The preceding code will generate the following output:

[Output image: test and train set lengths]
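As an aside, a similar random split can be sketched with scikit-learn’s train_test_split, which yields an exact proportion rather than an approximate one. This is an alternative to the loop above, not the method the article uses. The toy frame is hypothetical stand-in data, and random_state is set only so the sketch is repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for all_data: 100 rows with a default 0..99 index
toy = pd.DataFrame({
    'img_count': np.arange(100) % 7,
    'word_count': np.arange(100) * 10,
})

# Split the row labels 65/35; random_state only makes the sketch repeatable
train_index, test_index = train_test_split(
    toy.index.to_list(), test_size=0.35, random_state=0)

print('test length:', len(test_index), '\ntrain length:', len(train_index))
```

The resulting lists of row labels can be used with iloc in the same way as the hand-built train_index and test_index, because the index has been reset to 0..n-1.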

Now, we’ll continue on with preparing our data. Next, we need to set up categorical encoding for our sites. Currently, our DataFrame object has the name for each site represented with a string. We need to use dummy encoding. This creates a column for each site. If the row is for that particular site, then that column will be filled in with 1; all the other site columns will be filled in with 0. Let’s do that now:

sites = pd.get_dummies(all_data['site'])
sites  # display the encoded frame


The preceding code will generate the following output:

[Output image: categorical encoding for our sites]

The dummy encoding can be seen in the preceding image.
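For readers without the output image, here is a minimal sketch of what dummy encoding produces. The site names are hypothetical placeholders, and dtype=int is passed so the columns show 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical site names, used only to illustrate the encoding
toy = pd.DataFrame({'site': ['site_a', 'site_b', 'site_a', 'site_c']})

# One column per distinct site; each row has a single 1 in its own site's column
toy_sites = pd.get_dummies(toy['site'], dtype=int)
print(toy_sites)
```

Each row sums to 1 across the site columns, which is what lets the model treat the site as a set of independent on/off features.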

We’ll now continue by splitting our data into training and test sets as follows:

y_train = all_data.iloc[train_index]['fb'].astype(int)
X_train_nosite = all_data.iloc[train_index][['img_count', 'word_count']]
X_train = pd.merge(X_train_nosite, sites.iloc[train_index],
                   left_index=True, right_index=True)

y_test = all_data.iloc[test_index]['fb'].astype(int)
X_test_nosite = all_data.iloc[test_index][['img_count', 'word_count']]
X_test = pd.merge(X_test_nosite, sites.iloc[test_index],
                  left_index=True, right_index=True)


With this, we’ve set up our X_test, X_train, y_test, and y_train variables. We’ll use these now to build our model:

clf = RandomForestRegressor(n_estimators=1000)
clf.fit(X_train, y_train)

With these two lines of code, we have trained our model. Let’s now use it to predict the Facebook likes for our testing set:

y_pred = clf.predict(X_test)
y_actual = y_test
deltas = pd.DataFrame(list(zip(y_pred, y_actual, (y_pred - y_actual) / y_actual)),
                      columns=['predicted', 'actual', 'delta'])
deltas


The preceding code will generate the following output:

[Output image: predicted and actual Facebook likes with their deltas]

Here we see the predicted value, the actual value, and the difference as a fraction of the actual value. Let’s take a look at the descriptive stats for this:

deltas['delta'].describe()

The preceding code will generate the following output:

[Output image: descriptive stats for the delta column]

Our median error is 0! Well, unfortunately, this isn’t a particularly useful bit of information as errors are on both sides—positive and negative, and they tend to average out, which is what we see here. Let’s now look at a more informative metric to evaluate our model. We’re going to look at root mean square error as a percentage of the actual mean.

To first illustrate why this is more useful, let’s run the following scenario on two sample series:

a = pd.Series([10, 10, 10, 10])
b = pd.Series([12, 8, 8, 12])
np.sqrt(np.mean((b - a)**2)) / np.mean(a)

This results in the following output:

0.2

Now compare this to the mean of the errors:

np.mean(b - a)

This results in the following output:

0.0

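Clearly the former is the more meaningful statistic, because the positive and negative errors no longer cancel out. Let’s now run this for our model:

np.sqrt(np.mean((y_pred - y_actual)**2)) / np.mean(y_actual)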

The preceding code will generate the following output:

[Output image: the model’s root mean square error relative to the actual mean]
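Because this metric is checked again after each feature change, it can be convenient to wrap it in a small helper. The following is a minimal sketch; the name rmse_over_mean is hypothetical and does not come from the article’s code.

```python
import numpy as np

def rmse_over_mean(y_pred, y_actual):
    """Root mean square error divided by the mean of the actual values."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_actual = np.asarray(y_actual, dtype=float)
    rmse = np.sqrt(np.mean((y_pred - y_actual) ** 2))
    return rmse / np.mean(y_actual)

# The two sample series from the illustration above: b as predictions, a as actuals
print(rmse_over_mean([12, 8, 8, 12], [10, 10, 10, 10]))
```

With the model’s variables, the call would be rmse_over_mean(y_pred, y_actual).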

Let’s now add another set of features, counts for the words in each title, and see if it helps our model. We’ll use a count vectorizer to do this. Much like what we did with the site names, we’ll transform individual words and n-grams into features:

from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer(ngram_range=(1, 3))
X_titles_all = vect.fit_transform(all_data['title'])

X_titles_train = X_titles_all[train_index]
X_titles_test = X_titles_all[test_index]

X_test = pd.merge(X_test, pd.DataFrame(X_titles_test.toarray(), index=X_test.index),
                  left_index=True, right_index=True)

X_train = pd.merge(X_train, pd.DataFrame(X_titles_train.toarray(), index=X_train.index),
                   left_index=True, right_index=True)

In these lines, we joined our existing features to our new n-gram features. Let’s now train our model and see if we have any improvement:

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
deltas = pd.DataFrame(list(zip(y_pred, y_actual, (y_pred - y_actual) / y_actual)),
                      columns=['predicted', 'actual', 'delta'])
deltas


The preceding code will generate the following output:

[Output image: predictions with the new n-gram features]
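To make the n-gram features more concrete, here is a small sketch of what a count vectorizer with ngram_range=(1, 3) extracts. The two titles are hypothetical and are not taken from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical titles, used only to illustrate the feature extraction
titles = ['ten cute cats', 'cute cats everywhere']

vect = CountVectorizer(ngram_range=(1, 3))
X = vect.fit_transform(titles)

# Each column is one word, word pair, or word triple seen in the titles
print(sorted(vect.vocabulary_))
# Each row holds the counts of those n-grams for one title
print(X.toarray())
```

Even two short titles produce nine columns here, which hints at how quickly the feature matrix grows on a real set of headlines.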

Checking our errors again with the same metric, we see the following:

np.sqrt(np.mean((y_pred - y_actual)**2)) / np.mean(y_actual)

This code results in the following output:


So, it appears that we have a modestly improved model. Now, let’s add another feature, the word count of the title, as follows:

all_data = all_data.assign(title_wc=all_data['title'].map(lambda x: len(x.split(' '))))

X_train = pd.merge(X_train, all_data[['title_wc']],
                   left_index=True, right_index=True)
X_test = pd.merge(X_test, all_data[['title_wc']],
                  left_index=True, right_index=True)

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

np.sqrt(np.mean((y_pred - y_actual)**2)) / np.mean(y_actual)


The preceding code will generate the following output:


It appears that each feature has modestly improved our model. There are certainly more features that we could add to our model. For example, we could add the day of the week and the hour of the posting, we could determine if the article is a listicle by running a regex on the headline, or we could examine the sentiment of each article. This only begins to touch on the features that could be important to model virality. We would certainly need to go much further to continue reducing the error in our model.
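As a rough sketch of how a few of these additional features might be derived, the snippet below builds a listicle flag, the day of the week, and the hour of posting. The 'published' column and the regex are assumptions made for illustration; the article’s dataset may store timestamps differently, and a production listicle check would likely need a more careful pattern.

```python
import re
import pandas as pd

# Hypothetical rows; the 'published' column is an assumption for illustration
toy = pd.DataFrame({
    'title': ['17 Reasons Cats Rule', 'Senate Passes Budget Bill'],
    'published': pd.to_datetime(['2015-06-01 09:30', '2015-06-06 18:05']),
})

# One possible listicle heuristic: the headline starts with a number
listicle_re = re.compile(r'^\s*\d+\s')
toy['is_listicle'] = toy['title'].map(lambda t: int(bool(listicle_re.match(t))))

# Day of week (Monday=0) and hour of posting as numeric features
toy['day_of_week'] = toy['published'].dt.dayofweek
toy['hour'] = toy['published'].dt.hour

print(toy[['is_listicle', 'day_of_week', 'hour']])
```

These columns could then be merged into X_train and X_test on the index, in the same way as title_wc.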

We have performed only the most cursory testing of our model. Each measurement should be run multiple times to get a more accurate representation of the true error rate. It is possible that there is no statistically discernible difference between our last two models, as we only performed one test.
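One way to run the measurement multiple times is to repeat the random split and average the error. The sketch below uses scikit-learn’s ShuffleSplit on synthetic stand-in data, not the article dataset. The helper name rmse_over_mean is hypothetical, and n_estimators is kept small only so the example runs quickly.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ShuffleSplit

# Synthetic stand-in data, not the article dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 100 + 10 * X[:, 0] + rng.normal(size=200)

def rmse_over_mean(y_pred, y_actual):
    # Root mean square error as a fraction of the actual mean
    return np.sqrt(np.mean((y_pred - y_actual) ** 2)) / np.mean(y_actual)

# Ten independent random 65/35 splits instead of a single one
splitter = ShuffleSplit(n_splits=10, test_size=0.35, random_state=0)
scores = []
for train_idx, test_idx in splitter.split(X):
    # n_estimators is kept small only so the sketch runs quickly
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    scores.append(rmse_over_mean(model.predict(X[test_idx]), y[test_idx]))

print('mean error:', np.mean(scores), 'std:', np.std(scores))
```

If the spread of scores across splits is as large as the gap between two models, a single-split comparison cannot tell them apart.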

To summarize, we learned how to build a model that predicts content virality using random forest regression. To learn more about prediction and other machine learning projects in Python, check out Python Machine Learning Blueprints: Intuitive data projects you can relate to.
