Mining Twitter with Python – Influence and Engagement

July 11, 2016 - 12:00 am

In this article by Marco Bonzanini, author of the book Mastering Social Media Mining with Python, we will discuss mining Twitter data by analyzing users, their connections, and their interactions. In particular, we will discuss how to measure influence and engagement on Twitter.

Measuring influence and engagement

One of the most commonly mentioned characters in the social media arena is the mythical influencer. This figure is responsible for a paradigm shift in recent marketing strategies (https://en.wikipedia.org/wiki/Influencer_marketing), which focus on targeting key individuals rather than the market as a whole.

Influencers are typically active users within their community. In the case of Twitter, an influencer tweets a lot about topics they care about. Influencers are well connected, as they follow and are followed by many other users who are also involved in the community. In general, an influencer is also regarded as an expert in their area and is typically trusted by other users.

This description should explain why influencers are an important part of recent trends in marketing: an influencer can increase awareness or even become an advocate of a specific product or brand and can reach a vast number of supporters.

Whether your main interest is Python programming or wine tasting, and regardless of how large (or small) your social network is, you probably already have an idea of who the influencers in your social circles are: a friend, acquaintance, or random stranger on the Internet whose opinion you trust and value because of their expertise on the given subject.

A different, but somewhat related, concept is engagement. User engagement, or customer engagement, is the assessment of the response to a particular offer, such as a product or service. In the context of social media, pieces of content are often created with the purpose of driving traffic to the company website or e-commerce platform. Measuring engagement is important, as it helps in defining and understanding strategies to maximize the interactions with your network and ultimately bring in business. On Twitter, users engage by retweeting or liking a particular tweet, which, in turn, gives more visibility to the tweet itself.

In this section, we’ll discuss some interesting aspects of social media analysis regarding the possibility of measuring influence and engagement. On Twitter, a natural thought would be to associate influence with the number of users in a particular network. Intuitively, a high number of followers means that a user can reach more people, but it doesn’t tell us how a tweet is perceived.

The following script compares some statistics for two user profiles:

import sys
import json

def usage():
  print("Usage:")
  print("python {} <username1> <username2>".format(sys.argv[0]))

if __name__ == '__main__':
  # Exactly two screen names are expected on the command line
  if len(sys.argv) != 3:
    usage()
    sys.exit(1)
  screen_name1 = sys.argv[1]
  screen_name2 = sys.argv[2]

After reading the two screen names from the command line, we will build a list of followers for each of them, including each follower’s own number of followers, to calculate the number of reachable users:

  followers_file1 = 'users/{}/followers.jsonl'.format(screen_name1)
  followers_file2 = 'users/{}/followers.jsonl'.format(screen_name2)
  with open(followers_file1) as f1, open(followers_file2) as f2:
    reach1 = []
    reach2 = []
    for line in f1:
      profile = json.loads(line)
      reach1.append((profile['screen_name'], profile['followers_count']))
    for line in f2:
      profile = json.loads(line)
      reach2.append((profile['screen_name'], profile['followers_count']))

We will then load some basic statistics (followers and statuses count) from the two user profiles:

  profile_file1 = 'users/{}/user_profile.json'.format(screen_name1)
  profile_file2 = 'users/{}/user_profile.json'.format(screen_name2)
  with open(profile_file1) as f1, open(profile_file2) as f2:
    profile1 = json.load(f1)
    profile2 = json.load(f2)
    followers1 = profile1['followers_count']
    followers2 = profile2['followers_count']
    tweets1 = profile1['statuses_count']
    tweets2 = profile2['statuses_count']

  sum_reach1 = sum([x[1] for x in reach1])
  sum_reach2 = sum([x[1] for x in reach2])
  avg_followers1 = round(sum_reach1 / followers1, 2)
  avg_followers2 = round(sum_reach2 / followers2, 2)

We will also load the timelines for the two users, in particular, to observe the number of times their tweets have been favorited or retweeted:

  timeline_file1 = 'user_timeline_{}.jsonl'.format(screen_name1)
  timeline_file2 = 'user_timeline_{}.jsonl'.format(screen_name2)
  with open(timeline_file1) as f1, open(timeline_file2) as f2:
    favorite_count1, retweet_count1 = [], []
    favorite_count2, retweet_count2 = [], []
    for line in f1:
      tweet = json.loads(line)
      favorite_count1.append(tweet['favorite_count'])
      retweet_count1.append(tweet['retweet_count'])
    for line in f2:
      tweet = json.loads(line)
      favorite_count2.append(tweet['favorite_count'])
      retweet_count2.append(tweet['retweet_count'])

The preceding numbers are then aggregated into the average number of favorites and the average number of retweets, both per tweet and per follower:

  avg_favorite1 = round(sum(favorite_count1) / tweets1, 2)
  avg_favorite2 = round(sum(favorite_count2) / tweets2, 2)
  avg_retweet1 = round(sum(retweet_count1) / tweets1, 2)
  avg_retweet2 = round(sum(retweet_count2) / tweets2, 2)
  favorite_per_user1 = round(sum(favorite_count1) / followers1, 2)
  favorite_per_user2 = round(sum(favorite_count2) / followers2, 2)
  retweet_per_user1 = round(sum(retweet_count1) / followers1, 2)
  retweet_per_user2 = round(sum(retweet_count2) / followers2, 2)
  print("----- Stats {} -----".format(screen_name1))
  print("{} followers".format(followers1))
  print("{} users reached by 1-degree connections".format(sum_reach1))
  print("Average number of followers for {}'s followers: {}".format(screen_name1, avg_followers1))
  print("Favorited {} times ({} per tweet, {} per user)".format(sum(favorite_count1), avg_favorite1, favorite_per_user1))
  print("Retweeted {} times ({} per tweet, {} per user)".format(sum(retweet_count1), avg_retweet1, retweet_per_user1))
  print("----- Stats {} -----".format(screen_name2))
  print("{} followers".format(followers2))
  print("{} users reached by 1-degree connections".format(sum_reach2))
  print("Average number of followers for {}'s followers: {}".format(screen_name2, avg_followers2))
  print("Favorited {} times ({} per tweet, {} per user)".format(sum(favorite_count2), avg_favorite2, favorite_per_user2))
  print("Retweeted {} times ({} per tweet, {} per user)".format(sum(retweet_count2), avg_retweet2, retweet_per_user2))

This script takes two arguments from the command line and assumes that the data has already been downloaded. In particular, for both users, we need the data about their followers, their user profiles, and their respective user timelines.
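Since the script does not verify that the data is in place, a small check before running it can give a clearer message than a traceback. The following is a minimal sketch and not part of the original script: the helper names required_files() and missing_files() and the base_dir parameter are assumptions introduced here, while the three paths mirror the ones the script opens.

```python
from pathlib import Path


def required_files(screen_name):
    """Relative paths the script opens for one user (same layout as the script)."""
    return [
        Path('users') / screen_name / 'followers.jsonl',
        Path('users') / screen_name / 'user_profile.json',
        Path('user_timeline_{}.jsonl'.format(screen_name)),
    ]


def missing_files(screen_name, base_dir='.'):
    """Return the required files that do not exist under base_dir."""
    base = Path(base_dir)
    return [path for path in required_files(screen_name)
            if not (base / path).is_file()]


if __name__ == '__main__':
    # 'marcobonzanini' is simply the example account used in this article
    for path in missing_files('marcobonzanini'):
        print("Missing data file: {}".format(path))
```

Calling missing_files() for both screen names before opening any file would allow the script to exit early with a list of what still needs to be downloaded.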

The script is somewhat verbose, because it performs the same operations for two profiles and prints everything to the terminal. We can break it down into different parts.

Firstly, we will look into the followers’ followers. This will provide some information related to the part of the network immediately connected to the given user. In other words, it should answer the question: how many users can I reach if all my followers retweet me? Note that the resulting figure is an upper bound, as a user who follows several of my followers is counted more than once. We can achieve this by reading the users/<user>/followers.jsonl file and keeping a list of tuples, where each tuple represents one of the followers and is in the (screen_name, followers_count) form. Keeping the screen name at this stage is useful in case we want to observe who the users with the highest number of followers are (not computed in the script, but easy to produce using sorted()).
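The ranking by number of followers is not computed in the script, but a minimal sketch could look like the following. The function name top_followers() and the sample data are hypothetical; the reach argument is assumed to have the same (screen_name, followers_count) shape as the reach1 and reach2 lists.

```python
from operator import itemgetter


def top_followers(reach, n=5):
    """Return the n (screen_name, followers_count) tuples with the most followers."""
    # Sort on the second tuple element, largest first, and keep the first n
    return sorted(reach, key=itemgetter(1), reverse=True)[:n]


# Hypothetical sample data, in the same shape as the script's reach1 list
sample_reach = [("alice", 120), ("bob", 15000), ("carol", 980)]
print(top_followers(sample_reach, n=2))
# [('bob', 15000), ('carol', 980)]
```

Using itemgetter(1) as the key is equivalent to lambda x: x[1]; either choice sorts the tuples by their follower count.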

In the second step, we will read the user profile from the users/<user>/user_profile.json file so that we can get information about the total number of followers and the total number of tweets. With the data collected so far, we can compute the total number of users who are reachable within one degree of separation (follower of a follower) and the average number of followers of a follower. This is achieved via the following lines:

sum_reach1 = sum([x[1] for x in reach1])
avg_followers1 = round(sum_reach1 / followers1, 2)

The first one uses a list comprehension to iterate through the list of tuples mentioned previously, while the second one is a simple arithmetic average, rounded to two decimal places.
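The same two steps can be wrapped in a small function that also copes with a profile that has no followers, a case in which the division would raise ZeroDivisionError. This is only a sketch under stated assumptions: the function name average_followers() and the sample data are hypothetical, and a generator expression is used instead of the list comprehension simply to avoid building an intermediate list.

```python
def average_followers(reach, followers_count):
    """Return (total reach, average followers per follower)."""
    # reach is a list of (screen_name, followers_count) tuples
    total_reach = sum(count for _, count in reach)
    if followers_count == 0:
        # Assumption: report 0.0 rather than fail for an account with no followers
        return total_reach, 0.0
    return total_reach, round(total_reach / followers_count, 2)


# Hypothetical data: three followers with 10, 20, and 5 followers each
print(average_followers([("a", 10), ("b", 20), ("c", 5)], 3))
# (35, 11.67)
```

As in the script, the divisor is the followers_count taken from the user profile rather than len(reach), so the average is only meaningful if the followers file is complete.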

The third part of the script reads the user timeline from the user_timeline_<user>.jsonl file and collects information about the number of retweets and favorites for each tweet. Putting everything together allows us to calculate how many times a user has been retweeted or favorited, and the average number of retweets and favorites per tweet and per follower.
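Because the script repeats this logic for each user, one option is to move it into a single function. The following sketch is not the original script: the name timeline_stats(), the dictionary keys, and the sample tweets are assumptions made for illustration. Only the favorite_count and retweet_count fields come from the script.

```python
import json


def timeline_stats(lines, tweets_count, followers_count):
    """Aggregate favorites and retweets from an iterable of JSON-encoded tweets."""
    favorites = 0
    retweets = 0
    for line in lines:
        tweet = json.loads(line)
        favorites += tweet['favorite_count']
        retweets += tweet['retweet_count']

    def per(total, denominator):
        # Return 0.0 instead of raising ZeroDivisionError for empty profiles
        return round(total / denominator, 2) if denominator else 0.0

    return {
        'favorites': favorites,
        'retweets': retweets,
        'favorites_per_tweet': per(favorites, tweets_count),
        'retweets_per_tweet': per(retweets, tweets_count),
        'favorites_per_user': per(favorites, followers_count),
        'retweets_per_user': per(retweets, followers_count),
    }


# Hypothetical two-tweet timeline, one JSON document per line
sample_lines = [
    '{"favorite_count": 2, "retweet_count": 4}',
    '{"favorite_count": 1, "retweet_count": 0}',
]
stats = timeline_stats(sample_lines, tweets_count=2, followers_count=4)
print(stats['retweets_per_tweet'], stats['favorites_per_user'])
# 2.0 0.75
```

An open file is also an iterable of lines, so the function could be called as timeline_stats(f, tweets1, followers1) inside a with open(...) block.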

To provide an example, I’ll perform some vanity analysis and compare my account, @marcobonzanini, with Packt Publishing:

$ python twitter_influence.py marcobonzanini PacktPub

The script produces the following output:

----- Stats marcobonzanini -----
282 followers
1411136 users reached by 1-degree connections
Average number of followers for marcobonzanini's followers: 5004.03
Favorited 268 times (1.47 per tweet, 0.95 per user)
Retweeted 912 times (5.01 per tweet, 3.23 per user)
----- Stats PacktPub -----
10209 followers
29961760 users reached by 1-degree connections
Average number of followers for PacktPub's followers: 2934.84
Favorited 3554 times (0.33 per tweet, 0.35 per user)
Retweeted 6434 times (0.6 per tweet, 0.63 per user)

As you can see, the raw number of followers shows no contest, with Packt Publishing having roughly 36 times as many followers as I have. The interesting part of this analysis comes up when we compare the average number of retweets and favorites: apparently, my followers are much more engaged with my content than PacktPub’s. Is this enough to declare that I’m an influencer while PacktPub is not? Clearly not. What we observe here is a natural consequence of the fact that my tweets are probably more focused on specific topics (Python and data science), hence my followers are already more interested in what I’m publishing. On the other hand, the content produced by Packt Publishing is highly diverse, as it ranges across many different technologies. This diversity is also reflected in PacktPub’s followers, who include developers, designers, scientists, system administrators, and so on. For this reason, each of PacktPub’s tweets is found interesting (that is, worth retweeting) by a smaller proportion of their followers.
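To put numbers on this comparison, the ratios can be worked out directly from the script output. This is only an illustrative sketch: the published dictionary and the compare() helper are hypothetical names, and the figures are copied by hand from the output shown earlier.

```python
# Figures copied from the script output shown earlier in the article
published = {
    'marcobonzanini': {'followers': 282,
                       'favorites_per_user': 0.95,
                       'retweets_per_user': 3.23},
    'PacktPub': {'followers': 10209,
                 'favorites_per_user': 0.35,
                 'retweets_per_user': 0.63},
}


def compare(metric, first='marcobonzanini', second='PacktPub'):
    """Ratio of first's metric to second's, rounded to one decimal place."""
    return round(published[first][metric] / published[second][metric], 1)


print(compare('followers', first='PacktPub', second='marcobonzanini'))  # 36.2
print(compare('retweets_per_user'))   # 5.1
print(compare('favorites_per_user'))  # 2.7
```

With these figures, PacktPub has about 36 times as many followers, while retweets per follower are about 5 times higher and favorites per follower almost 3 times higher for @marcobonzanini.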

Summary

In this article, we discussed mining data from Twitter by focusing on the analysis of user connections and interactions. In particular, we discussed how to compare influence and engagement between users.

For more information on social media mining, refer to the following books by Packt Publishing:

Social Media Mining with R: https://www.packtpub.com/big-data-and-business-intelligence/social-media-mining-r
Mastering Social Media Mining with R: https://www.packtpub.com/big-data-and-business-intelligence/mastering-social-media-mining-r
