
In this article, we will cover:

  • Twitter and its importance
  • Getting hands on with Twitter’s data and using various Twitter APIs
  • Use of data to solve business problems—comparison of various businesses based on tweets


Twitter and its importance

Twitter can be considered an extension of the Short Message Service (SMS), but on an Internet-based platform. In the words of Jack Dorsey, co-founder and co-creator of Twitter:

“…We came across the word ‘twitter’, and it was just perfect. The definition was ‘a short burst of inconsequential information,’ and ‘chirps from birds’. And that’s exactly what the product was”

Twitter acts as a utility where one can send an SMS to the whole world. It enables people to get heard instantaneously and, since the audience is so large, responses often arrive very quickly. So, Twitter facilitates the basic social instincts of humans. By sharing on Twitter, a user can easily express an opinion about just about anything, at any time. Friends who are connected (or, in the case of Twitter, followers) immediately get the information about what’s going on in someone’s life. This in turn serves another human emotion: the innate need to know what is going on in someone’s life. Apart from being real time, Twitter’s UI is really easy to work with; it is naturally and instinctively understood, that is, the UI is very intuitive.

Each tweet on Twitter is a short message with a maximum of 140 characters. Twitter is an excellent example of a microblogging service. As of July 2014, the Twitter user base had grown beyond 500 million, with more than 271 million active users. Around 23 percent of adult Internet users are on Twitter, which is about 19 percent of the entire adult population.

If we can properly mine what users are tweeting about, Twitter can act as a great tool for advertisement and marketing. But this is not the only information Twitter provides. Because of its non-symmetric nature in terms of followers and followings, Twitter assists better in understanding user interests than in measuring impact on the social network. An interest graph can be thought of as a method to learn the links between individuals and their diverse interests. Computing the degree of association or correlation between individuals’ interests and potential advertisements is one of the most important applications of interest graphs. Based on these correlations, a user can be targeted so as to attain the maximum response to an advertisement campaign, along with follower recommendations.

One interesting fact about Twitter (and Facebook) is that a user does not need to be a real person. A user on Twitter (or on Facebook) can be anything and anyone: an organization, a campaign, or a famous but imaginary personality (a fictional character recognizable in the media), apart from a real/actual person. If real people follow these users on Twitter, a lot can be inferred about their personalities, and hence ads or other followers can be recommended to them based on such information. For example, @fakingnews is an Indian blog that publishes news satire ranging from Indian politics to typical Indian mindsets. People who follow @fakingnews are, in general, people who like to read satirical news. Hence, these people can be thought of as belonging to the same cluster or community. If we have another satirical blog, we can always recommend it to this community and improve the advertisement return on investment. The chances of getting hits via people belonging to this community are higher than via a community that doesn’t follow @fakingnews, or any such news, in general.

Once you have comprehended that Twitter allows you to create, link, and investigate a community of interest for an arbitrary topic, the influence of Twitter, and the knowledge one can gain from mining it, becomes clearer.

Understanding Twitter’s API

Twitter APIs provide a means to access the Twitter data, that is, tweets sent by its millions of users. Let’s get to know these APIs a bit better.

Twitter vocabulary

As described earlier, Twitter is a microblogging service with a social aspect. It allows its users to express their views/sentiments by means of an Internet SMS, called a tweet in the context of Twitter. These tweets are entities formed of a maximum of 140 characters. The content of these tweets can be anything, ranging from a person’s mood to a person’s location to a person’s curiosity. The platform on which these tweets are posted is called the timeline. To use Twitter’s APIs, one must understand the basic terminology.

Tweets are the crux of Twitter. Theoretically, a tweet is just 140 characters of text content tweeted by a user, but there is more to it than just that. There is additional metadata associated with each tweet, which is classified by Twitter as entities and places.

  • The entities consist of hashtags, URLs, and other media data that users have included in their tweet.
  • The places are locations associated with the tweet. A place may be the real-world location from which the tweet was sent, or a location mentioned in the text of the tweet.

Take the following tweet as an example:

Learn how to consume millions of tweets with @twitterapi at #TDC2014 in São Paulo #bigdata tomorrow at 2:10pm http://t.co/pTBlWzTvVd

The preceding tweet was posted by @TwitterDev and is about 132 characters long. The entities in this tweet are the user mention @twitterapi, the hashtags #TDC2014 and #bigdata, and the URL http://t.co/pTBlWzTvVd.

São Paulo is the place mentioned in this tweet.

This is one such example of a tweet with a fairly good amount of metadata. Although the actual tweet’s length is well within the 140-character limit, it carries more information than one might think. It actually enables us to figure out that this tweet belongs to a specific community, by cross-referencing the topics present in the hashtags, the URL to the website, the different users mentioned in it, and so on. The interface (web or mobile) on which the tweets are displayed is called the timeline. The tweets are, in general, arranged in chronological order of posting time. On a specific user’s account, only a certain number of tweets are displayed by Twitter, generally based on whom the given user follows and is followed by. This is the interface a user sees when logging in to his/her Twitter account.
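Although this article retrieves tweets through search, timelines can also be fetched directly with the twitteR package introduced in the following sections. Here is a minimal sketch, assuming an authenticated session has already been set up as shown later:

# A sketch, not run here: requires setup_twitter_oauth() to have been called
my_timeline <- homeTimeline(n = 10)              # recent tweets from accounts you follow
dev_tweets <- userTimeline("TwitterDev", n = 5)  # a specific user's recent tweets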

A Twitter stream is different from a Twitter timeline in the sense that it is not specific to a user. While a user’s timeline displays tweets from only a certain number of users and is updated less frequently, the Twitter stream is a chronological collection of all the tweets posted by all users. The number of active users on Twitter is on the order of hundreds of millions, and during public events of widespread interest, such as presidential debates, the flow can reach several hundred thousand tweets per minute. The behavior is very similar to a stream; hence the name.
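The stream itself is not exposed through the twitteR package used in this article; one way to sample it from R is the separate streamR package. The following is only a sketch under that assumption, where my_oauth stands for an OAuth credential object created with ROAuth:

library(streamR)
# Capture 30 seconds of the public sample stream into a local JSON file,
# then parse the captured tweets into a data frame (streamR and my_oauth
# are assumptions; they are not used elsewhere in this article)
sampleStream(file.name = "sample_tweets.json", timeout = 30, oauth = my_oauth)
tweets.df <- parseTweets("sample_tweets.json")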

You can try the following by creating a Twitter account (it will be more insightful if you have few followers to begin with). Before creating the account, it is advised that you read all of its terms and conditions. You can also start reading the API documentation.

Creating a Twitter API connection

We need to have an app created at https://dev.twitter.com/apps before making any API requests to Twitter. It’s a standard method for developers to gain API access and, more importantly, it helps Twitter observe and restrict developers making high-load API requests.

The ROAuth package is the one we are going to use in our experiments. OAuth tokens allow users to authorize third-party apps to access the data of any user account without the apps needing their passwords (or other sensitive information). ROAuth basically facilitates this.

Creating a new app

The first step to getting any kind of token access from Twitter is to create an app on it. Go to https://dev.twitter.com/ and log in with your Twitter credentials. Once you are logged in, the steps for creating an app are as follows:

  1. Go to https://apps.twitter.com/app/new.
  2. Put the name of your application in the Name field. This name can be anything you like.
  3. Similarly, enter the description in the Description field.
  4. The Website field needs to be filled with a valid URL, but again that can be any random URL.
  5. You can leave the Callback URL field blank.

After the creation of this app, we need to find the API Key and API Secret values under the Keys and Access Tokens tab.

Under the Keys and Access Tokens tab, you will also find a button to generate access tokens. Click on it and you will be provided with an Access Token and an Access Token Secret value.

Before using the preceding keys, we need to install the twitteR package to access the data in R using the app we just created, using the following code:

install.packages(c("devtools", "rjson", "bit64", "httr"))
library(devtools)
install_github("geoffjentry/twitteR")
library(twitteR)

Here’s sample code that helps us access tweets posted since any given date that contain a specific keyword. In this example, we are searching for tweets containing the word Earthquake posted since September 29, 2014. In order to get this information, we provide four pieces of information to obtain the authorization token:

  • key
  • secret
  • access token
  • access token secret

We’ll show you how to use the preceding information to get an app authorized by the user and access their resources on Twitter. The setup_twitter_oauth() function in twitteR will make our next steps very smooth and clear:

api_key <- "your_api_key"
api_secret <- "your_api_secret"
access_token <- "your_access_token"
access_token_secret <- "your_access_token_secret"

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

EarthQuakeTweets = searchTwitter("EarthQuake", since='2014-09-29')

The results of this example should simply display Using direct authentication, with 25 tweets loaded into the EarthQuakeTweets variable, as shown here:

head(EarthQuakeTweets,2)

[[1]]

[1] "TamamiJapan: RT @HistoricalPics: Japan. Top: One Month After Hiroshima, 1945. Bottom: One Month After The Earthquake and Tsunami, 2011. Incredible. http…"

 

[[2]]

[1] "OldhamDs: RT @HistoricalPics: Japan. Top: One Month After Hiroshima, 1945. Bottom: One Month After The Earthquake and Tsunami, 2011. Incredible. http…"

We have shown the first two of the 25 tweets containing the word Earthquake posted since September 29, 2014. If you closely observe the results, you’ll find all the metadata using str(EarthQuakeTweets[1]).
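Individual pieces of that metadata can also be read through accessor methods on each status object; a minimal sketch:

# Each element returned by searchTwitter() is a status object whose
# metadata is exposed through accessor methods
first_tweet <- EarthQuakeTweets[[1]]
first_tweet$getText()        # the tweet text itself
first_tweet$getScreenName()  # the author's handle
first_tweet$getCreated()     # when the tweet was posted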

Finding trending topics

Now that we understand how to create an API connection to Twitter and fetch data using it, let’s see how to find out what is trending on Twitter, that is, which topics (worldwide or local) are being talked about the most right now. Using the same API, we can easily access the trending information:

# Return a data frame with name, country, and woeid,
# where woeid is a numerical identification code describing a location ID
Locs <- availableTrendLocations()

# Filter the data frame for Delhi (India) and extract the woeid of the same
LocsIndia = subset(Locs, country == "India")
woeidDelhi = subset(LocsIndia, name == "Delhi")$woeid

# getTrends takes a specified woeid and returns the trending topics
# associated with that woeid
trends = getTrends(woeid=woeidDelhi)

The function availableTrendLocations() returns an R data frame containing the name, country, and woeid parameters. We then filter this data frame for a location of our choosing; in this example, it’s Delhi, India. The function getTrends() fetches the top trends in the location identified by the given woeid.

Here are the top four trending hashtags in the region defined by woeid = 20070458, that is, Delhi, India:

head(trends)

                   name                                                  url                    query    woeid
1 #AntiHinduNGOsExposed http://twitter.com/search?q=%23AntiHinduNGOsExposed %23AntiHinduNGOsExposed 20070458
2            #KhaasAadmi          http://twitter.com/search?q=%23KhaasAadmi            %23KhaasAadmi 20070458
3             #WinGOSF14           http://twitter.com/search?q=%23WinGOSF14             %23WinGOSF14 20070458
4      #ItsForRealONeBay    http://twitter.com/search?q=%23ItsForRealONeBay      %23ItsForRealONeBay 20070458

Searching tweets

Now, similar to trends, there is one more important function that comes with the twitteR package: searchTwitter(). This function returns tweets containing the searched string, subject to additional constraints. Some of the constraints that can be imposed are as follows:

  • lang: This constrains the tweets to the given language.
  • since/until: This constrains the tweets to those posted since the given date or until the given date.
  • geocode: This constrains the tweets to those from users located within a certain distance of the given latitude/longitude (see the sketch after this list).
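Since the example that follows uses only since/until, here is a hedged sketch of the geocode constraint; the keyword, coordinates, and radius are illustrative values (roughly New Delhi), not taken from the article:

# Hypothetical example: tweets from users within 50 km of New Delhi;
# geocode takes "latitude,longitude,radius"
delhi_tweets = searchTwitter("traffic", n = 100, lang = "en",
                             geocode = "28.6139,77.2090,50km")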

For example, here we extract tweets about the cricketer Sachin Tendulkar posted in the month of November 2014:

head(searchTwitter('Sachin Tendulkar', since='2014-11-01',
until= '2014-11-30'))

 

[[1]]

[1] "TendulkarFC: RT @Moulinparikh: Sachin Tendulkar had a long
session with the Mumbai Ranji Trophy team after today's loss."

 

[[2]]

[1] "tyagi_niharika: @WahidRuba @Anuj_dvn @Neel_D_ @alishatariq3
@VWellwishers @Meenal_Rathore oh... Yaadaaya....hmaraesachuuu
sirxedxa0xbdxedxb8x8d..i mean sachin Tendulkar"

 

[[3]]

[1] "Meenal_Rathore: @WahidRuba @Anuj_dvn @tyagi_niharika @Neel_D_
@alishatariq3 @AliaaFcc @VWellwishers .. Sachin Tendulkar
xedxa0xbdxedxb8x8a☺️"

 

[[4]]

[1] "MishraVidyanand: Vidyanand Mishra is following the Interest
"The Living Legend SachinTendu..." on http://t.co/tveHXMB4BM -
http://t.co/CocNMcxFge"

 

[[5]]

[1] "CSKalwaysWin: I have never tried to compare myself to anyone
else.n - Sachin Tendulkar"

Twitter sentiment analysis

Depending on the objective, and given the ability to search for any type of tweet from the public timeline, one can always collect the required corpus. For example, you may want to learn about customer satisfaction levels with the various cab services now coming into the Indian market. These start-ups are offering various discounts and coupons to attract customers, but at the end of the day, service quality determines the business of any organization. These start-ups constantly promote themselves on various social media websites, and customers express various levels of sentiment on the same platforms.

Let’s target the following:

  • Meru Cabs: A radio cabs service based in Mumbai, India. Launched in 2007.
  • Ola Cabs: A taxi aggregator company based in Bangalore, India. Launched in 2011.
  • TaxiForSure: A taxi aggregator company based in Bangalore, India. Launched in 2011.
  • Uber India: A taxi aggregator company headquartered in San Francisco, California. Launched in India in 2014.

Let’s set our goal to get the general sentiments about each of the preceding services providers based on the customer sentiments present in the tweets on Twitter.

Collecting tweets as a corpus

We’ll start with the searchTwitter() function (discussed previously) from the twitteR package to gather the tweets for each of the preceding organizations.

Now, in order to avoid writing the same code again and again, we push the following authorization code into a file called authenticate.R:

library(twitteR)

api_key <- "xx"
api_secret <- "xx"
access_token <- "xx"
access_token_secret <- "xx"

setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

We run the following scripts to get the required tweets:

# Load the necessary packages

source('authenticate.R')

 

Meru_tweets = searchTwitter("MeruCabs", n=2000, lang="en")
Ola_tweets = searchTwitter("OlaCabs", n=2000, lang="en")
TaxiForSure_tweets = searchTwitter("TaxiForSure", n=2000, lang="en")
Uber_tweets = searchTwitter("Uber_Delhi", n=2000, lang="en")

Now, as mentioned in Twitter’s REST API documentation, “Due to capacity constraints, the index currently only covers about a week’s worth of tweets”, so we do not always get the desired number of tweets (here, 2000). Instead, these are the sizes of the tweet lists we actually received:

> length(Meru_tweets)
[1] 393
> length(Ola_tweets)
[1] 984
> length(TaxiForSure_tweets)
[1] 720
> length(Uber_tweets)
[1] 2000

As you can see from the preceding output, the lengths of these lists are not equal to the number of tweets we asked for in our query scripts. There are many takeaways from this information. Since these tweets come from only the last week’s activity on Twitter, the counts suggest there is more discussion about these taxi services in the following order:

  • Uber India
  • Ola Cabs
  • TaxiForSure
  • Meru Cabs

A ban was imposed on Uber India after an alleged rape incident involving an Uber India driver. The decision to ban the entire organization because one of its drivers committed a crime became a matter of public outcry; hence, the number of tweets about Uber increased on social media. Meru Cabs, on the other hand, has been in India for almost 7 years now and is quite a stable organization. The amount of promotion Ola Cabs and TaxiForSure do is way higher than that of Meru Cabs, which can be one reason for Meru Cabs having the least number (393) of tweets in the last week. The numbers of tweets in the last week are comparable for Ola Cabs (984) and TaxiForSure (720). There can be several reasons for this: they both started their businesses in the same year and, more importantly, they follow the same business model. Meru Cabs is a radio taxi service that owns and manages its fleet of cars, while Ola Cabs, TaxiForSure, and Uber are marketplaces for users to compare the offerings of various operators and book easily.

Let’s dive deep into the data and get more insights.

Cleaning the corpus

Before applying any intelligent algorithms to gather more insights from the tweets collected so far, let’s first clean them. In order to clean up, we should understand what the list of tweets looks like:

head(Meru_tweets)

[[1]]

[1] "MeruCares: @KapilTwitts 2&gt;...and other details at
[email protected] We'll check back and reach out soon."

 

[[2]]

[1] "vikasraidhan: @MeruCabs really disappointed with @GenieCabs.
Cab is never assigned on time. Driver calls after 30 minutes. Why
would I ride with Meru?"

 

[[3]]

[1] "shiprachowdhary: fallback of #ubershame , #MERUCABS taking
customers for a ride"

 

[[4]]

[1] "shiprachowdhary: They book Genie, but JIT inform of
cancellation &amp; send full fare #MERUCABS . Very
disappointed.Always used these guys 4 and recommend them."

 

[[5]]

[1] "shiprachowdhary: No choice bt to take the #merucabs premium
service. Driver told me that this happens a lot with #merucabs."

 

[[6]]

[1] "shiprachowdhary: booked #Merucabsyestrdy. Asked for Meru
Genie. 10 mins 4 pick up time, they call to say Genie not available, so sending the full fare cab"

The first tweet here is a customer-care response, while the second, fourth, and fifth are actual customer sentiments about the services provided by Meru Cabs. We see:

  • Lots of meta information such as @people, URLs and #hashtags
  • Punctuation marks, numbers, and unnecessary spaces
  • Some of these tweets are retweets from other users; for the given application, we would not like to consider retweets (RTs) in sentiment analysis

We clean all these data using the following code block:

MeruTweets <- sapply(Meru_tweets, function(x) x$getText())
OlaTweets = sapply(Ola_tweets, function(x) x$getText())
TaxiForSureTweets = sapply(TaxiForSure_tweets, function(x) x$getText())
UberTweets = sapply(Uber_tweets, function(x) x$getText())

 

catch.error = function(x)
{
  # Default to a missing value (NA) in case the conversion fails
  y = NA
  # Try to convert to lower case, catching any error
  catch_error = tryCatch(tolower(x), error=function(e) e)
  # If no error occurred, keep the lower-cased text
  if (!inherits(catch_error, "error"))
    y = tolower(x)
  # Returns NA on error, otherwise the converted text
  return(y)
}

 

cleanTweets <- function(tweet) {
  # Clean the tweet for sentiment analysis
  # Remove html links, which are not required for sentiment analysis
  tweet = gsub("(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", " ", tweet)
  # Remove retweet entities from the stored tweets (text)
  tweet = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", " ", tweet)
  # Remove all "#Hashtag"
  tweet = gsub("#\\w+", " ", tweet)
  # Remove all "@people"
  tweet = gsub("@\\w+", " ", tweet)
  # Remove all the punctuation
  tweet = gsub("[[:punct:]]", " ", tweet)
  # Remove numbers; we need only text for analytics
  tweet = gsub("[[:digit:]]", " ", tweet)
  # Remove unnecessary spaces (white spaces, tabs, and so on)
  tweet = gsub("[ \t]{2,}", " ", tweet)
  tweet = gsub("^\\s+|\\s+$", "", tweet)
  # If you feel anything else should be removed (for example, slang
  # words), it can be handled with the same function and methods.
  # Finally, convert all words to lower case for a uniform pattern;
  # catch.error() returns NA for tweets that fail the conversion
  tweet = catch.error(tweet)
  tweet
}

 

cleanTweetsAndRemoveNAs <- function(Tweets) {
  TweetsCleaned = sapply(Tweets, cleanTweets)
  # Remove the "NA" tweets from this tweet list
  TweetsCleaned = TweetsCleaned[!is.na(TweetsCleaned)]
  names(TweetsCleaned) = NULL
  # Remove the repetitive tweets from this tweet list
  TweetsCleaned = unique(TweetsCleaned)
  TweetsCleaned
}

 

MeruTweetsCleaned = cleanTweetsAndRemoveNAs(MeruTweets)
OlaTweetsCleaned = cleanTweetsAndRemoveNAs(OlaTweets)
TaxiForSureTweetsCleaned = cleanTweetsAndRemoveNAs(TaxiForSureTweets)
UberTweetsCleaned = cleanTweetsAndRemoveNAs(UberTweets)

Here’s the size of each of the cleaned tweet lists:

> length(MeruTweetsCleaned)

[1] 309

> length(OlaTweetsCleaned)

[1] 811

> length(TaxiForSureTweetsCleaned)

[1] 574

> length(UberTweetsCleaned)

[1] 1355

Estimating sentiment (A)

There are many sophisticated resources available for estimating sentiment. Many research papers and open source software packages implement very complex algorithms for sentiment analysis. Now that we have the cleaned Twitter data, we are going to use a few of the R packages available to assess the sentiment in the tweets.

It’s worth mentioning here that not all tweets represent a sentiment. Some tweets can be just information/facts, while others can be customer care responses. Ideally, these should not be used to assess overall customer sentiment about a particular organization.
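One simple, hedged way to reduce this noise is to drop tweets authored by a company’s own support handle before scoring; for example, "MeruCares" appears in the sample output shown earlier. A minimal sketch:

# Hypothetical filter: remove tweets written by the support account itself,
# since they are responses rather than customer sentiment
is_support <- sapply(Meru_tweets, function(x) x$getScreenName()) == "MeruCares"
Meru_tweets_filtered <- Meru_tweets[!is_support]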

As a first step, we’ll use a naïve algorithm that scores each tweet by the number of times a positive word occurs in it minus the number of times a negative word occurs.

Please download the lists of positive and negative opinion/sentiment words for English (nearly 6,800 words in total). This opinion lexicon will be used as a first example in our sentiment analysis experiment. The good thing about this approach is that we rely on a well-researched and, at the same time, customizable input. Here are a few examples of the positive and negative sentiment words:

  • Positive: Love, best, cool, great, good, and amazing
  • Negative: Hate, worst, sucks, awful, and nightmare
> opinion.lexicon.pos =
scan('opinion-lexicon-English/positive-words.txt',
what='character', comment.char=';')
> opinion.lexicon.neg =
scan('opinion-lexicon-English/negative-words.txt',
what='character', comment.char=';')
> head(opinion.lexicon.neg)
[1] "2-faced"    "2-faces"    "abnormal"   "abolish"    "abominable" "abominably"
> head(opinion.lexicon.pos)
[1] "a+"         "abound"     "abounds"    "abundance"  "abundant"   "accessable"

We’ll add a few industry-specific and/or especially emphatic terms based on our requirements:

pos.words = c(opinion.lexicon.pos, 'upgrade')
neg.words = c(opinion.lexicon.neg, 'wait', 'waiting', 'wtf', 'cancellation')

Now, we create a function, getSentimentScore(), which computes the raw sentiment score based on this simple matching algorithm:

getSentimentScore = function(sentences, words.positive,
words.negative, .progress='none')
{
  require(plyr)
  require(stringr)

  scores = laply(sentences,
    function(sentence, words.positive, words.negative) {
      # First remove digits, punctuation, and control characters
      sentence = gsub('[[:cntrl:]]', '', gsub('[[:punct:]]', '',
        gsub('\\d+', '', sentence)))
      # Then convert everything to lower case
      sentence = tolower(sentence)
      # Now split each sentence by the space delimiter
      words = unlist(str_split(sentence, '\\s+'))
      # Get the Boolean match of each word against the positive and
      # negative opinion lexicons
      pos.matches = !is.na(match(words, words.positive))
      neg.matches = !is.na(match(words, words.negative))
      # The score is the total of positive matches minus negative matches
      score = sum(pos.matches) - sum(neg.matches)
      return(score)
    }, words.positive, words.negative, .progress=.progress)

  # Return a data frame with each sentence and its score
  return(data.frame(text=sentences, score=scores))
}

Now, we apply the preceding function on the corpus of tweets collected and cleaned so far:

MeruResult = getSentimentScore(MeruTweetsCleaned, pos.words, neg.words)
OlaResult = getSentimentScore(OlaTweetsCleaned, pos.words, neg.words)
TaxiForSureResult = getSentimentScore(TaxiForSureTweetsCleaned, pos.words, neg.words)
UberResult = getSentimentScore(UberTweetsCleaned, pos.words, neg.words)

Here are some sample results:

Tweets for Meru Cabs (with scores):

  • gt and other details at feedback com we ll check back and reach out soon (Score: 0)
  • really disappointed with cab is never assigned on time driver calls after minutes why would i ride with meru (Score: -1)
  • so after years of bashing today i m pleasantly surprised clean car courteous driver prompt pickup mins efficient route (Score: 4)
  • a min drive cost hrs used to cost less ur unreliable and expensive trying to lose ur customers (Score: -3)

Tweets for Ola Cabs (with scores):

  • the service is going from bad to worse the drivers deny to come after a confirmed booking (Score: -3)
  • love the olacabs app give it a swirl sign up with my referral code dxf n and earn rs download the app from (Score: 1)
  • crn kept me waiting for mins amp at last moment driver refused pickup so unreliable amp irresponsible (Score: -4)
  • this is not the first time has delighted me punctuality and free upgrade awesome that (Score: 4)

Tweets for TaxiForSure (with scores):

  • great service now i have become a regular customer of tfs thank you for the upgrade as well happy taxi ing saving (Score: 5)
  • really disappointed with cab is never assigned on time driver calls after minutes why would i ride with meru (Score: -1)
  • horrible taxi service had to wait for one hour with a new born in the chilly weather of new delhi waiting for them (Score: -4)
  • what do i get now if you resolve the issue after i lost a crucial business because of the taxi delay (Score: -3)

Tweets for Uber India (with scores):

  • that s good uber s fares will prob be competitive til they gain local monopoly then will go sky high as in new york amp delhi saving (Score: 3)
  • from a shabby backend app stack to daily pr fuck ups its increasingly obvious that is run by child minded blow hards (Score: -3)
  • you say that uber is illegally running were you stupid to not ban earlier and only ban it now after the rape (Score: -3)
  • perhaps uber biz model does need some looking into it s not just in delhi that this happens but in boston too (Score: 0)

From the preceding observations, it’s clear that this basic sentiment analysis method works fine in normal circumstances, but in the case of Uber India the results deviate too much from a subjective score. It’s safe to say that basic word matching gives a good indicator of overall customer sentiment, except when the data itself is not reliable. In our case, the tweets about Uber India are not really about the services Uber provides but about a single incident of crime by one of its drivers, so the whole score went haywire.

Let’s now compute a point statistic of the scores we have computed so far. Since the numbers of tweets are not equal for the four organizations, we compute a mean and standard deviation for each.

Organization    Mean Sentiment Score    Standard Deviation
Meru Cabs              -0.2218543                1.301846
Ola Cabs                0.197724                 1.170334
TaxiForSure            -0.09841828               1.154056
Uber India             -0.6132666                1.071094
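The preceding table can be reproduced with a short sketch like the following, using the score column of the data frames returned by getSentimentScore():

# Compute the mean and standard deviation of the raw sentiment scores
# for each service (variable names as defined above)
sapply(list(Meru = MeruResult$score, Ola = OlaResult$score,
            TaxiForSure = TaxiForSureResult$score, Uber = UberResult$score),
       function(s) c(mean = mean(s), sd = sd(s)))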

Estimating sentiment (B)

Let’s now move one step further. Instead of using simple matching against an opinion lexicon, we’ll use Naive Bayes to decide on the emotion present in a tweet. We require the packages Rstem and sentiment to assist in this. It’s important to mention here that both of these packages are no longer available on CRAN, and hence we have to provide the repository location as a parameter to the install.packages() function or install them from a source archive. Here’s the R script to install the required packages:

install.packages("Rstem",
repos = "http://www.omegahat.org/R", type="source")

require(devtools)

install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz") require(sentiment)

ls("package:sentiment")

Now that we have the sentiment and Rstem packages installed in our R workspace, we can build the Bayes classifier for sentiment analysis:

library(sentiment)

# classify_emotion() returns a data frame with seven columns (anger,
# disgust, fear, joy, sadness, surprise, best_fit) and one row per document
MeruTweetsClassEmo = classify_emotion(MeruTweetsCleaned,
algorithm="bayes", prior=1.0)
OlaTweetsClassEmo = classify_emotion(OlaTweetsCleaned,
algorithm="bayes", prior=1.0)
TaxiForSureTweetsClassEmo = classify_emotion(TaxiForSureTweetsCleaned,
algorithm="bayes", prior=1.0)
UberTweetsClassEmo = classify_emotion(UberTweetsCleaned,
algorithm="bayes", prior=1.0)

This produces Bayesian emotion classifications for the Meru Cabs tweets; we generated results for the other cab services in our problem setup in the same way.

The sentiment package was built on a trained dataset of emotion words (nearly 1,500 words). The function classify_emotion() assigns each document to one of six emotions: anger, disgust, fear, joy, sadness, or surprise. When the system is not able to classify the overall emotion into any of the six, NA is returned.

Let’s substitute these NA values with the word unknown to make the further analysis easier:

# we will fetch emotion category best_fit for our analysis purposes.

MeruEmotion = MeruTweetsClassEmo[,7]

OlaEmotion = OlaTweetsClassEmo[,7]

TaxiForSureEmotion = TaxiForSureTweetsClassEmo[,7]

UberEmotion = UberTweetsClassEmo[,7]

 

MeruEmotion[is.na(MeruEmotion)] = "unknown"

OlaEmotion[is.na(OlaEmotion)] = "unknown"

TaxiForSureEmotion[is.na(TaxiForSureEmotion)] = "unknown"

UberEmotion[is.na(UberEmotion)] = "unknown"

This gives us the best-fit emotion for every tweet.
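A quick text view of the distribution of these best-fit emotions can be obtained with table(); a minimal sketch:

# Tabulate the best-fit emotion counts per service
table(MeruEmotion)
table(OlaEmotion)
table(TaxiForSureEmotion)
table(UberEmotion)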

Further, we’ll use another function provided by the sentiment package, classify_polarity(), to classify the tweets into two classes, pos (positive sentiment) or neg (negative sentiment). The idea is to compute the log likelihood of a tweet assuming it belongs to either of the two classes. Once these likelihoods are calculated, the ratio of the pos-likelihood to the neg-likelihood is computed, and based on this ratio the tweets are classified as belonging to a particular class. It’s important to note that if this ratio turns out to be 1, the overall sentiment of the tweet is assumed to be neutral. The code is as follows:

MeruTweetsClassPol = classify_polarity(MeruTweetsCleaned,
algorithm="bayes")

OlaTweetsClassPol = classify_polarity(OlaTweetsCleaned,
algorithm="bayes")

TaxiForSureTweetsClassPol =
classify_polarity(TaxiForSureTweetsCleaned, algorithm="bayes")

UberTweetsClassPol = classify_polarity(UberTweetsCleaned,
algorithm="bayes")

This gives us, for example, the polarity classifications from the classify_polarity() function of the sentiment package for the Meru Cabs tweets. We’ll now consolidate the results of the two functions into a data frame for each cab service for plotting purposes:

# we will fetch polarity category best_fit for our analysis purposes,

MeruPol = MeruTweetsClassPol[,4]

OlaPol = OlaTweetsClassPol[,4]

TaxiForSurePol = TaxiForSureTweetsClassPol[,4]

UberPol = UberTweetsClassPol[,4]

 

# Let us now create a data frame with the above results

MeruSentimentDataFrame = data.frame(text=MeruTweetsCleaned,
emotion=MeruEmotion, polarity=MeruPol, stringsAsFactors=FALSE)

OlaSentimentDataFrame = data.frame(text=OlaTweetsCleaned,
emotion=OlaEmotion, polarity=OlaPol, stringsAsFactors=FALSE)

TaxiForSureSentimentDataFrame =
data.frame(text=TaxiForSureTweetsCleaned,
emotion=TaxiForSureEmotion, polarity=TaxiForSurePol,
stringsAsFactors=FALSE)

UberSentimentDataFrame = data.frame(text=UberTweetsCleaned,
emotion=UberEmotion, polarity=UberPol, stringsAsFactors=FALSE)

 

# rearrange data inside the frame by sorting it

MeruSentimentDataFrame = within(MeruSentimentDataFrame, emotion <-
factor(emotion, levels=names(sort(table(emotion),
decreasing=TRUE))))

OlaSentimentDataFrame = within(OlaSentimentDataFrame, emotion <-
factor(emotion, levels=names(sort(table(emotion),
decreasing=TRUE))))

TaxiForSureSentimentDataFrame =
within(TaxiForSureSentimentDataFrame, emotion <- factor(emotion,
levels=names(sort(table(emotion), decreasing=TRUE))))

UberSentimentDataFrame = within(UberSentimentDataFrame, emotion <-
factor(emotion, levels=names(sort(table(emotion),
decreasing=TRUE))))

plotSentiments1<- function (sentiment_dataframe,title) {

library(ggplot2)

ggplot(sentiment_dataframe, aes(x=emotion)) +
geom_bar(aes(y=..count.., fill=emotion)) +

scale_fill_brewer(palette="Dark2") +

ggtitle(title) +

   theme(legend.position='right') + ylab('Number of Tweets') +
xlab('Emotion Categories')

}

 

plotSentiments1(MeruSentimentDataFrame, 'Sentiment Analysis of
Tweets on Twitter about MeruCabs')

plotSentiments1(OlaSentimentDataFrame, 'Sentiment Analysis of
Tweets on Twitter about OlaCabs')

plotSentiments1(TaxiForSureSentimentDataFrame, 'Sentiment Analysis
of Tweets on Twitter about TaxiForSure')

plotSentiments1(UberSentimentDataFrame, 'Sentiment Analysis of
Tweets on Twitter about UberIndia')

The output is a set of bar charts, one per service, produced with the single helper function plotSentiments1() defined above; the first chart shows the emotion distribution for the Meru Cabs tweets.

The following dashboard shows the analysis for Ola Cabs:

The following dashboard shows the analysis for TaxiForSure:

The following dashboard shows the analysis for Uber India:

These sentiments reflect more or less the same observations as the basic word-matching algorithm. Tweets expressing joy constitute the largest share for all these organizations, indicating that these organizations are trying their best to provide good service in the country. The sadness tweets are less numerous than the joy tweets. However, compared with each other, the distributions indicate the overall market share versus the level of customer satisfaction of each service provider in question. Similarly, these graphs can be used to assess the level of dissatisfaction in terms of anger and disgust in the tweets.
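To compare the services numerically rather than visually, the emotion shares can also be tabulated; a minimal sketch using the data frames built above:

# Proportion of tweets per emotion category (shown for two services)
round(prop.table(table(MeruSentimentDataFrame$emotion)), 2)
round(prop.table(table(OlaSentimentDataFrame$emotion)), 2)

Let’s now consider only the positive and negative sentiments present in the tweets: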

# Similarly we will plot distribution of polarity in the tweets

plotSentiments2 <- function (sentiment_dataframe,title) {

library(ggplot2)

ggplot(sentiment_dataframe, aes(x=polarity)) +

geom_bar(aes(y=..count.., fill=polarity)) +

scale_fill_brewer(palette="RdGy") +

ggtitle(title) +

   theme(legend.position='right') + ylab('Number of Tweets') +
xlab('Polarity Categories')

}

 

plotSentiments2(MeruSentimentDataFrame, 'Polarity Analysis of
Tweets on Twitter about MeruCabs')

plotSentiments2(OlaSentimentDataFrame, 'Polarity Analysis of
Tweets on Twitter about OlaCabs')

plotSentiments2(TaxiForSureSentimentDataFrame, 'Polarity Analysis
of Tweets on Twitter about TaxiForSure')

plotSentiments2(UberSentimentDataFrame, 'Polarity Analysis of
Tweets on Twitter about UberIndia')

The output is as follows:

The following dashboard shows the polarity analysis for Ola Cabs:

The following dashboard shows the analysis for TaxiForSure:

The following dashboard shows the analysis for Uber India:

It’s a basic human trait to report what went wrong rather than what went right. That is to say, we tend to tweet/report when something bad has happened rather than when the experience was good. Hence, negative tweets are generally expected to outnumber positive ones. Still, over a period of time (a week in our case), the ratio of the two easily reflects the overall market share versus the level of customer satisfaction of each service provider in question.
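As a quick numeric check, this positive-to-negative ratio can be computed from the polarity column; a minimal sketch for one service, assuming the data frames built earlier:

# classify_polarity() best_fit values are "positive", "negative",
# or "neutral"; compare the first two for Meru Cabs
pol <- table(MeruSentimentDataFrame$polarity)
pol["positive"] / pol["negative"]

Next, we try to get a sense of the overall content of the tweets using word clouds.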

library(tm)

removeCustomeWords <- function (TweetsCleaned) {
  for (i in 1:length(TweetsCleaned)) {
    TweetsCleaned[i] <- tryCatch({
      # removeWords() and stopwords() come from the tm package
      TweetsCleaned[i] = removeWords(TweetsCleaned[i],
        c(stopwords("english"), "care", "guys", "can", "dis", "didn",
          "guy", "booked", "plz"))
      TweetsCleaned[i]
    }, error=function(cond) {
      TweetsCleaned[i]
    }, warning=function(cond) {
      TweetsCleaned[i]
    })
  }
  return(TweetsCleaned)
}

 

getWordCloud <- function
(sentiment_dataframe, TweetsCleaned, Emotion) {

emos = levels(factor(sentiment_dataframe$emotion))

n_emos = length(emos)

emo.docs = rep("", n_emos)

TweetsCleaned = removeCustomeWords(TweetsCleaned)

 

for (i in 1:n_emos){

   emo.docs[i] = paste(TweetsCleaned[Emotion ==
emos[i]], collapse=" ")

}

corpus = Corpus(VectorSource(emo.docs))

tdm = TermDocumentMatrix(corpus)

tdm = as.matrix(tdm)

colnames(tdm) = emos

require(wordcloud)

suppressWarnings(comparison.cloud(tdm, colors =
brewer.pal(n_emos, "Dark2"), scale = c(3,.5), random.order =
FALSE, title.size = 1.5))

}

getWordCloud(MeruSentimentDataFrame, MeruTweetsCleaned,
MeruEmotion)

getWordCloud(OlaSentimentDataFrame, OlaTweetsCleaned, OlaEmotion)

getWordCloud(TaxiForSureSentimentDataFrame, TaxiForSureTweetsCleaned, TaxiForSureEmotion)

getWordCloud(UberSentimentDataFrame, UberTweetsCleaned, UberEmotion)

The preceding calls generate the word clouds for Meru Cabs, Ola Cabs, TaxiForSure, and Uber India, respectively, comparing the most frequent words across the emotion categories in each corpus.

Summary

In this article, we gained knowledge of the various Twitter APIs, discussed how to create a connection with Twitter, and saw how to retrieve tweets with various attributes. We saw the power of Twitter in helping us determine customer attitudes toward today’s various businesses. The activity can be performed on a weekly basis, and one can easily track monthly, quarterly, or yearly changes in customer sentiment. This can not only help customers identify trending businesses, but also give a business itself a well-defined metric of its own performance, which it can use to improve. We also discussed various methods of sentiment analysis, ranging from basic word matching to advanced Bayesian algorithms.
