
Note: This interesting article is an excerpt from the book Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk. The book contains useful techniques to gain valuable insights from different social media channels using popular Python packages.

In this article, we explore how to leverage the power of Python in order to gather and process data from GitHub and make it analysis-ready.

Those who love to code love GitHub. GitHub has taken the widely used version control approach to coding to the highest possible level by bringing social network features to the world of programming. No wonder GitHub is also thought of as social coding. We thought a book on social network analysis would not be complete without a use case on data from GitHub.

GitHub allows you to create code repositories and provides multiple collaborative features, such as bug tracking, feature requests, task management, and wikis. It has about 20 million users and 57 million code repositories (source: Wikipedia). Statistics like these make it arguably the most representative platform for programmers. It is also home to several open source projects that have contributed greatly to the world of software development. Programming technology is evolving at a fast pace, especially due to the open source movement, and we have to be able to keep track of emerging technologies. Assuming that the latest programming tools and technologies are being used on GitHub, analyzing GitHub could help us detect the most popular technologies. The popularity of repositories on GitHub is assessed through the number of commits they receive from the community. We will use the GitHub API in this article to gather data around the repositories with the most commits and then discover the most popular technologies within them. For all we know, the results we get may reveal the next great innovations.

Scope and process

The GitHub API allows us to get information about public code repositories submitted by users. It covers lots of open source, educational, and personal projects. Our focus is to find the trending technologies and programming languages of the last few months, and to compare them with repositories from past years. We will collect all the meta information about the repositories, such as:

  • Name: The name of the repository
  • Description: A description of the repository
  • Watchers: People following the repository and getting notified about its activity
  • Forks: Users cloning the repository to their own accounts
  • Open Issues: Issues submitted about the repository

We will use this data, a combination of qualitative and quantitative information, to identify the most recent trends and weak signals. The process can be represented by the steps shown in the following figure:

[Figure: Mining social media trends]

Getting the data

Before using the API, we need to set up authorization. The API gives you access to all publicly available data, but some endpoints need user permission. You can create a new token with specific scope access using the application settings. The scope depends on your application's needs, such as accessing user email, updating the user profile, and so on. Password authorization is only needed in some cases, such as access by applications authorized by the user. In that case, you need to provide your username or email, and your password.

All API access is over HTTPS, and accessed from the https://api.github.com/ domain. All data is sent and received as JSON.
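For example, a token-based request with the requests library could look like the sketch below. The GITHUB_TOKEN environment variable used here is just an illustration of where such a token might be stored; the rest of the article works with unauthenticated requests as well, only at a lower rate limit.

import os
import requests

# Illustrative setup: read a personal access token from an environment variable
token = os.environ.get('GITHUB_TOKEN')
headers = {'Authorization': 'token ' + token} if token else {}

# A simple authenticated call to check that the credentials work
res = requests.get('https://api.github.com/user', headers=headers)
print(res.status_code)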

Rate Limits

The GitHub Search API is designed to help find specific items (repositories, users, and so on). The rate limit policy allows up to 1,000 results for each search. For requests using basic authentication, OAuth, or a client ID and secret, you can make up to 30 requests per minute. For unauthenticated requests, the rate limit allows you to make up to 10 requests per minute.
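A quick way to keep track of the remaining quota is to call the /rate_limit endpoint, or to read the X-RateLimit-* headers that GitHub attaches to every response. The short sketch below illustrates both.

import requests

res = requests.get('https://api.github.com/rate_limit')
print(res.json()['resources']['search'])          # limit, remaining and reset time for search requests
print(res.headers.get('X-RateLimit-Remaining'))   # remaining quota for this category of requests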

Connection to GitHub

GitHub offers a search endpoint which returns all the repositories matching a query. As we go along, we will change the value of the variable q (query) in different steps of the analysis. In the first part, we will retrieve all the repositories created since January 1, 2017, and then we will compare the results with previous years.

Firstly, we initialize an empty list, results, which stores all the data about repositories. Secondly, we build GET requests with the parameters required by the API. We can only get 100 results per request, so we have to use pagination to build a complete dataset.

import requests

results = []
q = "created:>2017-01-01"

def search_repo_paging(q):
    url = 'https://api.github.com/search/repositories'
    params = {'q': q, 'sort': 'forks', 'order': 'desc', 'per_page': 100}
    while True:
        res = requests.get(url, params=params)
        result = res.json()
        results.extend(result['items'])
        # the 'next' link already encodes the full query, so we drop the params
        params = {}
        try:
            url = res.links['next']['url']
        except KeyError:
            # no 'next' link means we have reached the last page
            break

search_repo_paging(q)

In the first request we have to pass all the parameters to the GET method. Then, we make a new request for every next page, whose URL can be found in res.links['next']['url']. Since res.links contains the full link to the resource, including all the other parameters, we empty the params dictionary.

The operation is repeated until there is no 'next' key in the res.links dictionary. For the other datasets, we modify the search query so that we retrieve repositories from previous years. For example, to get the data from 2015, we define the following query:

q = "created:2015-01-01..2015-12-31"

In order to find proper repositories, the API provides a wide range of query parameters. It is possible to search for repositories with high precision using the system of qualifiers.

Starting with the main search parameter q, we have the following options:

  • sort: Set to forks as we are interested in finding the repositories having the largest number of forks (you can also sort by number of stars or update time)
  • order: Set to descending order
  • per_page: Set to the maximum amount of returned repositories

Naturally, the search parameter q can contain multiple combinations of qualifiers.
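For instance, qualifiers can be combined freely inside the query string; the values below are purely illustrative.

# Python repositories with more than 500 stars, created in 2017
q = "created:>2017-01-01 language:python stars:>500"

# Repositories mentioning deep learning in their name or description
q = "deep learning in:name,description created:>2017-01-01"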

Data pull

The amount of data we collect through the GitHub API is small enough to fit in memory, so we can deal with it directly in a pandas dataframe. If more data is required, we would recommend storing it in a database, such as MongoDB.
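For reference, a minimal pymongo sketch for persisting the raw results could look as follows; the database and collection names are placeholders.

from pymongo import MongoClient

client = MongoClient('localhost', 27017)
db = client['github_data']                 # placeholder database name
db['repositories'].insert_many(results)    # one document per repository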

We use JSON tools to convert the results into a clean JSON and to create a dataframe.

from pandas.io.json import json_normalize
import json
import pandas as pd
import bson.json_util as json_util

sanitized = json.loads(json_util.dumps(results))
normalized = json_normalize(sanitized)
df = pd.DataFrame(normalized)

The dataframe df contains columns related to all the results returned by the GitHub API. We can list them by typing the following:

df.columns

Index(['archive_url', 'assignees_url', 'blobs_url', 'branches_url',
       'clone_url', 'collaborators_url', 'comments_url', 'commits_url',
       'compare_url', 'contents_url', 'contributors_url', 'default_branch',
       'deployments_url', 'description', 'downloads_url', 'events_url',
       'fork', 'forks', 'forks_count', 'forks_url', 'full_name',
       'git_commits_url', 'git_refs_url', 'git_tags_url', 'git_url',
       'has_downloads', 'has_issues', 'has_pages', 'has_projects',
       'has_wiki', 'homepage', 'hooks_url', 'html_url', 'id',
       'issue_comment_url', 'issue_events_url', 'issues_url', 'keys_url',
       'labels_url', 'language', 'languages_url', 'merges_url',
       'milestones_url', 'mirror_url', 'name', 'notifications_url',
       'open_issues', 'open_issues_count', 'owner.avatar_url',
       'owner.events_url', 'owner.followers_url', 'owner.following_url',
       'owner.gists_url', 'owner.gravatar_id', 'owner.html_url', 'owner.id',
       'owner.login', 'owner.organizations_url', 'owner.received_events_url',
       'owner.repos_url', 'owner.site_admin', 'owner.starred_url',
       'owner.subscriptions_url', 'owner.type', 'owner.url', 'private',
       'pulls_url', 'pushed_at', 'releases_url', 'score', 'size', 'ssh_url',
       'stargazers_count', 'stargazers_url', 'statuses_url',
       'subscribers_url', 'subscription_url', 'svn_url', 'tags_url',
       'teams_url', 'trees_url', 'updated_at', 'url', 'watchers',
       'watchers_count', 'year'],
      dtype='object')

Then, we select a subset of variables which will be used for further analysis. Our choice is based on the meaning of each variable: we skip all the technical variables related to URLs, owner information, or IDs. The remaining columns contain information which is very likely to help us identify new technology trends:

  • description: A user description of a repository
  • watchers_count: The number of watchers
  • size: The size of repository in kilobytes
  • forks_count: The number of forks
  • open_issues_count: The number of open issues
  • language: The programming language the repository is written in

We have selected watchers_count as the criterion to measure the popularity of repositories. This number indicates how many people are interested in the project. However, we may also use forks_count, which gives us slightly different information about popularity: it represents the number of people who actually worked with the code, so it relates to a different group.
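A simple way to look at just this subset is to slice the dataframe by the list of columns; we keep it in a separate dataframe here so that the full df remains available for the following steps.

selected = ['description', 'watchers_count', 'size',
            'forks_count', 'open_issues_count', 'language']
df_selected = df[selected]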

Data processing

In the previous step we structured the raw data which is now ready for further analysis. Our objective is to analyze two types of data:

  • Textual data in description
  • Numerical data in other variables

Each of them requires a different pre-processing technique. Let's take a look at each type in detail.

Textual data

For the first kind, we have to create a new variable which contains a cleaned string. We will do it in three steps:

  • Selecting English descriptions
  • Tokenization
  • Stopwords removal

As we work only on English data, we should remove all the descriptions which are written in other languages. The main reason to do so is that each language requires a different processing and analysis flow. If we left descriptions in Russian or Chinese, we would have very noisy data which we would not be able to interpret. As a consequence, we can say that we are analyzing trends in the English-speaking world.

Firstly, we drop all the rows with an empty (missing) description:

df = df.dropna(subset=['description'])

In order to remove non-English descriptions, we first have to detect which language is used in each text. For this purpose we use a library called langdetect, which is based on the Google language detection project (https://github.com/shuyo/language-detection).

from langdetect import detect

df['lang'] = df.apply(lambda x: detect(x['description']),axis=1)

We create a new column which contains all the predictions. We see different languages, such as en (English), zh-cn (Chinese), vi (Vietnamese), or ca (Catalan).

df['lang']

0        en
1        en
2        en
3        en
4        en
5     zh-cn
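The share of each detected language can be checked with a value_counts call, which is how a figure such as the 78.7% quoted below can be obtained.

df['lang'].value_counts(normalize=True).head(10)   # fraction of repositories per detected language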

In our dataset en represents 78.7% of all the repositories. We will now select only those repositories with a description in English:

df = df[df['lang'] == 'en']

In the next step, we will create a new clean column with pre-processed textual data. We execute the following code to perform tokenization and remove stopwords:

import string

import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords

# nltk.download('punkt') and nltk.download('stopwords') may be needed on the first run

def clean(text='', stopwords=[]):
    # tokenize
    tokens = word_tokenize(text.strip())
    # lowercase
    clean = [i.lower() for i in tokens]
    # remove stopwords
    clean = [i for i in clean if i not in stopwords]
    # remove punctuation
    punctuations = list(string.punctuation)
    clean = [i.strip(''.join(punctuations)) for i in clean if i not in punctuations]
    return " ".join(clean)

df['clean'] = df['description'].apply(str)  # make sure description is a string
df['clean'] = df['clean'].apply(lambda x: clean(text=x, stopwords=stopwords.words('english')))

Finally, we obtain a clean column which contains cleaned English descriptions, ready for analysis:

df['clean'].head(5)

0    roadmap becoming web developer 2017
1    base repository imad v2 course application ple…
2    decrypted content eqgrp-auction-file.tar.xz
3    shadow brokers lost translation leak
4    learn design large-scale systems prep system d...

Numerical data

For the numerical data, we will examine the distribution of values and check whether there are any missing values:

df[['watchers_count','size','forks_count','open_issues']].describe()

[Output of df.describe() showing the summary statistics for the four numerical variables]

We see that there are no missing values in any of the four variables: watchers_count, size, forks_count, and open_issues. The watchers_count varies from 0 to 20,792, while the number of forks starts at a minimum of 33 and goes up to 2,589. The first quartile of repositories has no open issues, while the top 25% have more than 12. It is worth noticing that, in our dataset, there is a repository with 458 open issues.
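As a quick cross-check of the missing-value statement, the counts can also be obtained directly:

df[['watchers_count', 'size', 'forks_count', 'open_issues']].isnull().sum()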

Once we are done with the pre-processing of the data, our next step would be to analyze it, in order to get actionable insights from it.

If you found this article to be useful, stay tuned for Part 2, where we perform analysis on the processed GitHub data and determine the top trending technologies. Alternatively, you can check out the book Python Social Media Analytics, to learn how to get valuable insights from various social media sites such as Facebook, Twitter and more.
