
Note: This article is an excerpt taken from the book Python Social Media Analytics, written by Siddhartha Chatterjee and Michal Krystyanczuk. In this book, you will find widely used social media mining techniques for extracting useful insights to drive your business.

In Part 1 of this series, we gathered the GitHub data for analysis. Here, we will analyze that data to get interesting insights into the highest trending and most popular tools and languages on GitHub.

We have seen so far that the GitHub API provides interesting sets of information about code repositories and metadata on the activity of users around those repositories. In the following sections, we will analyze this data to find the most popular repositories, first through the analysis of their descriptions and then by drilling down into the watchers, forks, and issues submitted for the emerging technologies. Since technology evolves so rapidly, this approach can help us stay on top of the latest trends. To find out which technologies are trending, we will perform the analysis in three steps: identifying the top technologies, comparing the popularity of programming languages over time, and mapping the programming languages used in the top technologies.

Identifying top technologies

First of all, we will use text analytics techniques to identify the most popular technology-related phrases in the descriptions of repositories from 2017. Our analysis will focus on the most frequent bigrams.

We import the nltk.collocations module, which implements n-gram search tools:

import nltk

from nltk.collocations import *

Then, we convert the clean description column into a list of token lists:

list_documents = df['clean'].apply(lambda x: x.split()).tolist()

As we perform the analysis on documents, we use the from_documents method instead of the default from_words. The difference between the two lies in the input data format: from_documents takes a list of token lists as its argument and searches for n-grams document-wise rather than corpus-wise. This protects against detecting bigrams composed of the last word of one document and the first word of the next (a small illustration follows the next snippet):

bigram_measures = nltk.collocations.BigramAssocMeasures()

bigram_finder = BigramCollocationFinder.from_documents(list_documents)
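To make the document-wise behaviour concrete, here is a tiny illustration with toy data (not from the book):

docs = [['machine', 'learning'], ['deep', 'learning']]

# Document-wise: only bigrams occurring inside a single document are counted.
finder_docs = BigramCollocationFinder.from_documents(docs)
print(sorted(finder_docs.ngram_fd))  # [('deep', 'learning'), ('machine', 'learning')]

# Corpus-wise: flattening the documents into one stream also yields
# ('learning', 'deep'), which spans the boundary between the two documents.
finder_words = BigramCollocationFinder.from_words([w for d in docs for w in d])
print(sorted(finder_words.ngram_fd))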

We take into account only bigrams which appear at least three times in our document set:

bigram_finder.apply_freq_filter(3)

We can use different association measures to find the best bigrams, such as raw frequency, PMI (pointwise mutual information), Student's t, or chi-squared. We will mostly be interested in the raw frequency measure, which is the simplest and most convenient indicator in our case.

We get the top 20 bigrams according to the raw_freq measure:

bigrams = bigram_finder.nbest(bigram_measures.raw_freq, 20)
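The same nbest call accepts any of the other association measures; for example, a quick sketch (purely for comparison) ranking the bigrams by PMI instead:

# Rank by pointwise mutual information rather than raw frequency.
top_pmi = bigram_finder.nbest(bigram_measures.pmi, 20)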

We can also obtain their scores by applying the score_ngrams method:

scores = bigram_finder.score_ngrams(bigram_measures.raw_freq)

All the other measures are available as attributes of BigramAssocMeasures. To try them, replace raw_freq with pmi, student_t, or chi_sq, respectively. However, to create a visualization we will need the actual number of occurrences instead of scores. We create a list by using the ngram_fd.items() method and we sort it in descending order:

ngram = list(bigram_finder.ngram_fd.items())

ngram.sort(key=lambda item: item[-1], reverse=True)

This gives us a list of tuples, each containing a bigram (itself a tuple of tokens) and its frequency. We transform it into a simple list of tuples in which the bigram tokens are joined:

frequency = [(" ".join(k), v) for k,v in ngram]

For simplicity, we put the frequency list into a dataframe:

import pandas as pd  # assumed already imported in Part 1; repeated for completeness

df = pd.DataFrame(frequency)

And then, we plot the top 20 technologies in a bar chart:

import matplotlib.pyplot as plt

plt.style.use('ggplot')

df.set_index([0], inplace = True)

df.sort_values(by = [1], ascending = False).head(20).plot(kind = 'barh')

plt.title('Trending Technologies')

plt.ylabel('Technology')

plt.xlabel('Popularity')

plt.legend().set_visible(False)

plt.axvline(x=14, color='b', label='Average', linestyle='--', linewidth=3)

for custom in [0, 10, 14]:
    plt.text(14.2, custom, "Neural Networks", fontsize = 12, va = 'center',
             bbox = dict(boxstyle='square', fc='white', ec='none'))

plt.show()

We’ve added an additional line to the chart which helps us aggregate all the technologies related to neural networks. This is done manually, by selecting elements by their indices ((0, 10, 14) in this case), and can be useful for interpretation; a more programmatic alternative is sketched after the chart.

[Figure: Trending Technologies — horizontal bar chart of the top 20 bigrams by popularity, with the neural-network-related bars annotated]
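As a less manual alternative, here is a sketch (not from the book; the keyword list is an assumption) that aggregates the neural-network-related bigrams programmatically from the frequency list instead of using hand-picked indices:

# Sum the counts of every bigram containing an (assumed) neural-network keyword.
nn_keywords = ('neural', 'deep learning', 'tensorflow')
nn_total = sum(count for bigram, count in frequency
               if any(kw in bigram for kw in nn_keywords))
print('Neural-network-related bigram occurrences:', nn_total)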

The preceding analysis provides us with an interesting set of the most popular technologies on GitHub, covering software engineering, programming languages, and artificial intelligence. An important thing to note is that technology around neural networks emerges more than once, notably deep learning, TensorFlow, and other specific projects. This is not surprising, since neural networks, an important component in the field of artificial intelligence, have been discussed and practiced heavily in the last few years. So, if you’re an aspiring programmer interested in AI and machine learning, this is a field to dive into!

Programming languages 

The next step in our analysis is the comparison of popularity between different programming languages. It will be based on samples of the top 1,000 most popular repositories by year.

Firstly, we get the data for the last three years:

queries = ["created:>2017-01-01", "created:2015-01-01..2015-12-31",
           "created:2016-01-01..2016-12-31"]

We reuse the search_repo_paging function from Part 1 to collect the data from the GitHub API.
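If you do not have Part 1 at hand, here is a minimal sketch of what such a paging helper might look like. This is an illustration under assumptions, not the book's actual implementation; it pages through the GitHub search API, which returns at most 1,000 results at 100 per page:

import requests

def search_repo_paging(query, pages=10):
    # Fetch up to pages x 100 repositories matching the query, sorted by stars.
    results = []
    for page in range(1, pages + 1):
        response = requests.get('https://api.github.com/search/repositories',
                                params={'q': query, 'sort': 'stars',
                                        'order': 'desc', 'per_page': 100,
                                        'page': page})
        items = response.json().get('items', [])
        if not items:
            break
        results.extend(items)
    return results

With the helper in place, we collect the data for each query and concatenate the results into a new dataframe: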

df = pd.DataFrame()
for query in queries:
    data = search_repo_paging(query)
    data = pd.io.json.json_normalize(data)
    df = pd.concat([df, data])

We convert the dataframe to a time series based on the created_at column:

df['created_at'] = df['created_at'].apply(pd.to_datetime)

df = df.set_index(['created_at'])

Then, we use the groupby aggregation method, which restructures the data by language and year, and we count the number of occurrences of each language:

dx = pd.DataFrame(df.groupby(['language', df.index.year])['language'].count())
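Before plotting, it can help to sanity-check the reshaped table. dx is indexed by a (language, year) MultiIndex, and unstacking pivots the years into columns, giving one row per language (a quick check, not in the original excerpt):

# One row per language, one column per year.
print(dx.unstack().head())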

We represent the results on a bar chart:

fig, ax = plt.subplots()

dx.unstack().plot(kind='bar', title = 'Programming Languages per Year', ax = ax)

ax.legend(['2015', '2016', '2017'], title = 'Year')

plt.show()

[Figure: Programming Languages per Year — repository counts per language for 2015, 2016, and 2017]

The preceding graph shows a multitude of programming languages, from assembly, C, C#, Java, and web and mobile languages to modern ones like Python, Ruby, and Scala. Comparing the three years, we see some interesting trends. We notice that HTML, which is the bedrock of all web development, has remained very stable over the last three years; it is not something that will be replaced in a hurry. Ruby, once very popular, is now declining in popularity. The popularity of Python, also our language of choice for this book, is going up. Finally, Swift, the cross-device programming language initially created by Apple but now open source, is getting extremely popular over time. It will be interesting to see over the next few years whether these trends change or hold true.

Programming languages used in top technologies

Now that we know the top programming languages and technologies quoted in repository descriptions, in this section we will combine this information to find out the main programming languages used for each technology.

We select four technologies from the previous section and print the corresponding programming languages. We look up the column containing the cleaned repository descriptions and create a set of the languages related to each technology. Using a set ensures that we get unique values.

technologies_list = ['software engineering', 'deep learning',
                     'open source', 'exercise practice']
for tech in technologies_list:
    print(tech)
    print(set(df[df['clean'].str.contains(tech)]['language']))

software engineering

{'HTML', 'Java'}

deep learning

{'Jupyter Notebook', None, 'Python'}

open source

{None, 'PHP', 'Java', 'TypeScript', 'Go', 'JavaScript', 'Ruby', 'C++'}

exercise practice

{'CSS', 'JavaScript', 'HTML'}
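The sets above only tell us which languages appear at all. A short sketch (not from the book) that counts the repositories per language for a single technology gives a better sense of each language's weight:

# How many repositories mentioning 'deep learning' use each language?
# na=False guards against repositories with a missing description.
mask = df['clean'].str.contains('deep learning', na=False)
print(df[mask]['language'].value_counts())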

Following the text analysis of the descriptions of the top technologies and the extraction of the programming languages used for them, we notice that deep learning projects are written mainly in Python (often in Jupyter Notebooks), that open source spans a much wider range of languages, from PHP and Ruby to Go and TypeScript, and that exercise-practice projects rely on the web stack of HTML, CSS, and JavaScript.

You can do a lot more analysis with this GitHub data.

Want to know how? You can check out our book Python Social Media Analytics to get a detailed walkthrough of these topics.


