Yesterday, the team at GitHub, along with its partners from Weights & Biases, introduced the CodeSearchNet challenge evaluation environment and leaderboard. The team is also releasing a large dataset to help data scientists build models for this task, as well as several baseline models that highlight the current state of the art.
Semantic code search involves retrieving relevant code given a natural language query. Unlike other information retrieval tasks, it needs to bridge the gap between the language used in code and natural language. Standard information retrieval methods also don't work effectively in the code search domain because there is usually little shared vocabulary between search terms and results. Evaluating methods for this task is difficult as well, since there are no substantial datasets created specifically for it.
To address these issues and to evaluate progress on code search, the team is releasing the CodeSearchNet Corpus and presenting the CodeSearchNet Challenge. The CodeSearchNet Challenge consists of 99 natural language queries and around 4k expert relevance annotations.
The CodeSearchNet Corpus
The CodeSearchNet Corpus contains around 6 million functions from open-source code spanning six programming languages: Go, Java, Python, JavaScript, PHP, and Ruby. To collect this large dataset of functions, the team used the TreeSitter infrastructure, a parser generator tool and incremental parsing library. The team is also releasing its data preprocessing pipeline for others to use as a starting point in applying machine learning to code. This data is not directly related to code search, but when paired with a related natural language description, it can help in training models.
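To give a rough idea of what parser-based function extraction looks like, here is a minimal sketch using the py-tree-sitter bindings. It assumes an older py-tree-sitter API and a locally cloned Python grammar, and it is an illustration only, not the team's actual pipeline.

```python
# Minimal sketch of extracting function definitions with py-tree-sitter.
# Assumes the tree_sitter package (older API) and a locally cloned
# tree-sitter-python grammar; not the CodeSearchNet pipeline itself.
from tree_sitter import Language, Parser

# Build a shared library from the vendored grammar (one-time step).
Language.build_library("build/languages.so", ["vendor/tree-sitter-python"])
PY_LANGUAGE = Language("build/languages.so", "python")

parser = Parser()
parser.set_language(PY_LANGUAGE)

source = b"def add(a, b):\n    return a + b\n"
tree = parser.parse(source)

def find_functions(node):
    """Recursively collect 'function_definition' nodes from the syntax tree."""
    if node.type == "function_definition":
        yield source[node.start_byte:node.end_byte].decode("utf8")
    for child in node.children:
        yield from find_functions(child)

for fn in find_functions(tree.root_node):
    print(fn)
```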
The CodeSearchNet Corpus contains automatically generated, query-like natural language for around 2 million functions. It also includes metadata indicating the original location where each function was found.
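The corpus is distributed as JSON-lines records, so iterating over it can be sketched as below. The file name and field names used here (such as func_name and url) are assumptions and should be checked against the released data.

```python
# Hypothetical sketch of iterating over corpus records stored as
# gzipped JSON-lines files; the file name and field names are assumptions.
import gzip
import json

def iter_records(path):
    """Yield one parsed JSON object per line of a .jsonl.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            yield json.loads(line)

for record in iter_records("python_train_0.jsonl.gz"):
    print(record.get("func_name"), record.get("url"))
    break
```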
CodeSearchNet Corpus collection
The team collects the corpus from publicly available, open-source, non-fork GitHub repositories and uses libraries.io to identify all projects that are used by at least one other project. They then sort these projects by 'popularity', as indicated by the number of stars and forks. Projects that do not have a license, or whose license does not allow redistribution of parts of the project, are removed.
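The selection logic described above can be approximated with a short sketch. The repository metadata fields (license, stars, forks) and the list of permissive licenses used here are illustrative assumptions, not the team's actual configuration.

```python
# Illustrative sketch of the project selection logic described above:
# keep repositories with a redistributable license, then rank by popularity.
# Field names and the license list are assumptions.
REDISTRIBUTABLE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}

def select_projects(repos):
    """Drop repos without a redistributable license, then sort by popularity."""
    licensed = [
        r for r in repos
        if r.get("license", "").lower() in REDISTRIBUTABLE_LICENSES
    ]
    # 'Popularity' is approximated here as stars plus forks.
    return sorted(licensed, key=lambda r: r["stars"] + r["forks"], reverse=True)

repos = [
    {"name": "example/project-a", "license": "MIT", "stars": 1200, "forks": 340},
    {"name": "example/project-b", "license": "", "stars": 90, "forks": 10},
]
print([r["name"] for r in select_projects(repos)])
```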
The team has also tokenized all of the functions across Go, Java, JavaScript, PHP, Python, and Ruby with the help of TreeSitter. To generate the training data for the CodeSearchNet Challenge, the team considers only those functions in the corpus that have documentation associated with them.
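For Python code specifically, the idea of keeping only documented functions can be illustrated with the standard ast module. This is a simplified stand-in for the TreeSitter-based pipeline, used here only to show how (docstring, code) pairs might be formed.

```python
# Simplified stand-in for the documented-function filter, using Python's
# standard ast module instead of TreeSitter; for illustration only.
import ast

source = '''
def documented(x):
    """Double the input value."""
    return x * 2

def undocumented(x):
    return x + 1
'''

pairs = []
for node in ast.walk(ast.parse(source)):
    if isinstance(node, ast.FunctionDef):
        docstring = ast.get_docstring(node)
        if docstring:  # keep only functions with an associated docstring
            pairs.append((docstring, ast.unparse(node)))

for doc, code in pairs:
    print(doc, "->", code.splitlines()[0])
```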
The CodeSearchNet Challenge
The team collected an initial set of code search queries for evaluating code search models. They started by collecting common Bing search queries with high click-through rates and then combined these with queries from StaQC. The team manually filtered out queries that were clearly 'technical keywords' to obtain a set of 99 natural language queries.
The team then used a standard Elasticsearch installation and their baseline models to obtain 10 results per query from the CodeSearchNet Corpus. They asked data scientists, programmers, and machine learning researchers to annotate these results for relevance to the query. To be evaluated on the CodeSearchNet Challenge, a method should return a set of results from the CodeSearchNet Corpus for each of the 99 pre-defined natural language queries.
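A keyword-matching retrieval step along these lines could be sketched with the official Elasticsearch Python client. The index name, field name, and example query used here are assumptions for illustration, not the challenge's actual configuration.

```python
# Hypothetical sketch of retrieving 10 candidate functions per query
# from an Elasticsearch index; index and field names are assumptions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def search_code(query, size=10):
    """Return the top `size` hits whose docstrings match the query text."""
    response = es.search(
        index="codesearchnet",
        query={"match": {"docstring": query}},
        size=size,
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]

results = search_code("convert int to string")
```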