Like sniffing code? GitHub has a code search challenge just for you…

GitHub is calling on data scientists to help it tackle one of the most pressing issues facing tech – how to search for code for examination or reuse.

The Microsoft offshoot has launched the CodeSearchNet challenge to evaluate and accelerate progress on code search models.

OK, it might not be the most pressing issue facing tech, but as GitHub machine learning engineer Hamel Husain wrote in a blog post announcing the challenge, “Searching for code to reuse, call into, or to see how others handle a problem is one of the most common tasks in a software developer’s day.”

Making it easier to sniff out relevant code could relieve developers of a large amount of grunt work – and associated frustration, given the limitations of current approaches – freeing them up to tackle other problems.

As the paper detailing the challenge puts it, “Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas.”

But, Husain continued, “Search engines for code are often frustrating and never fully understand what we want, unlike regular web search engines.”
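To make the gap between query language and code language concrete, here is a minimal sketch of a naive keyword-overlap search in Python. It is not one of the CodeSearchNet baselines. The three-function corpus and the names `CORPUS`, `tokenize` and `search` are invented purely for illustration.

```python
import re

# Hypothetical mini-corpus of (function name, docstring) pairs, invented for illustration.
CORPUS = [
    ("read_file", "Read the whole file and return its contents as a string."),
    ("parse_json", "Parse a JSON string into a dictionary."),
    ("http_get", "Send an HTTP GET request and return the response body."),
]


def tokenize(text):
    """Lowercase the text and split it into a set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def search(query, corpus=CORPUS):
    """Rank functions by how many query tokens appear in their name or docstring."""
    query_tokens = tokenize(query)
    scored = []
    for name, doc in corpus:
        overlap = len(query_tokens & (tokenize(name) | tokenize(doc)))
        scored.append((overlap, name))
    # Highest overlap first; ties are broken alphabetically by name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    # Drop functions that share no tokens with the query.
    return [name for overlap, name in scored if overlap > 0]


if __name__ == "__main__":
    print(search("parse json string"))
    print(search("load text from disk"))
```

The first query shares words with the documentation and ranks `parse_json` first. The second returns nothing at all, even though `read_file` is plainly what the user wants, because none of its words appear in the code or its docstring. Closing that kind of vocabulary mismatch is what learned code search models are meant to do.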

To spur researchers, GitHub has worked with machine learning tracking specialists Weights & Biases to release the CodeSearchNet Challenge evaluation environment and leaderboard, together with a “large” dataset to help data scientists build models, plus “several baseline models showing the current state of the art”.

The dataset includes “functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub.” According to GitHub, the dataset spans six million methods overall, a third of which include associated documentation.

The GitHub team have produced an initial set of code search queries, and had programmers “annotate the relevance of potential results”.

“We want to expand our evaluation dataset to include more languages, queries, and annotations in the future,” wrote Husain. “As we continue adding more over the next few months, we aim to include an extended dataset for the next version of CodeSearchNet Challenge in the future.”