Like sniffing code? GitHub has a code search challenge just for you... • DEVCLASS

Like sniffing code? GitHub has a code search challenge just for you…

By Team Devclass

September 27, 2019


GitHub is calling on data scientists to help it tackle one of the most pressing issues facing tech – how to search for code for examination or reuse.

The Microsoft offshoot has launched the CodeSearchNet challenge to evaluate and accelerate progress on code search models.

OK, it might not be the most pressing issue facing tech, but as GitHub machine learning engineer, Hamel Husain, wrote in a blog announcing the challenge, “Searching for code to reuse, call into, or to see how others handle a problem is one of the most common tasks in a software developer’s day.”

Making it easier to sniff out relevant code could relieve developers of a large amount of grunt work – and associated frustration, given the limitations of current approaches – freeing them up to tackle other problems.

As the paper detailing the challenge puts it, “Semantic code search is the task of retrieving relevant code given a natural language query. While related to other information retrieval tasks, it requires bridging the gap between the language used in code (often abbreviated and highly technical) and natural language more suitable to describe vague concepts and ideas.”

But, Husain continued, “Search engines for code are often frustrating and never fully understand what we want, unlike regular web search engines.”
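To make the retrieval task concrete, here is a minimal sketch of a purely lexical code search: it splits identifiers into words and ranks snippets against a natural language query by bag-of-words cosine similarity. It is not one of the CodeSearchNet baseline models; the function names (`tokenize`, `cosine`, `search`) and the three sample snippets are assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase text and split it into words, breaking up code identifiers."""
    # Insert a space at camelCase boundaries, e.g. "readFile" -> "read File".
    text = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", text)
    # Keep alphanumeric runs only, which also splits snake_case on underscores.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query, snippets):
    """Return (score, snippet) pairs, best match first."""
    query_vec = Counter(tokenize(query))
    scored = [(cosine(query_vec, Counter(tokenize(s))), s) for s in snippets]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical corpus: snippets are stored as plain strings and never executed.
SNIPPETS = [
    "def read_json_file(path): return json.load(open(path))",
    "def sortListDescending(items): return sorted(items, reverse=True)",
    "def http_get(url): return requests.get(url).text",
]

if __name__ == "__main__":
    for score, snippet in search("read a json file", SNIPPETS):
        print(f"{score:.2f}  {snippet}")
```

A ranker like this only rewards shared words, so a query such as "parse a config document" would score zero against the first snippet even though it is arguably relevant. That is the gap between code vocabulary and natural language that the paper describes, and which learned models are meant to narrow.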

To spur researchers, GitHub has worked with machine learning tracking specialists Weights & Biases to release the CodeSearchNet Challenge evaluation environment and leaderboard, together with a “large” dataset to help data scientists build models, plus “several baseline models showing the current state of the art”.

The dataset includes “functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub.” According to GitHub, the dataset spans six million methods overall, a third of which include associated documentation.

The GitHub team have produced an initial set of code search queries, and had programmers “annotate the relevance of potential results”.

“We want to expand our evaluation dataset to include more languages, queries, and annotations in the future,” wrote Husain. “As we continue adding more over the next few months, we aim to include an extended dataset for the next version of CodeSearchNet Challenge in the future.”
