GitHub trials machine learning so you can mind your language

By Team Devclass

July 4, 2019

GitHub is trialling a machine-learning-powered system to identify the babel of languages used across the code repo platform.

You might wonder whether this is a big deal – surely people know what language they’re using, and that’s all that matters.

But announcing the project this week, GitHub machine learning engineer Kavita Ganesan wrote, “When some code is pushed to a repository, it’s important to recognize the type of code that was added for the purposes of search, security vulnerability alerting, and syntax highlighting, and to show the repository’s content distribution to users.”

Filenames and extensions aren’t enough, she continued: some languages are associated with multiple extensions, while some extensions are ambiguous and used by multiple languages. And of course, some files aren’t given extensions at all.
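The ambiguity is easy to see in a small sketch. The mapping below is a hand-picked, illustrative subset and not Linguist's actual data, and `candidates_for` is a hypothetical helper introduced only for this example.

```python
import os

# Illustrative, hand-picked subset of extensions (an assumption for this
# sketch; it is not the mapping Linguist or GitHub actually uses).
EXTENSION_TO_LANGUAGES = {
    ".py": ["Python"],
    ".cc": ["C++"],    # one language, several extensions...
    ".cpp": ["C++"],
    ".h": ["C", "C++", "Objective-C"],  # ...and one extension, several languages
    ".m": ["Objective-C", "MATLAB"],
    ".pl": ["Perl", "Prolog"],
}


def candidates_for(filename):
    """Return every language the filename's extension could indicate."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_TO_LANGUAGES.get(ext.lower(), [])


for name in ("matrix.m", "util.h", "Makefile"):
    print(name, "->", candidates_for(name) or "no extension to go on")
```

An extension lookup alone cannot settle `matrix.m` or `util.h`, and it has nothing to say about `Makefile`, which is the gap a content-based classifier is meant to fill.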

Furthermore, code snippets may be included in Readmes, issues and pull requests. It’s this last factor that particularly undermines GitHub’s existing language recognition tool, Linguist.

So GitHub engineers have built a new tool, OctoLingua (which may or may not be really bad Latin for eight tongues), using Python, Keras and TensorFlow.

The model itself is described as a two-layer artificial neural network. It produces a “51-dimensional output which represents the predicted probability that the given code is written in each of the top 50 GitHub languages plus the probability that it is not written in any of those.”
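As a rough illustration of what a 51-dimensional probability output looks like, here is a minimal pure-Python sketch of a two-layer forward pass. It is not GitHub's model: the input size, hidden width, ReLU activation, softmax output and random untrained weights are all assumptions made for the example, and only the 50-plus-one output layout comes from the description above.

```python
import math
import random

NUM_LANGUAGES = 50                # top 50 GitHub languages, per the article
NUM_OUTPUTS = NUM_LANGUAGES + 1   # one extra slot for "none of these"
NUM_FEATURES = 8                  # hypothetical input feature count
NUM_HIDDEN = 16                   # hypothetical hidden-layer width


def make_layer(n_in, n_out, rng):
    """Random, untrained weights: one row of n_in weights per output unit."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights, inputs):
    """Weighted sum for each output unit (biases omitted for brevity)."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def softmax(values):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, hidden_w, output_w):
    """Two layers: a ReLU hidden layer, then a softmax over 51 outputs."""
    hidden = [max(0.0, h) for h in dense(hidden_w, features)]
    return softmax(dense(output_w, hidden))


rng = random.Random(0)
hidden_w = make_layer(NUM_FEATURES, NUM_HIDDEN, rng)
output_w = make_layer(NUM_HIDDEN, NUM_OUTPUTS, rng)
features = [rng.random() for _ in range(NUM_FEATURES)]

probs = predict(features, hidden_w, output_w)
best = max(range(NUM_OUTPUTS), key=probs.__getitem__)
# The last index stands for "not one of the top 50 languages".
label = "other" if best == NUM_LANGUAGES else f"language #{best}"
print(len(probs), label)
```

The extra 51st slot is what lets a classifier of this shape decline to force an unfamiliar file into one of the 50 known languages.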

The software was trained to recognise the top 50 languages on GitHub using files retrieved from Rosetta Code. This was augmented with additional sources for languages where Rosetta Code offered only a limited number of files.

Adding new languages beyond the initial 50 is “fairly straightforward”, requiring a bulk of files in the new language, which are run through the platform.

The Microsoft tentacle describes it as being at an “advanced prototyping stage”, with a classification engine that is robust and reliable, even if it does not yet support every language on GitHub. Recognising code snippets can be achieved with “a small modification to our machine learning engine.”

“It wouldn’t be too far fetched to take the model to the stage where it can reliably detect and classify embedded languages.”

Unsurprisingly, Ganesan said GitHub is considering whether to open source the project, and “would love to hear from the community”.
