GitHub trials machine learning so you can mind your language

GitHub is trialling a machine-learning-powered system to identify the babel of languages across the code repo platform.

You might wonder whether this is a big deal – surely people know what language they’re using, and that’s all that matters.

But announcing the project this week, GitHub machine learning engineer Kavita Ganesan wrote, “When some code is pushed to a repository, it’s important to recognize the type of code that was added for the purposes of search, security vulnerability alerting, and syntax highlighting—and to show the repository’s content distribution to users.”

Filenames and extensions aren’t enough, she continued: some languages are associated with multiple extensions, while some extensions are ambiguous and used by multiple languages. And of course, some files aren’t given extensions.
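To see why an extension lookup falls short, consider a minimal sketch. The table below is hypothetical and lists only a few well-known clashes; it is not GitHub's data, and the function names are made up for illustration.

```python
import os

# Hypothetical lookup table: a handful of well-known extension clashes,
# chosen for illustration only (not taken from GitHub or Linguist).
EXTENSION_CANDIDATES = {
    ".h": ["C", "C++", "Objective-C"],
    ".m": ["Objective-C", "MATLAB"],
    ".pl": ["Perl", "Prolog"],
    ".py": ["Python"],
}


def candidates_for(filename):
    """Return the languages that a filename's extension could indicate."""
    _, ext = os.path.splitext(filename)
    return EXTENSION_CANDIDATES.get(ext.lower(), [])


def is_ambiguous(filename):
    """True when the extension maps to zero languages or to more than one."""
    return len(candidates_for(filename)) != 1


for name in ["main.py", "util.h", "solver.m", "Makefile"]:
    print(name, candidates_for(name), is_ambiguous(name))
```

Only `main.py` resolves to a single language here. The others either match several candidates or, like `Makefile`, have no extension to look up at all, which is the gap a content-based classifier is meant to fill.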

Furthermore, code snippets may be included in Readmes, issues and pull requests. It’s this last factor that particularly undermines GitHub’s existing language recognition tool, Linguist.

So, GitHub engineers have built a new tool, OctoLingua (which may or may not be really bad Latin for eight tongues), using Python, Keras and TensorFlow.

The model itself is described as a two-layer artificial neural network. It produces a “51-dimensional output which represents the predicted probability that the given code is written in each of the top 50 GitHub languages plus the probability that it is not written in any of those.”
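For readers wondering what a 51-dimensional probability output looks like in practice, here is a rough sketch in plain Python rather than Keras. Only the 50-plus-one output size comes from GitHub's description. The feature count, hidden-layer width, ReLU and softmax activations, and the random untrained weights are all assumptions made for illustration, and bias terms are left out for brevity.

```python
import math
import random

NUM_LANGUAGES = 50                # from the article: top 50 GitHub languages
NUM_OUTPUTS = NUM_LANGUAGES + 1   # plus one "none of these" class
NUM_FEATURES = 8                  # assumption: size of the input feature vector
HIDDEN_UNITS = 16                 # assumption: width of the hidden layer


def init_layer(n_in, n_out, rng):
    """Random, untrained weights: one row of n_in weights per output unit."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def dense(weights, inputs):
    """Weighted sum of the inputs for each output unit (no bias term)."""
    return [sum(w * x for w, x in zip(row, inputs)) for row in weights]


def relu(values):
    return [max(0.0, v) for v in values]


def softmax(values):
    """Turn raw scores into positive probabilities that sum to 1."""
    peak = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def predict(features, hidden_w, output_w):
    """Two-layer forward pass: hidden layer, then a 51-way softmax output."""
    hidden = relu(dense(hidden_w, features))
    return softmax(dense(output_w, hidden))


rng = random.Random(0)
hidden_w = init_layer(NUM_FEATURES, HIDDEN_UNITS, rng)
output_w = init_layer(HIDDEN_UNITS, NUM_OUTPUTS, rng)
probs = predict([1.0] * NUM_FEATURES, hidden_w, output_w)
print(len(probs), round(sum(probs), 6))
```

The last entry of `probs` plays the role of the "not any of those" class. With untrained weights the probabilities are close to uniform, so the sketch shows the shape of the output only, not a working classifier.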

The software was trained to recognise the top 50 languages on GitHub using files retrieved from Rosetta Code. This was augmented with additional sources for some languages that had only a limited number of files there.

Adding new languages beyond the initial 50 is “fairly straightforward”, requiring a bulk of files in the new language, which are run through the platform.

The Microsoft tentacle describes it as being at an “advanced prototyping stage”, with a classification engine that is robust and reliable, even if it does not yet support all the languages on GitHub. Recognising code snippets can be achieved with “a small modification to our machine learning engine.”

“It wouldn’t be too far fetched to take the model to the stage where it can reliably detect and classify embedded languages.”

Unsurprisingly, Ganesan said GitHub is considering whether to open source the project, and “would love to hear from the community”.