Microsoft subsidiary GitHub surprised the coding community this week with the launch of a code-completion tool that aims to do a little more than offer function arguments in a drop-down menu. GitHub Copilot is an extension for Visual Studio Code that is meant to help developers focus on the actual problem they’re trying to solve by suggesting whole code lines and even full functions.
Copilot isn’t Microsoft’s first try at ML-aided coding. Visual Studio IntelliCode, which Microsoft has been promoting since 2018, already uses machine learning models trained on GitHub repositories and developers’ local code to provide bespoke code recommendations, formatting, and argument completion.
For the GitHub Copilot plugin, however, the company takes things a step further by letting the “AI pair programmer” suggest everything from lines to whole functions for its human counterparts. Similar capabilities are already available via tools like Tabnine and Kite. GitHub has the advantage of an already large user base to get its own service going, as well as the support of OpenAI.
According to OpenAI, Copilot makes use of a new AI system called OpenAI Codex to translate natural language into code. This isn’t meant to be the only capability of the contraption, though devs will have to wait until “later this summer”, when Codex is released through the OpenAI API, to find out what else it entails. Codex is described as a descendant of the language model GPT-3, which is exclusively licensed to Microsoft and came under fire for, among other things, the cost and potential environmental impact associated with its creation and use.
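To illustrate the natural-language-to-code idea, the interaction typically looks like this: the developer writes a plain-English prompt as a comment or docstring, and the model proposes a matching implementation. The example below is invented for illustration, not actual Copilot output:

```python
# Developer writes a natural-language prompt as a comment:
# "return the n largest values in a list, sorted in descending order"

# A Codex-style model might then suggest a completion like this:
def n_largest(values, n):
    """Return the n largest values in `values`, sorted descending."""
    return sorted(values, reverse=True)[:n]

print(n_largest([3, 1, 4, 1, 5, 9, 2], 3))  # [9, 5, 4]
```

In practice the model proposes such completions inline in the editor, and the developer accepts, edits, or rejects them.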
When it comes to computational cost, Copilot — and therefore Codex — seems to follow a similar trajectory. GitHub states that the current preview phase is restricted because of the “state-of-the-art AI hardware” required for the project. Once the free preview phase is over, the company plans to build a commercial version, which should be “available as broadly as possible”.
To arrive at its suggestions, Copilot is trained on “a selection of English language and source code from publicly available sources, including code in public repositories on GitHub”. While some of the first feedback on the project questioned the legitimacy of this approach, GitHub states that “training machine learning models on publicly available data is considered fair use across the machine learning community”. However, the company also points out that this is “a new space” with lots of discussions still to be had, so we’ll see how this argument holds up in the coming months.
This will be especially interesting to see, given that there is a slight chance of the tool offering suggestions that are an exact copy of some code from the training set. To make sure this doesn’t turn into a problem, GitHub is building an origin tracker to point out occurrences of verbatim code — putting the onus on the user to decide if it’s fair use or not.
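GitHub hasn’t published how the origin tracker works. As a rough sketch of the underlying idea, a verbatim check could scan a suggestion for long character runs that also appear in known source files; the function name, threshold, and corpus below are all invented for illustration:

```python
def has_verbatim_overlap(suggestion: str, corpus: list[str], min_len: int = 40) -> bool:
    """Hypothetical check: flag a suggestion if any window of
    `min_len` characters appears verbatim in a known source file."""
    for start in range(len(suggestion) - min_len + 1):
        window = suggestion[start:start + min_len]
        # A real system would use an index rather than a linear scan.
        if any(window in source for source in corpus):
            return True
    return False
```

A flagged suggestion would then be surfaced to the user along with its apparent origin, leaving the fair-use judgment to them, as described above.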
Ownership of code created with Copilot has been a source of discussion on social media. GitHub assigns the rights to the person writing the code, since “GitHub Copilot is a tool, like a compiler or a pen”. This also means GitHub doesn’t accept responsibility for code written with Copilot, so the tool doesn’t exempt devs from testing and reviewing their code.
Those curious to see what pair programming with Copilot feels like can sign up for the waiting list through GitHub. Developers need to be comfortable sharing information about events in VS Code tied to their GitHub user account, as well as code snippets; agreeing to this additional telemetry collection is a prerequisite for joining the line. According to GitHub, the data is used to detect abuse and to improve the Copilot VS Code extension as well as “related GitHub products”.