AI coding using open source models has advantages in transparency and customizability, but hard questions about licensing and reliability were not fully answered at a QCon London session presented by Loubna Ben Allal, a machine learning engineer at AI collaboration company Hugging Face.
QCon London is a vendor-neutral conference for software developers and architects, with two tracks this year on how AI impacts software development. Will AI put developers out of a job? Few here think so; but few doubt the impact of the technology.
AI for coding came to the fore when GitHub introduced Copilot in 2021, said Ben Allal. “This was a huge breakthrough in the field because this model was so much better than all the other code completion models before it.”
There were issues though with Copilot and also with Amazon CodeWhisperer and other similar approaches, said Ben Allal. “They were only available through an API, so you don’t have the model, you can’t use the model to fine-tune it on your own use case. You also don’t have information on the data that was used to train these models, so there isn’t a lot of data transparency, and the code to do the training and data processing is not available.”
Ben Allal is on the team for BigCode and StarCoder, described as “an open scientific collaboration run by Hugging Face and ServiceNow Research, focused on open and responsible development of LLMs for code.”
This work is the product of the community, she said, and scores “pretty high” when compared to other AI coding resources. “The pillars of this project are three. Full data transparency. For example, if you want to use our models, you can know exactly on which data they were trained. This data is public and you can inspect it. We also open source the code for processing the data sets and training the models, to make our work reproducible, and the model weights are released with a commercially-friendly license,” she said.
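The released weights can be tried directly with the Hugging Face transformers library. The snippet below is a minimal sketch, assuming the bigcode/starcoder checkpoint on the Hub (accepting the model license there may be required); any smaller open code model could be substituted.

```python
# Minimal sketch: code completion with the released StarCoder weights via the
# Hugging Face transformers library. Assumes the bigcode/starcoder checkpoint
# (license acceptance on the Hub may be required).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```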
How was the training data sourced? “We basically scraped all of GitHub and then we filtered the data sets for licenses we can use and then we did additional filtering like removing files that look similar,” said Ben Allal. There is also an opt-out; owners of repositories who do not want their code to be used can complete a form and “we’ll make sure to not use your data,” she told attendees.
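As an illustration of that kind of processing, the sketch below filters files by a placeholder license allowlist and drops duplicates by hashing normalized contents. It is a simplification for illustration only, not the actual BigCode pipeline.

```python
import hashlib

# Illustrative only: a simplified license filter and duplicate check,
# not the actual BigCode / The Stack processing pipeline.
PERMISSIVE_LICENSES = {"mit", "apache-2.0", "bsd-3-clause"}  # placeholder allowlist


def normalize(source: str) -> str:
    """Strip whitespace differences so near-identical files hash the same."""
    return "\n".join(line.strip() for line in source.splitlines() if line.strip())


def filter_files(files, seen_hashes=None):
    """Keep permissively licensed files and drop duplicates.

    files: iterable of dicts like {"license": "mit", "content": "..."} (assumed format).
    """
    seen_hashes = set() if seen_hashes is None else seen_hashes
    kept = []
    for f in files:
        if f["license"].lower() not in PERMISSIVE_LICENSES:
            continue
        digest = hashlib.sha256(normalize(f["content"]).encode()).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(f)
    return kept
```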
Hugging Face has released the free Hugging Chat, which is “like ChatGPT, but only uses open source models,” said Ben Allal.
QCon attendees had some hard questions though. “All the code that you suck in somehow from GitHub, you say you take care that the licenses are OK. So I guess there will be no GPL code in your large language models? But what about other licenses?” asked one, noting that “almost all the licenses I know say that you must leave intact the copyright header and that you must add the license file to the code that you’re producing.”
Ben Allal said that when results are generated, the data set is checked and “if we find that it is an exact copy of something that was in the data set, we have a red alert and if you click on the link you can find exactly which repository that was from and then you can attribute the author.”
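A minimal sketch of what such an exact-match check could look like, assuming a prebuilt index from content hashes to source repositories (the actual tooling indexes the full training set and is more involved):

```python
import hashlib

# Illustrative sketch of an exact-copy check against an indexed training set.
# The index maps a hash of the code to its source repository (assumed structure).


def build_index(training_files):
    """training_files: iterable of {"content": str, "repo": str} dicts (assumed format)."""
    return {
        hashlib.sha256(f["content"].strip().encode()).hexdigest(): f["repo"]
        for f in training_files
    }


def check_generation(generated, index):
    """Return the source repository if the generated code is an exact copy, else None."""
    digest = hashlib.sha256(generated.strip().encode()).hexdigest()
    return index.get(digest)


# Example: raise a "red alert" with a link back to the original repository.
index = build_index(
    [{"content": "def add(a, b):\n    return a + b", "repo": "github.com/example/utils"}]
)
match = check_generation("def add(a, b):\n    return a + b", index)
if match:
    print(f"Red alert: exact copy of code from {match}; attribute the author.")
```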
There is an underlying question, though: at what point can generated code be considered a copy of someone else’s code, and when does it become a new creation not subject to the original license?
Does Hugging Face have an agreement with GitHub that it can scrape the data? “There’s no agreement between us and GitHub because we only use the repositories that are public,” said Ben Allal.
How does Hugging Face identify code of low quality that perhaps should not be included? It is a hard problem, Ben Allal indicated. The team experimented with training only on five-star repositories, but “this significantly reduced the size of the data. And we ended up with a model that was the worst of all the models we trained.” The team does use some filters to remove auto-generated files, she said, as well as filters for secrets or personally identifiable data.
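The sketch below shows what such filters could look like, with simplified regular expressions standing in for the project’s actual auto-generated-file and secrets/PII detection tooling.

```python
import re

# Illustrative filters only: simplified stand-ins for dedicated
# auto-generated-file and secrets/PII detection tooling.
AUTO_GENERATED_MARKERS = re.compile(
    r"(auto-?generated|do not edit|generated by)", re.IGNORECASE
)
SECRET_PATTERNS = re.compile(
    r"(api[_-]?key|secret|password)\s*[:=]\s*['\"][^'\"]+['\"]", re.IGNORECASE
)
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def should_keep(source: str) -> bool:
    """Drop files that look auto-generated or contain obvious secrets or emails."""
    header = "\n".join(source.splitlines()[:5])  # markers usually sit near the top
    if AUTO_GENERATED_MARKERS.search(header):
        return False
    if SECRET_PATTERNS.search(source) or EMAIL_PATTERN.search(source):
        return False
    return True
```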
Another question related to what happens when the model is trained on data that is itself generated by AI. Could the model collapse? This is in reference to a study which found that “as a model is trained recursively on data generated from previous generations of itself over time, its performance degrades until the model eventually becomes completely useless.”
The answer was inconclusive. “We haven’t seen that happen … maybe in the future,” said Ben Allal.