Developer Q&A site Stack Overflow will partner with OpenAI, with ChatGPT using Stack Overflow data and OpenAI models coming to OverflowAI, a commercial service – while a Stack Overflow contributor with more than 1 million reputation points has confessed to posting around 1200 answers “based on generative AI content,” contrary to the site’s guidelines.
Stack Overflow’s partnership with OpenAI is non-exclusive and echoes the one it struck with Google earlier this year.
OpenAI will use OverflowAPI (not to be confused with OverflowAI) to train its models on Stack Overflow’s public dataset. There is also a hint that OpenAI is paying a significant amount, as the release states that the deal will “enable Stack Overflow to continue to reinvest in community-driven features.”
GitHub Copilot uses OpenAI technology and models, so it is likely that the deal will improve the integration between Copilot’s coding features and Stack Overflow answers.
The relationship between Stack Overflow and AI is a complex one. The availability of AI assistance within code editors and elsewhere appears to have reduced traffic to Stack Overflow, which may eventually diminish the value of the data, since it depends on a high level of activity.
Further, use of generative AI to post Stack Overflow answers is banned by policy, and the company cites human validation, where answers are upvoted by users, as one of the benefits of its data versus simply training models on public code. “Users who ask questions on Stack Overflow expect to receive an answer authored and vetted by a human,” states the help article which expands on the policy of banning AI responses, adding that some now come to the site precisely because AI has failed to answer their question.
Despite this ban, there are plenty of cases where contributors post AI-generated answers, not least VonC, a site member for more than 15 years with a reputation score of over a million. A few days ago he confessed that between March 2023 and April 2024 he posted around 1850 answers of which “about two-thirds were based on generative AI content.” He said that he will no longer use AI tools and will rely “solely on my own knowledge and expertise in future contributions.” The AI-generated answers have now been deleted by Stack Overflow moderators.
Inviting developers to use AI for coding problems, while banning it when they try to help on Stack Overflow, is a difficult balancing act, particularly when all contributions are freely made by the community. On the other hand, there are risks in training AI on AI-generated data, and according to a study “there is emerging evidence suggesting that retraining a generative AI model on its own outputs can lead to various anomalies in the model’s later outputs.”
Other issues include whether Stack Overflow may hasten its own decline by integrating with services that do not require developers to visit its site. Some contributors are also concerned that they get no benefit from the OpenAI deal, or want to opt out of having their content feed its answers.
Stack Overflow’s choices are constrained by the fact that answers are posted under the Creative Commons Attribution-ShareAlike license, which means that the content can be remixed and transformed for any purpose, subject to attribution, and that the transformed content is subject to the same license. Deals with AI companies such as OpenAI and Google may help ensure that attribution, though the nature of AI processing means that it is not always obvious how one specific piece of data has been used.