Stack Overflow strives to protect community content from AI firms, striking mods say “re-enable the data dumps”

Stack Exchange, whose collection of Q&A sites includes the developer-favorite Stack Overflow, has quietly ended a long-standing policy of uploading its community-contributed data to the Internet Archive – causing striking moderators to add “the data dumps must be re-enabled” to their list of demands.

This was done, said CTO Jody Bailey, “to protect Stack Overflow data from being misused by companies building LLMs.” Large language models (LLMs) are AI systems trained on large volumes of text in order to answer queries. The content, though, is contributed under a Creative Commons share-alike (CC BY-SA 4.0) license, which specifically allows the public to “copy and redistribute the material in any medium or format” and to “remix, transform, and build upon the material for any purpose, even commercially.” The CC BY-SA 4.0 license for contributed content is specified in the Stack Overflow terms of service.

The Internet Archive data dump page for the Stack Exchange data states that “All user content contributed to the Stack Exchange network is cc-by-sa 4.0 licensed, intended to be shared and remixed.”

The data dump is normally refreshed quarterly, but the June 2023 dump is missing. When this was queried, Bailey answered that while working on the strategy to protect the data from AI companies, “we decided to stop the dump until we could put guardrails in place.” Bailey says the company is looking for a solution that “will allow individuals access to the data while preventing misuse by organizations looking to profit from the work of our community.” Other than Stack Exchange, we presume.

Stack Overflow’s striking moderators, who protested last week against being told to relax moderation of suspected AI-generated content, have now added restoration of the data dump to their list of demands, arguing that “The data dumps of Stack Exchange content serve to further the goals of free knowledge-sharing. The content posted to the Stack Exchange network was done so to further that goal and with the understanding that it would be freely distributed to anyone seeking knowledge.”

Regarding the impact of the strike, the moderators claim that “113 out of 538 total Stack Exchange network moderators” have signed an open letter stating that they will cease moderation “until this matter is resolved satisfactorily.” The mods have elected three representatives to negotiate with Stack Exchange Inc.

The Stack Overflow drama has increased interest in an alternative Q&A site called Codidact, which was set up several years ago by former Stack Exchange moderators, among others, to be “free from the politics and shenanigans of private, profit-focused companies.” Codidact is owned by a non-profit foundation.

In regard to AI content, Codidact states that “the use of AI-generated content, particularly Large Language Model (LLM) generated content, constitutes an abuse of the platform, and moderators are empowered to remove such content and issue warnings as they see fit.” What Codidact lacks is both traffic and content. It is hard to unseat a deeply embedded service like Stack Overflow, and in fact it is AI tools such as GitHub Copilot and OpenAI's ChatGPT that have unsettled the company, not competing Q&A sites.