Stack Exchange restricts access to dump of user-contributed data, critics complain this contradicts license

By Tim Anderson

July 30, 2024

Stack Exchange, whose best-known site is the developer question and answer resource Stack Overflow, will restrict access to its user-contributed data dump behind a login and agreement not to use the content to train AI models, despite the Creative Commons license that allows the public to “remix, transform, and build upon the material for any purpose, even commercially,” subject to attribution.

The company formerly posted user-contributed data to the Internet Archive every three months, most recently in April 2024. The data is free to download, and accompanying text notes that “all user content contributed to the Stack Exchange network is CC-by-SA 4.0 licensed, intended to be shared and remixed.” The text does note the requirement for attribution, including the author and a link back to the original question.

Now Stack Exchange has informed users about a change to its data dump process, with the policy updated late last week in partial response to contributor complaints. The key changes are that the data dump will be hosted on the Stack Exchange site and accessed via the user profile, which requires login, and that downloaders must agree that “the file is being provided to me for my own use and for projects that do not include training a large language model [LLM].”

The update states that both the product and legal teams have signed off on this modified language.

Users who do not comply may have their access to future downloads removed.

An earlier proposed version of the download agreement required users to agree to “use this file for non-commercial use”; this has now been narrowed so that the restriction covers only LLM training.

The post has been received negatively by contributors, with one highly upvoted comment claiming that the policy does not comply with the CC-BY-SA 4.0 license, specifically the part that says “you may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.”

The wording of the download agreement also appears to be at odds with the rest of the announcement, which states that “We are requesting that if you intend to use the dump for a commercial purpose, you consider joining the socially responsible AI movement and giving back to the community.” That is a request, whereas the click-to-agree conditions are a requirement.

An FAQ on the matter says that “we are attempting to protect the long-term viability of the Stack Exchange network” and complains that “companies have scraped or otherwise ingested Stack Overflow and Stack Exchange data to train models without proper attribution.”

In February Stack Overflow agreed to integrate with Google’s Gemini for “new AI-powered features” both on Stack Overflow and in Gemini’s output.

A prominent Stack Exchange member set out a plea to “save the data dump,” stating that “they’re still selling the data dump to genAI – they just don’t want genAI companies to get it for free. They’re capitalising on the community’s data, while making it harder for the same community to own our own data.”

It is obvious that developers using AI to answer coding questions or to generate code are less likely to visit the Stack Overflow site, causing a decline in traffic. It is also obvious that the commercial value of Stack Overflow content is diminished if the same content is freely available, even for commercial use.

These are business concerns for Stack Exchange, but they do not change the license under which content is contributed, which is CC-BY-SA 4.0.

There are also unresolved questions throughout the industry regarding when AI-driven output is new content, and when it is more akin to search results that require attribution.

Another issue is that as fewer developers use the site directly, the number and quality of answers will diminish, reducing its value for LLM training and in general.

Might the data dump end up on the Internet Archive regardless? “You do realize that there is nothing you can do to prevent your community from simply maintaining the archive.org mirror for you?” said a developer, with the response from Philippe Beaudette, VP of community, “I do, yes. My hope is that over time, they see that there’s no need.”

We have asked Stack Exchange for further comment on the new policy.
