Last week, word leaked that Google had agreed to license Reddit’s massive corpus of billions of posts and comments to help train its large language models. Now, in a recent Securities and Exchange Commission filing, the popular online forum has revealed that it will bring in $203 million from that and other unspecified AI data licensing contracts over the next three years.
Reddit’s Form S-1, published by the SEC late Thursday ahead of the site’s planned IPO, says the company expects $66.4 million of that data-derived value from LLM companies to come during the 2024 calendar year. Bloomberg previously reported the Google deal to be worth an estimated $60 million a year, or about $180 million of the $203 million total over three years, suggesting that the Google deal represents the vast majority of Reddit’s AI licensing revenue so far.
Google and other AI companies that license Reddit’s data will receive “continuous access to [Reddit’s] data API as well as quarterly transfers of Reddit data over the term of the arrangement,” according to the filing. That constant, real-time access is particularly valuable, the site writes in the filing, because “Reddit data constantly grows and regenerates as users come and interact with their communities and each other.”
“Why pay for the cow…?”
While Reddit sees data licensing to AI firms as an important part of its financial future, its filing also notes that free use of its data has already been “a foundational part of how many of the leading large language models have been trained.” The filing seems almost bitter in noting that “some companies have constructed very large commercial language models using Reddit data without entering into a license agreement with us.”
That acknowledgment highlights the still-murky legal landscape around AI companies’ penchant for scraping huge swathes of the public web for training purposes, a practice those companies defend as fair use. And Reddit seems well aware that AI models may continue to hoover up its posts and comments for free, even as it tries to sell that data to others.


