4 June 2024 | Features | Artificial Intelligence | Giles Parsons

Sticking cheese to pizza with glue: What Reddit's data deal tells us about AI and copyright

Data licensing has become big business but companies like Reddit are selling in a market of unknowns, says Giles Parsons of Browne Jacobson.

AI training raises many complex questions about copyright law, and some of the issues that Reddit is facing make for an interesting tale.

Large language models (LLMs) are AI models that process and generate language. GPT-3.5 and GPT-4, both developed by OpenAI, are good examples.

They are the most famous LLMs because they are used by OpenAI’s own ChatGPT and by Microsoft’s Copilot. Google, Meta, Anthropic and others all have their own LLMs.

LLMs are expensive to train; the processing power alone can cost hundreds of millions of dollars. Training LLMs also requires data. The volume of data is really important, and the quality of data is also highly prized.

Reddit is a website where people can post questions, and create and share answers and other content, including through links. It has more than 82 million daily unique viewers and over 16 billion posts and comments.

‘Karma’ content

Those posts can be ‘upvoted’ by other users, and in doing so, accumulate what Reddit calls ‘karma’. If a link gets a lot of karma, it is something that Reddit’s users have found helpful (or funny), and so can be regarded as good content. So, Reddit is full of content itself but, importantly, it also contains signposts about a substantial amount of good content elsewhere on the internet.

Of course, why the content was upvoted is important; the top Google result for “cheese not sticking to pizza” is a Reddit thread recommending adding glue to the sauce, which may be why Google’s AI briefly gave the same advice!

When developing GPT-2, OpenAI recognised that Reddit can be used as a content guide, and explained in its 2019 paper how it implemented that:

“… We scraped all outbound links from Reddit, a social media platform, which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting, educational, or just funny. The resulting dataset, WebText, contains the text subset of these 45 million links”.
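The heuristic in that quotation can be illustrated with a minimal sketch. This is not OpenAI's code: the `Post` record, the sample URLs, the rule for recognising a Reddit-internal link and the de-duplication step are all assumptions made for illustration. Only the threshold of 3 karma comes from the quoted passage.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

MIN_KARMA = 3  # threshold taken from the quoted passage


@dataclass(frozen=True)
class Post:
    """Hypothetical record of a Reddit link post (not a real Reddit API type)."""
    url: str
    karma: int


def is_outbound(url: str) -> bool:
    """Assumed rule: a link is outbound if its host is not reddit.com or a subdomain."""
    host = urlparse(url).netloc.lower()
    if not host:
        return False
    return not (host == "reddit.com" or host.endswith(".reddit.com"))


def select_outbound_links(posts, min_karma=MIN_KARMA):
    """Return outbound URLs with at least min_karma, keeping first occurrences only."""
    seen = set()
    selected = []
    for post in posts:
        if post.karma >= min_karma and is_outbound(post.url) and post.url not in seen:
            seen.add(post.url)
            selected.append(post.url)
    return selected


# Made-up sample data, for illustration only.
sample = [
    Post("https://example.com/pizza-guide", 12),
    Post("https://example.org/low-score", 1),               # below the threshold
    Post("https://www.reddit.com/r/Pizza/comments/abc", 40),  # internal link
    Post("https://example.com/pizza-guide", 5),              # duplicate
]
print(select_outbound_links(sample))
```

The sketch also shows the weakness noted above: karma records that users upvoted a link, not why they did so, so a joke can pass the filter as easily as good advice.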

An expanded version of WebText was also used to train GPT-3, where it accounted for 22% of the weight of the training mix. WebText was also used as a predictor of other higher-quality documents, and so was used to filter ‘Common Crawl’, which is a huge repository of ‘webcrawl’ data.

How Reddit differs from the NYT

OpenAI has not stated what was used to train its most recent models, GPT-4 and GPT-4o. That may, in part, be because the success of GPT-3 drew attention and lawsuits alleging that using copyright material to train AI is copyright infringement. OpenAI is currently embroiled in several disputes in the US, including class action lawsuits and a claim from the New York Times (NYT).

In such circumstances, one can imagine OpenAI’s lawyers counselling against further publications that explain sourcing. Of course, the lack of disclosure could also be because they want to keep information about their training data confidential, or there may be other reasons.

Reddit’s COO Jen Wong has threatened legal proceedings against AI companies that use Reddit data to train their models. Unlike the NYT’s content, Reddit’s content was created not by paid employees and contractors but by its users (again proving the adage that if you are not paying for something, you are the product).

Reddit’s terms allow it to license user-created content onwards, as do the terms of many other sites hosting user-generated content.

However, Reddit’s terms do not assign copyright from the user to Reddit, and in the UK a non-exclusive licensee can only sue for copyright infringement if the requirements of section 101A of the Copyright, Designs and Patents Act 1988 are met.

Those requirements include that the licence must be in writing and, importantly, signed. Section 101A would also normally require Reddit to join the copyright owner(s) (ie, its relevant user(s)) to any proceedings.

This assumes that the text of posts is taken. If it is only hyperlinks which are being taken, then they may not be copyright works, and if they are, most will not have been created by Reddit or its users.

If this were litigated in the UK, Reddit might argue that Reddit, as a collection of posts, is itself protected by database rights. The posts and the links within them could be said to be independent data, arranged in a methodical way, and Reddit may contend that there has been substantial investment in presenting the contents of the database.

Alternatively, it may be more straightforward for Reddit to argue that using Reddit to train an AI breaches its user agreement and so is simply a breach of contract.

Reddit’s key advantage

Reddit is not the biggest social media platform, but some of the largest platforms are owned by tech giants that are training LLMs themselves, and so may not want to argue that training data must be paid for. Reddit does not have that concern, and has secured fee-paying licences from AI businesses.

For example, in February, it was announced that Reddit had entered into a licensing agreement with Google, which Reuters reported to be worth about $60 million per year. Reddit’s IPO prospectus said that in January 2024 it had entered into licences worth $203 million in aggregate, with terms ranging from two to three years and a minimum of $66.4 million in 2024. The prospectus reported losses in 2023 of $90.8 million on revenues of $804 million, so these licences give Reddit a potentially significant revenue stream at minimal cost.
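To put those figures in proportion, the arithmetic can be sketched as follows. The inputs are the numbers quoted above; the comparisons drawn from them (a share of 2023 revenue, a share of the 2023 loss, and a simple average per year of contract term) are illustrative assumptions, not calculations made in the prospectus.

```python
# Figures quoted above, in millions of US dollars.
CONTRACT_VALUE_M = 203.0  # licences entered into in January 2024
MIN_2024_M = 66.4         # minimum attributed to 2024
REVENUE_2023_M = 804.0
LOSS_2023_M = 90.8


def percent(part: float, whole: float) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# The 2024 minimum compared with 2023 revenue and with the 2023 loss.
share_of_revenue = percent(MIN_2024_M, REVENUE_2023_M)
share_of_loss = percent(MIN_2024_M, LOSS_2023_M)

# Simple averages if the whole value were spread evenly over a
# two-year or a three-year term (an assumption; actual terms vary).
avg_per_year_two_year_term = CONTRACT_VALUE_M / 2
avg_per_year_three_year_term = CONTRACT_VALUE_M / 3

print(f"2024 minimum as a share of 2023 revenue: {share_of_revenue:.1f}%")
print(f"2024 minimum as a share of 2023 loss: {share_of_loss:.1f}%")
print(f"Average per year over two years: ${avg_per_year_two_year_term:.1f}m")
print(f"Average per year over three years: ${avg_per_year_three_year_term:.1f}m")
```

On these assumptions, the 2024 minimum is a modest fraction of annual revenue but would offset most of the 2023 loss.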

That same prospectus discussed Reddit’s AI licensing intent:

“Some companies have constructed very large commercial language models using Reddit data without entering into a licence agreement with us. While we plan to vigorously enforce against such entities, such enforcement activities could take years to resolve, result in substantial expense and divert management’s attention and other resources, and we may not ultimately be successful.”

Sure enough, on May 16, 2024, Reddit announced a further licensing deal with OpenAI.

The fact that a deal has been reached is perhaps not surprising. Silicon Valley is a small world; Sam Altman, the CEO of OpenAI, previously served on Reddit’s board, was briefly Reddit CEO, and owns stock in Reddit.

This is part of a broader trend of AI licensing deals. For example, OpenAI has recently entered into a deal with News Corp, adding to previous deals with the Financial Times, Associated Press, Le Monde and Prisa Media. Could the NYT agree on a price with OpenAI?

The NYT has already spent $1m on the litigation during the first quarter of 2024.

‘Hard fought’ cases to come

There are also considerations that go beyond IP law. Reddit’s prospectus said that the US Federal Trade Commission is conducting a non-public inquiry into its licensing of user data. Tech giants may also run into competition law issues if they try to prevent third parties from using data to train AI.

I will end with an apology, because the most boring conclusion to any article is “watch this space”. 

But any takeaways and conclusions will be temporary when the story is evolving this quickly. A huge number of complex issues still need to be resolved as court cases work their way through, as the AI Act comes into force in the EU, and as other countries work out what regulation they will impose.

What I can say is that licensing AI training data has gone from nothing to big business as AI has taken off, so all these issues will be hard fought.

Giles Parsons is a partner at Browne Jacobson.
