OpenAI: Training AI with copyrighted materials is inevitable
Submission to the UK House of Lords says use of protected material is unavoidable for training modern AI models | ChatGPT creator contends that the broad spectrum of copyrighted expressions is critical for creating effective AI | Lords urge greater use of existing licensing models.
OpenAI has said that it would be “impossible” to not train AI models using copyrighted materials.
In written evidence to the UK House of Lords Communications and Digital Select Committee’s inquiry into large language models (LLMs), sent on December 5, 2023 but released in the new year, OpenAI claimed to “respect the rights of content creators” and “actively” collaborate with them to improve creative opportunities.
However, “it would be impossible to train today’s leading AI models without using copyrighted materials,” said the company.
This is because copyright today covers virtually every sort of human expression, including blog posts, photographs, forum posts, scraps of software code, and government documents, OpenAI continued.
Limiting training data to public domain books and drawings created more than a century ago would not provide AI systems that meet the needs of today’s citizens, OpenAI added.
OpenAI also claimed to provide an “easy way” to block its GPTBot web crawler from accessing a site, and said it has put in place an opt-out process for creators who want to exclude their images from future DALL·E training datasets.
Asked for his response to the submission, The Earl of Devon, full name Charles Courtenay, a peer focusing on science and technology in the House of Lords, said that OpenAI’s training of its models on publicly available works “undoubtedly includes material protected by copyright and other IP rights, given such protection fosters the diversity and breadth of human intelligence.”
While OpenAI did not state the basis for its belief that AI training does not infringe copyright, Courtenay suggested that the partnership agreements OpenAI has with publishers likely include licence provisions, indicating a recognition of copyright considerations.
Courtenay, who is also a partner at Michelmores, highlighted what he called “encouraging developments” within OpenAI's approach mentioned in the submission, such as negotiations with the creative sector and the introduction of opt-outs.
The opt-out functionality “suggests that OpenAI accepts that they are obliged to allow rights holders to withhold their consent to the use of their creative output for the training of AI models,” Courtenay added.
Courtenay said these developments show that the IP regime can operate effectively to both regulate and support the important development of generative AI, while protecting the creative rights of individuals.
Lord Clement-Jones, full name Timothy Clement-Jones, believes UK law “forbids” the ingestion by generative AI/large language models of original content which is in copyright.
The UK has “an extremely well-proven and widely used licensing regime for copyright content,” he explained, adding that companies such as OpenAI should make use of it when wanting to use material for training purposes.
Doing so would ensure that AI developers have access to what they need and that rights holders are properly rewarded, Clement-Jones said.
OpenAI named in various lawsuits
In December 2023, The New York Times accused OpenAI and Microsoft of copyright infringement, alleging the pair had copied “millions” of its articles to train their chatbots.
The news group further alleged that OpenAI and Microsoft used this work to develop and commercialise their generative AI products without obtaining its permission.
In a blog post published on January 8, 2024, OpenAI expressed its “surprise and disappointment” at the lawsuit and accused the publication of “not telling the full story.”
The ChatGPT developer added that it hoped for a “constructive partnership” with the newspaper in the future.
OpenAI is also involved in a lawsuit filed in September 2023 by a group of novelists, playwrights and screenwriters.
The plaintiffs, including Michael Chabon and his wife Ayelet Waldman, David Henry Hwang, Matthew Klam, and Rachel Louise Snyder, sued OpenAI for allegedly incorporating their copyrighted works in datasets used to train its GPT models, which power ChatGPT.
The Authors Guild and 17 of its members, including novelist, former lawyer and politician John Grisham (author of The Pelican Brief), filed a separate complaint in September 2023, alleging that OpenAI’s models infringed their copyright by producing “accurately generated summaries” of their works when prompted.
Jodi Picoult (My Sister’s Keeper), George RR Martin (A Game of Thrones), Jonathan Franzen (The Corrections), and David Baldacci (The Camel Club) were also among the plaintiffs.
In July 2023, US comedian, actress and writer Sarah Silverman and two other authors sued OpenAI and Meta, alleging that the companies had copied content from the authors’ books to train AI software.