12 March 2024 | Copyright | Liz Hockley

Nvidia sued over ‘shadow library’ that trains AI models

Authors file class-action suits against Nvidia and Databricks in California | Complaints say LLMs were trained using notorious ‘Books3’ dataset containing thousands of copyrighted works.

Three authors have sued Nvidia and software company Databricks, alleging that the tech firms used a “shadow library” containing thousands of copyrighted works to train AI models.

Abdi Nazemian, Brian Keene and Stewart O’Nan filed two proposed class-action suits against Nvidia and Databricks at the US District Court for the Northern District of California on Friday, March 8.

The authors took aim at Nvidia’s NeMo Megatron series of large language models (LLMs), which they said was trained on a dataset that included books written by themselves and other class members “without consent, without credit, and without compensation”.

These copyrighted works included Like a Love Story by Nazemian, Ghost Walk by Keene and Last Night at the Lobster by O’Nan, according to the suit.

Library of pirated material ‘removed’

Nvidia allegedly announced the availability of its NeMo Megatron series in September 2022, with the models hosted on a website called Hugging Face.

On this website, it was stated that the models were trained on a dataset known as ‘The Pile’, which contained a component called ‘Books3’, the authors said.

They alleged that Books3 was derived from a copy of the contents of a “shadow library” website known as Bibliotik, which contained approximately 196,640 books—including their own.

Books3 was available on the Hugging Face website until October 2023, the lawsuit claimed, when it was removed “due to reported copyright infringement”.

“Nvidia has admitted training its NeMo Megatron models on a copy of The Pile dataset,” the authors alleged, telling the court that this meant that the tech firm “necessarily also trained its NeMo Megatron models on a copy of Books3, because Books3 is part of The Pile”.

An Nvidia spokesperson told the Wall Street Journal: “We respect the rights of all content creators and believe we created NeMo in full compliance with copyright law.”

In the separate suit against Databricks, the authors said that Books3 was also used to train a series of LLMs produced by MosaicML, a firm Databricks acquired in July 2023. In that case, Books3 was part of a dataset known as ‘RedPajama’, according to the complaint.

Both suits have been filed by Joseph Saveri Law Firm, which states on its website that it is “at the forefront of generative artificial intelligence litigation”.

Its founder Joseph Saveri has filed suits on behalf of clients challenging ChatGPT, GitHub, OpenAI and others with co-counsel Matthew Butterick—detailed on a dedicated website titled ‘LLM litigation’.

Interesting questions

Commenting on this latest development in the copyright battle over AI, Randy McCarthy, a patent attorney from Hall Estill, said:

“This class action lawsuit against Nvidia is similar to the other existing cases that have been filed against other AI content-generating systems in the areas of text and image generation.

“The question is to what extent do copyright owners have a right to restrict the use of their published works as training set data for a machine learning system? There are cases that have already determined that accessibility to digitally available works, by itself, is not sufficient to constitute copyright infringement.”

“For example, in the Google Books case, accessing and cataloguing the full content of the digital works was viewed as fair use under the Copyright Act and was not deemed as copyright infringement,” McCarthy continued.

“Does this same activity become a copyright infringement because an artificial neural network is used to perform the same actions?

“It will be interesting to find out the answer. Ultimately, it will likely require Congress to solve these various issues, as it did with the Digital Millennium Copyright Act (DMCA) in 1998.”

In Nazemian, Keene and O’Nan v Nvidia, and Nazemian, Keene and O’Nan v Databricks and Mosaic ML, the plaintiffs are represented by Joseph Saveri of Joseph Saveri Law Firm, as well as Matthew Butterick of Butterick Law and Laura Matson (pro hac vice pending) of Lockridge Grindal Nauen.
