12 July 2023
Marisa Woutersen

Google accused of 'stealing everything ever created and shared on the internet' to create Bard

Eight individuals have accused Google of stealing emails, creative works, and photographs to build its commercial AI products | Bard complicit in ‘data theft from millions’ | Public discourse efforts over data scraping derided by complaint.

Google has been accused of "stealing everything ever created and shared on the internet" to train its chatbot Bard, in a class action lawsuit.

Eight individuals, who include an author, an actor, two minors and users of Google accounts, claimed that Google, parent company Alphabet, and Google DeepMind used creative works, photographs, and even emails to train their AI products.

The lawsuit, filed in the US District Court for the Northern District of California on July 11, 2023, alleged that Google harvested vast amounts of data without notice or consent to develop the chatbot, which was released to compete with OpenAI's ChatGPT.

Google's data collection practices have violated privacy rights and undermined consumer protection interests, claimed the plaintiffs.

“Alongside property and privacy rights, users retain copyright interests over their unique and original content posted online. This content includes text, images, music, video content, and other forms of creative expression, all of which fall under the purview of copyright law,” said the complaint.

The method Google allegedly used to obtain data to train its AI systems is known as web scraping.

This practice, claimed the lawsuit, “nullifies the concept of ‘fair use’, a critical aspect of copyright law designed to allow limited use of copyrighted material without permission for purposes like commentary, criticism, news reporting, and scholarly reports.”

AI bots such as Bard rely on using personal data to enable human-like communication skills, the complaint said, while creative and expressive works enable these systems to have artistic capabilities.

The complaint contends that Google's actions not only infringed upon individuals' rights but also gave the company an unfair advantage over competitors who lawfully acquire AI training data.

Bard ‘trained on protected works’

The complaint cites MusicLM, a generative AI tool with text-to-music capabilities that uses Google AI, which it said was trained on 280,000 hours of music from the Free Music Archive. The archive offers free access to openly licensed, but still copyrighted, original music.

In January 2023, Google had “no immediate plans” for the release of MusicLM due to ethical concerns, including “a tendency to incorporate copyrighted material from training data into the generated songs.”

However, it released a limited version publicly on May 10, 2023. Products like MusicLM violate copyright law by creating “tapestries of coherent audio from works they ingest in training, thereby infringing the United States Copyright Act’s reproduction right,” said the complaint.

Additionally, the C-4 dataset, created by Google in 2020, is taken from the Common Crawl dataset. The Common Crawl dataset is a massive collection of web pages and websites consisting of petabytes of data collected over 12 years, including raw web page data, metadata extracts, and text extracts.

C-4 is also allegedly rife with copyrighted and protected works, with the copyright symbol appearing more than 200 million times within the dataset.

In another example of alleged infringement, the complaint said Google used the exact digital version of one plaintiff’s book, as well as the insights and opinions she has offered to various media outlets, to develop Bard’s language model.

It also states that if a user asks Bard to reproduce paragraphs from the book, or to analyse or summarise it, Bard generates an output that would have been impossible without training Bard on the book itself.

Considering the scale of the alleged copyright violations, the individuals are concerned content creators will be “dissuaded from investing in the considerable costs of producing unique content in electronic formats”, claimed the lawsuit.

“This not only threatens to drastically reshape online accessibility of paid, restricted materials, but also imposes economic harm on a substantial number of content creators,” it added.

Plaintiffs: ‘Public outrage’ over Google data scraping

The suit highlighted a warning issued by the US Federal Trade Commission in June regarding the “sudden sprint” by AI tool developers to collect as much training data as they can find, in which the commission said: “Machine learning is no excuse to break the law… The data you use to improve your algorithms must be lawfully collected… companies would do well to heed this lesson.”

However, Google allegedly ignored this warning and “elected instead to quietly ‘update’ its online privacy policy last week to double-down on its position that everything on the internet is fair game for the company to take for private gain and commercial use, including to build and enhance AI products like Bard,” said the lawsuit.

The company’s “sudden notice and admission” regarding its scraping practices came three days after OpenAI was sued for theft and commercial misappropriation of personal data on the internet, as part of its own massive “scraping” operation, the suit alleges.

According to the complaint, once made public, Google’s data scraping practices caused outrage. In response to the backlash, Google invited public dialogue on data collection and protection in the age of AI.

However, this move faced criticism for being “too little, too late”, with many perceiving it as an attempt to salvage the company's reputation, the complaint alleged.

The suit claimed one “commentator” translated the company’s “invitation” into: “Now that we’ve already trained our LLMs on all your proprietary and copyrighted content, we will finally start thinking about giving you a way to opt out of any of your future content for being used to make us rich.”

The individuals in the class action argued that Google should have proactively sought consent or lawfully acquired data rather than engaging in widespread data theft.

The plaintiffs demanded that Google cease its ongoing violations of privacy and property rights, allowing internet users to opt out of data collection efforts. They also sought the deletion of illegally obtained data or fair compensation for its use.

Google’s alleged unauthorised scraping, duplication, and utilisation of copyrighted materials also constituted a clear breach of copyright laws, the complaint claimed.

“The unauthorised collection and use of copyrighted literary works in training Bard not only infringes on the rights of the producers but also damages the intrinsic value of the copyrighted works,” said the complaint.

AI prompts new laws and lawsuits

The lawsuit follows a growing number of cases accusing big tech companies of copying material to develop their AI software, amid an ongoing debate over ownership of data and content in AI products.

Last week, a group of authors, including US comedian Sarah Silverman, sued Meta and OpenAI (which created ChatGPT) for allegedly stealing their material to train AI models.

And two cases earlier this year featured AI art tools being sued for alleged copyright infringement of images: Getty Images v Stability AI; and a dispute between a trio of artists and three AI platforms: Stability AI, Midjourney, and DeviantArt.

Such litigation is prompting a growing global debate over the regulation of AI, including the protection of IP.

G7 leaders met at the Hiroshima Summit in May to discuss pan-global regulation on AI to make the technology “trustworthy”.

And last month, EU lawmakers backed the introduction of the landmark AI Act, thought to be the first of its kind in the world. The legislation could force companies behind tools such as ChatGPT to reveal the materials they used to train their generative AI systems.
