Cutting-edge artificial intelligence systems can help you escape a parking fine, write an academic essay, or fool you into believing Pope Francis is a fashionista. But the virtual libraries behind this breathtaking technology are vast – and there are concerns they are operating in breach of personal data and copyright laws.
The enormous datasets used to train the latest generation of these AI systems, like those behind ChatGPT and Stable Diffusion, are likely to contain billions of images scraped from the internet, millions of pirated ebooks, the entire proceedings of 16 years of the European parliament and the whole of English-language Wikipedia.
But the industry’s voracious appetite for big data is starting to cause problems, as regulators and courts around the world crack down on researchers hoovering up content without consent or notice. In response, AI labs are fighting to keep their datasets secret, or even flicking two fingers at regulators entirely and daring them to push the issue.
In Italy, ChatGPT has been banned from operating after the country’s data protection regulator said there appeared to be no legal basis to justify the collection and “massive storage” of personal data in order to train the GPT AI. On Tuesday, the Canadian privacy commissioner followed suit with an investigation into the company in response to a complaint alleging “the collection, use and disclosure of personal information without consent”.
Britain’s data watchdog expressed its own concerns. “Data protection law still applies when the personal information that you’re processing comes from publicly accessible sources,” said Stephen Almond, thedirector of technology and innovation at the Information Commissioner’s Office.
Michael Wooldridge, a professor of computer science at the University of Oxford, says “large language models” (LLMs) like those which underpin OpenAI’s ChatGPT and Google’s Bard hoover up colossal amounts of data.
“This includes the whole of the world wide web – everything. Every link is followed in every page, and every link in those pages is followed … In that unimaginable amount of data there is probably a lot of data about you and me,” he says, adding that comments about a person and their work could also be gathered by a LLM. “And it isn’t stored in a big database somewhere – we can’t look to see exactly what information it has on me. It is all buried away in enormous, opaque neural networks.”
Wooldridge adds that copyright will be a “coming storm” for AI companies. LLMs are likely to have accessed copyrighted material such as news articles. Indeed the GPT-4-assisted chatbot attached to Microsoft’s Bing search engine cites news sites in its answers. “I didn’t give explicit permission for my works to be used as training data, but they almost certainly were, and now they contribute to what these models know,” he says.
“Many artists are gravely concerned that their livelihoods are at risk from generative AI. Expect to see legal battles,” he adds.
Lawsuits have emerged already, with stock photo company Getty Images suing British startup Stability AI – the company behind AI image generator Stable Diffusion – after claiming that the image-generation firm violated copyright by using millions of unlicensed Getty Photos to train its system. In the US a group of artists is suing Midjourney and Stability AI in a lawsuit that claims the companies “violated the rights of millions of artists” in developing their products by using artists’ work without their permission.
Awkwardly for Stability, Stable Diffusion will occasionally spit out pictures with a Getty Images watermark intact, examples of which the photography agency included in its lawsuit. In January, researchers at Google even managed to prompt the Stable Diffusion system to recreate near-perfectly one of the unlicensed images it had been trained on, a portrait of American evangelist Anne Graham Lotz.
Copyright lawsuits and regulator actions against OpenAI are hampered by the company’s absolute secrecy about its own training data. In response to the Italian ban, Sam Altman, the chief executive of ChatGPT-developer OpenAI, said: “We think we are following all privacy laws.” But the company has refused to share any information about what data was used to train GPT-4, the latest version of the underlying technology that powers ChatGPT.
Even in its “technical report” describing the AI, the company curtly says only that it was trained “using both publicly available data (such as internet data) and data licensed from third-party providers”. Further information is hidden, it says, due to “both the competitive landscape and the safety implications of large-scale models like GPT-4”.
Others take the opposite view. EleutherAI describes itself as a “nonprofit AI research lab”, and was founded in 2020 with the goal of recreating GPT-3 and releasing it to the public. To that end, the group put together the Pile, an 825 gigabyte collection of datasets gathered from every corner of the internet. It includes 100GB of ebooks taken from the pirate site bibliotik, another 100GB of computer code scraped from the Github, and a collection of 228GB of websites gathered from across the internet since 2008 – all, the group acknowledges, without the consent of the authors involved.
Eleuther argues that the datasets in the Pile have all been so widely shared already that its compilation “does not constitute significantly increased harm”. But the group doesn’t take the legal risk of directly hosting the data, instead turning to a group of anonymous “data enthusiasts” called the Eye, whose copyright takedown policy is a video of a choir of fully clothed women pretending to masturbate while singing.
Some of the information produced by chatbots has also been false. ChatGPT has falsely accused a US law professor, Jonathan Turley, of George Washington University, of sexually harassing one of his students – citing a news article that didn’t even exist. The Italian regulator had also referred to the fact that ChatGPT’s responses do not “always match factual circumstances” and “inaccurate personal data are processed.”
Concerns over how AI are trained came as an annual report into progress in AI showed that commercial players are dominating the industry, over academic institutions and governments.
According to the 2023 AI Index report, compiled by California-based Stanford University, last year there were 32 significant industry-produced machine-learning models compared to just three produced by academia. Up until 2014, most of the significant models came from the academic sphere. But since then the cost of developing AI models, including staff and computing power, has risen.
“Across the board, large language and multimodal models are becoming larger and pricier,” said the Index. An early iteration of the LLM behind ChatGPT, known as GPT-2, had 1.5bn parameters, analogous to the neurons in a human brain, and cost an estimated $50,000 to train. By comparison, Google’s PaLM had 540bn parameters and cost an estimated $8m.
This has raised concerns that corporate entities will take a less measured approach to risk than academia or government-backed projects. Last week a letter whose signatories included Elon Musk and the Apple co-founder Steve Wozniak called for an immediate pause in the creation of “giant AI experiments” for at least six months. The letter said there were concerns that tech firms were creating “ever more powerful digital minds” that no one could “understand, predict, or reliably control”.
“Big AI means that these AIs are being created purely by large profit-driven corporates, which unfortunately means that our interests as human beings, aren’t necessarily well represented,” said Dr Andrew Rogoyski at the Institute for People-Centred AI at the University of Surrey.
He added: “We have to focus our efforts on making AI smaller, more efficient, requiring less data, less electricity so that we can democratise access to AI.”