Child Sex Abuse Material Was Found In a Major AI Dataset. Researchers Aren’t Surprised

The LAION-5B dataset is the basis for numerous AI models. Researchers have long warned that massive training datasets are poorly audited.

Over 1,000 images of sexually abused children have been discovered inside the largest dataset used to train image-generating AI, shocking everyone except for the people who have warned about this exact sort of thing for years.

The dataset was created by LAION, the non-profit organization behind the massive image datasets used by generative AI systems like Stable Diffusion. Following a report from researchers at Stanford University, 404 Media reported that LAION confirmed the presence of child sexual abuse material (CSAM) in the dataset, called LAION-5B, and scrubbed it from its online channels.


The LAION-5B dataset contains links to 5 billion images scraped from the internet.

AI ethics researchers have long warned that the massive scale of AI training datasets makes it effectively impossible to filter them, or to audit the AI models that use them. But tech companies, eager to claim their piece of the growing generative AI market, have largely ignored these concerns, building their various products on top of AI models that are trained using these massive datasets. Stable Diffusion, one of the most commonly used text-to-image generation systems, is based on LAION data, for example, and various other AI tools incorporate parts of LAION’s datasets in addition to other sources.

This, AI ethics researchers say, is the inevitable result of apathy.

“Not surprising, [to be honest]. We found numerous disturbing and illegal content in the LAION dataset that didn’t make it into our paper,” wrote Abeba Birhane, the lead author of a recent paper examining the enormous datasets, in a tweet responding to the Stanford report. “The LAION dataset gives us a [glimpse] into corp datasets locked in corp labs like those in OpenAI, Meta, & Google. You can be sure, those closed datasets, rarely examined by independent auditors, are much worse than the open LAION dataset.”

LAION told 404 Media that it was taking the dataset down “temporarily” in order to remove the CSAM content the researchers identified. But AI experts say the damage is already done.

“It’s sad but really unsurprising,” Sasha Luccioni, an AI and data ethics researcher at HuggingFace who co-authored the paper with Birhane, told Motherboard. “Pretty much all image generation models used some version of [LAION]. And you can’t remove stuff that’s already been trained on it.”

The issue, said Luccioni, is that these massive troves of data aren’t being properly analyzed before they’re used, and the scale of the datasets makes filtering out unwanted material extremely difficult. In other words, even if LAION manages to remove specific unwanted material after it’s discovered, the sheer size of the data means it’s virtually impossible to ensure that all of it is gone, especially if no one cares enough to even try before a product goes to market.

“Nobody wants to work on data because it’s not sexy,” said Luccioni. “Nobody appreciates data work. Everyone just wants to make models go brrr.” (“Go brrr” is a meme referring to a hypothetical money-printing machine.)

AI ethics researchers have warned for years about the dangers of AI models and datasets that contain racist and sexist text and images pulled from the internet, with study after study demonstrating how these biases result in automated systems that replicate and amplify discrimination in areas such as healthcare, housing, and policing. The LAION dataset is another example of this “garbage-in, garbage-out” dynamic, where datasets filled with explicit, illegal or offensive material become entrenched in the AI pipeline, resulting in products and software that inherit all of the same issues and biases.

These harms can be mitigated by fine-tuning systems after the fact to try to prevent them from generating harmful or unwanted outputs. But researchers like Luccioni warn that these technological tweaks don’t actually address the root cause of the problem.

“I think we need to rethink the way we collect and use datasets in AI, fundamentally,” said Luccioni. “Otherwise it’s just technological fixes that don’t solve the underlying issue.”


New York, US
Image: SOPA Images/Contributor via Getty Images