The Atlantic exposes music in AI training datasets

Strategic Overview

01. The Atlantic published four searchable databases revealing over 21 million tracks (some reports cite north of 22 million) used to train generative AI music models.
02. The four datasets break down into two large collections of roughly 12 million and 9 million tracks and two smaller ones of about 100,000 tracks each, the largest being LAION-DISCO-12M released by the German non-profit LAION in November 2024.
03. Google and Stability AI confirmed in research papers that they trained on tracks from one of the smaller collections, the Free Music Archive, a ~100,000-track academic dataset published in 2017.
04. The datasets contain hits from major artists including Taylor Swift, Bad Bunny, Billie Eilish, Nirvana, Pearl Jam, and the Beatles, alongside tens of thousands of independent, jazz, and classical artists.

The 'Research Only' Label That Holds 21 Million Songs

The most revealing detail in The Atlantic's investigation is not the size of the datasets but how innocuous their packaging is. The largest, LAION-DISCO-12M, was released in November 2024 by LAION, a German non-profit, and contains no audio files at all ^[1]. It is a manifest: 12+ million links to YouTube tracks plus metadata, explicitly published "for research purposes" and intended for use "in academic settings" ^[1]. The Free Music Archive, one of the two smaller ~100,000-track collections, follows the same template: it was assembled in 2017 by academic researchers for music-information-retrieval work, drawn from an archive directed by the radio station WFMU, and built on Creative Commons licenses ^[1].

That framing is the entire problem. A dataset that is technically just a list of URLs and a stack of CC-tagged files looks harmless on a project page, but it is a turnkey ingestion pipeline for anyone training a model. The Atlantic's reporting shows the gap between the stated intent (academic study) and the actual use (commercial music generation) is where the consent of millions of artists evaporated ^[2]. The non-profit can say "don't deploy this commercially" and the developer can quietly do exactly that, because nothing in the dataset's structure enforces the boundary. The reporting also stresses a hard limit of the evidence: the tool can show which songs sit inside a dataset, but not which specific company pulled them out ^[3].
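To make the "manifest, not audio" point concrete, here is a minimal Python sketch of what searching a links-only dataset could look like. Everything in it is an assumption for illustration: the `ManifestEntry` fields, the `find_by_artist` helper, and the placeholder rows are hypothetical and do not reflect the real schema or contents of any of the four datasets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ManifestEntry:
    # Hypothetical fields: a link plus metadata, with no audio stored.
    url: str
    title: str
    artist: str


def find_by_artist(manifest, artist):
    """Return entries whose artist matches, ignoring case and surrounding spaces."""
    wanted = artist.strip().casefold()
    return [e for e in manifest if e.artist.strip().casefold() == wanted]


# Placeholder rows, not real dataset contents.
manifest = [
    ManifestEntry("https://example.com/watch?v=aaa", "Song One", "Example Artist"),
    ManifestEntry("https://example.com/watch?v=bbb", "Song Two", "Another Artist"),
    ManifestEntry("https://example.com/watch?v=ccc", "Song Three", "example artist "),
]

matches = find_by_artist(manifest, "Example Artist")
print(len(matches))  # 2
```

The sketch also shows the evidentiary limit the reporting describes: a lookup can confirm that a track is listed, but no field in a structure like this records who followed the link and downloaded the audio.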

Why a Searchable Tool Changes the Lawsuits, Not Just the Discourse

Until now, the central obstacle for artists has been evidentiary: AI music companies kept their training sources hidden, making it nearly impossible to prove any individual song was ingested ^[2]. The Atlantic's searchable database flips that. Artists, labels, and legal teams can now look up specific tracks and confirm they appear in the datasets — the kind of empirical proof that has been hard to produce in court ^[2]. That matters because the stakes are already concrete: the RIAA's suits against Suno and Udio seek up to $150,000 per infringed work, and UMG and Sony moved to add more than 61,000 recordings to the Suno case alone ^[4].
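A rough sense of scale for those two figures can be worked out in a few lines of Python. This is an illustrative ceiling only, under two loud assumptions that the source does not state: that each of the 61,000 recordings counts as one separately infringed work, and that each receives the maximum of $150,000. The `damages_ceiling` helper is hypothetical, and the result is not a prediction of any award.

```python
MAX_PER_WORK_USD = 150_000  # the "up to" figure cited in the suits
RECORDINGS_ADDED = 61_000   # lower bound: the text says "more than 61,000"


def damages_ceiling(works, per_work=MAX_PER_WORK_USD):
    """Upper bound if every work drew the maximum per-work amount (assumption)."""
    return works * per_work


ceiling = damages_ceiling(RECORDINGS_ADDED)
print(f"${ceiling:,}")  # $9,150,000,000
```

Even as a theoretical maximum, a figure above $9 billion helps explain why licensing deals look attractive to defendants.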

The defendants have not been subtle about what they did. In federal court responses, Suno and Udio admitted training on essentially all music of reasonable quality on the open internet, leaning on a fair use defense that intermediate training copies are non-infringing ^[5]. The RIAA's framing is the mirror image: that there is "nothing fair about stealing an artist's life's work, extracting its core value, and repackaging it" ^[5]. With Warner having settled with Suno and UMG with Udio, but Sony holding out, a pivotal Sony ruling is expected in summer 2026 ^[6], and a public, queryable record of what is inside these datasets lands at exactly the moment it can shape that fight.

The Independent Artists Who Can't Afford to Fight

The headline names — Taylor Swift, Bad Bunny, Billie Eilish, Nirvana, the Beatles — are not the ones who carry the cost ^[7]. The datasets also swept in tens of thousands of independent musicians, jazz players, and classical composers who have no label legal department behind them ^[7]. The blunt assessment from the independent side is that recourse is effectively gated behind the majors: as Valholla Records' Vince Valholla put it, "until the major labels go through their lawsuits, there's no way for artists or labels to fight back" ^[8]. Rights-collection bodies are voicing the same anger in starker terms, with APRA AMCOS chief Dean Ormston describing the use of songwriters' catalogs without permission, licence, or payment as treating their life's work as expendable ^[8].

Community reaction has tracked that asymmetry. Across X, the dominant sentiment from working musicians has been a mix of vindication and alarm — relief at finally having proof, paired with the realization that even small back-catalogs were ingested. AI-ethics figure Ed Newton-Rex captured the moment by noting that recordings he sang on were in the datasets. On Reddit, the prevailing read was cynical about a perceived double standard in how copyright is enforced — that mass scraping gets reframed as innovation when a company does it — alongside grim resignation that underground and independent artists were pulled in just the same.

What Everyone's Missing: The Pipeline Is Already Repricing Itself

The most forward-looking signal is that the industry is not waiting for a verdict to change behavior. The settlement wave is steering the biggest defendants toward licensed AI music platforms — Warner with Suno, UMG with Udio — which reframes the whole dispute from "is scraping legal" to "what does licensed training cost" ^[6]. Developer and creator discussion has already pivoted in that direction, with the most-watched commentary focused on the Suno settlement's licensing implications and on Suno moving to retire its original unlicensed models in favor of licensed data.

That is the contrarian read worth holding onto: The Atlantic's database is being received as an exposé, but its longer-term effect may be commercial rather than punitive. Once a song's presence in a training set is searchable and provable, "we only used freely available content" stops being a defensible posture ^[3], and licensing becomes the path of least resistance for any company that wants to ship without litigation risk. The open question is whether that consolidation actually pays independent artists — or whether it simply moves the negotiation behind the same major-label doors that already gate everything else, leaving the long tail with proof of theft but still no seat at the table.

Historical Context

2017

The Free Music Archive dataset of roughly 100,000 tracks was published by academic researchers for music-information-retrieval work, drawing on the Free Music Archive, a library directed by US radio station WFMU.

2024-06

The RIAA sued Suno and Udio on behalf of UMG, Sony, and Warner over unauthorized use of copyrighted recordings, seeking up to $150,000 per work.

2024-08

Both companies filed federal court responses admitting they trained on essentially all music of reasonable quality on the open internet, defending it as fair use.

2024-11-19

LAION released LAION-DISCO-12M, the largest publicly available open music dataset, with 12+ million YouTube audio links and metadata.

2025-10

Warner settled with Suno and UMG settled with Udio, with both deals pointing toward licensed AI music platforms. Independent artists filed class actions while Sony held out, heading toward a pivotal ruling expected in summer 2026.

2026-06-16

The Atlantic published its investigation and searchable database revealing the four datasets totaling 21+ million tracks.

Power Map

Key Players

The Atlantic / Alex Reisner

Investigative publisher and staff writer who identified the four datasets and built the public searchable tool, converting opaque training data into court-usable evidence.

LAION

German non-profit that compiled and released LAION-DISCO-12M, the largest dataset; says it is for research and academic use only and warns against commercial deployment.

Google and Stability AI

AI developers who confirmed in research papers they trained on the Free Music Archive; Stability emphasizes Stable Audio 3.0 was trained on licensed music.

Suno and Udio

AI music generators defending themselves on fair use grounds; both admitted in court filings that they trained on essentially all music of reasonable quality on the open internet.

RIAA / major labels (UMG, Sony, Warner)

Plaintiffs in landmark copyright suits against Suno and Udio; UMG and Sony sought to add 61,000+ recordings to the Suno case.

Independent artists and songwriters

Rights holders using the tool to find their catalogs in datasets without consent, with little recourse of their own until the major-label suits resolve.

THE SIGNAL.

Analysts

"Companies often claim to use only content that is freely available online, but the datasets reveal the quantity of downloadable music that developers can access even though it is not supposed to be free."

Alex Reisner

Staff writer and investigative journalist, The Atlantic

"No permission, no licence, no payment. These are not bargaining chips, they are the life's work of Australian and New Zealand songwriters."

Dean Ormston

Chief Executive, APRA AMCOS

"To be honest, until the major labels go through their lawsuits, there's no way for artists or labels to fight back."

Vince Valholla

Head of Valholla Records

"There's nothing fair about stealing an artist's life's work, extracting its core value, and repackaging it."

RIAA

Recording Industry Association of America (representing major labels)

The Crowd

"The Atlantic just released a tool that lets you see if your music has been scraped for AI training. Recordings I sang on in King's College Choir are in there. So is the music of millions of other musicians. Great work by @_alexreisner. Check it here: https://t.co/2oT8tLeCJM"

@ednewtonrex

"Late last night I found out over 100+ songs from our catalog were used to train AI models. Thanks to The Atlantic, they leaked a database of millions of songs that have been used by the biggest AI music companies like Udio and Suno. To be honest, until the major labels go https://t.co/3d2cmei0u9"

@VinceValholla

"AI music generators are trained on an unfathomable number of songs, Alex Reisner reports. Search for an artist or track in four giant data sets he obtained: https://t.co/PCiJsRACSG"

@TheAtlantic

"Investigation by The Atlantic reveals many millions of songs used for AI music training"

@u/Plastic_Ninja_9014573

Broadcast

The Atlantic Claims Millions Of Songs Are Used To Train AI Music

Suno AI Lawsuit Just Settled - What Does It Mean for Music Licensing?

Massive Suno v5.5 UPDATE — The Copyright Lawsuit Every AI Artist Should Be Watching