Adobe Faces Legal Challenge Over Unauthorized Use of Authors' Works in AI Model Development

Adobe’s aggressive expansion into artificial intelligence is facing a significant legal setback. The company stands accused of incorporating pirated literary materials into its machine learning infrastructure—a move that has sparked a class-action lawsuit centered on copyright violations.
The Core Allegation
Author Elizabeth Lyon of Oregon has filed a proposed class-action suit claiming that Adobe used unauthorized copies of books, including her own, as training material for SlimLM, the company’s small language model designed for on-device document processing. According to the complaint, these works were incorporated without the authors’ consent or compensation.
How the Pirated Books Made Their Way Into Adobe’s System
The pathway to the alleged misuse traces back to SlimPajama-627B, a public dataset created by Cerebras and released in mid-2023, which Adobe used to pre-train SlimLM. The lawsuit describes a problematic chain of provenance: SlimPajama was derived from RedPajama, which in turn incorporated Books3, a repository of some 191,000 published works.
The critical issue: Books3 reportedly contains copyrighted material that was collected without proper authorization. When Adobe built upon this compromised foundation, the company allegedly inherited these copyright violations. As Lyon’s legal team notes, SlimLM became a derivative work containing unauthorized literary content.
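The chain of derivation described above can be sketched as a toy provenance model: a derived dataset transitively inherits every upstream source, which is exactly how Books3 material would end up in a model trained only on SlimPajama. The classes and names here are purely illustrative, not any real dataset API.

```python
# Toy sketch of dataset provenance (illustrative only, not a real API):
# a dataset derived from another inherits all of its upstream sources.
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    sources: list = field(default_factory=list)  # upstream datasets

    def all_sources(self) -> set:
        """Recursively collect the names of every upstream dataset."""
        found = set()
        for src in self.sources:
            found.add(src.name)
            found |= src.all_sources()
        return found


books3 = Dataset("Books3")
redpajama = Dataset("RedPajama", sources=[books3])
slimpajama = Dataset("SlimPajama-627B", sources=[redpajama])

# A model pre-trained on SlimPajama transitively includes Books3 material.
print("Books3" in slimpajama.all_sources())  # prints True
```

The point of the sketch is that provenance is transitive: vetting only the immediate dataset (SlimPajama) is not enough, because the legal status of every upstream source travels with it.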
A Pattern Emerging Across the Industry
Adobe is hardly the first technology firm to face such accusations. The underlying datasets fueling modern AI systems have become a minefield of copyright disputes:
Apple Intelligence Model: In September, Apple was sued for allegedly training its AI system on RedPajama-sourced material without compensating rights holders
Salesforce’s Training Practices: October saw similar litigation against Salesforce, accusing the company of using RedPajama datasets inappropriately
Anthropic’s Settlement: Most significantly, Anthropic agreed to a $1.5 billion settlement with authors in September, acknowledging it had incorporated pirated works into Claude’s training pipeline
Why This Matters
Training modern AI models requires enormous quantities of text. When developers source from compilations like Books3 or RedPajama without thoroughly vetting their legal provenance, they take on institutional risk. The string of lawsuits suggests that relying on these datasets, however convenient, now carries substantial legal exposure.
For Adobe and similar companies, the message is becoming inescapable: cutting corners on training data sourcing can prove far more expensive than legitimate licensing arrangements.