AI Development as a Copyright Battlefield: On a Fair Use Provision for Text and Data Mining

Feb 23

Developments in artificial intelligence are reshaping systems of technology and knowledge, sparking controversy about AI’s ethical implications. This paper analyzes how AI fits within the current understanding of U.S. copyright law. I propose the creation of a fair use text and data mining (TDM) exception for artificial intelligence, enabling the United States to respond to copyright ambiguities created by deep learning algorithms.

We often think about AI and copyright from the perspective of outputs, such as the extent to which an algorithm’s work can be copyrighted or whether an algorithm’s products could infringe a copyright. [1, 2] However, debates over the legality of using copyrighted data as AI inputs raise similar ethical and legal issues.

Large language models (LLMs) are built through deep learning, which enables these algorithms to handle increasingly complex concepts.[3] This process, through which the system builds connections and develops itself, requires massive amounts of data, or inputs. Training draws on billions of sources, all scraped from the internet through text and data mining (TDM). [4] In most cases, each work is then copied and included in a dataset that the LLM learns from. Notably, these works do not appear in the LLM or its outputs; instead, they serve to develop the system.

This procedure poses a conundrum: when a human copies a complete work, that copying would in most cases constitute copyright infringement, but in TDM, which is required for the development of AI, the inputs are not expressed in the final product. [5] Uncertainty about how the U.S. Fair Use Doctrine applies to AI development has produced lawsuits and left the industry without clear guidance. To what extent should AI companies be responsible for compensating rights holders for the use of copyrighted works in training LLMs, and how can government regulation solve this dilemma?

The U.S. developed the Fair Use Doctrine, which allows use of a copyrighted work without a license, in order to balance copyright protections with creative freedoms. [6] Fair use analysis weighs four factors: the purpose and character of the use, including whether it is “transformative”; the nature of the copyrighted work; how much of the work was used; and how the use affects the work’s market.

Although TDM has historically been considered fair use, AI’s widespread impact has created ambiguity over whether TDM for AI meets this benchmark. [7] While the transformative nature of the use might point toward fair use, TDM requires complete copying, most of the works copied are creative, and LLMs can in some cases affect a work’s market. Under the current Fair Use Doctrine, one could reach different conclusions about whether fair use applies depending on which factor one emphasizes. Such ambiguity leads to lawsuits, wastes resources, and inhibits innovation.

In June 2025, courts ruled in favor of both Anthropic, the creator of Claude, and Meta, the creator of Llama and Meta AI, on the legality of training their respective LLMs on copyrighted books. [8] In Anthropic’s case, Judge Alsup ruled that the training could be categorized as fair use because it was fundamentally transformative. In Meta’s case, Judge Chhabria ruled in favor of the social media giant more hesitantly, explaining that the plaintiffs had not proven sufficient harm. In his ruling, Chhabria indicated that he might have ruled differently had the case been a class action, noting that “this ruling does not stand for the proposition that Meta’s use of copyrighted materials to train its language models is lawful.” [9]

These decisions reveal that the legality of AI companies’ use of copyrighted works remains unsettled. This ambiguity harms both companies and authors. Infringers can owe statutory damages of up to $150,000 per work for willful infringement, which, considering the widespread nature of AI and the sheer amount of data it requires, could add up to unfathomably large awards. [10] While such an extreme outcome is unlikely, concerns over possible suits place companies in a perpetual state of uncertainty, hurting AI innovation within the United States.
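The scale of that exposure can be sketched with simple arithmetic. The Python snippet below is only an illustration: the $750 and $30,000 bounds come from the statutory damages provision of the Copyright Act (17 U.S.C. § 504(c)) rather than from this article, and the dataset size of one million works is a hypothetical figure, not a number drawn from any lawsuit.

```python
# Per-work statutory damages bounds under 17 U.S.C. § 504(c), in dollars.
MIN_PER_WORK = 750       # ordinary minimum
MAX_PER_WORK = 30_000    # ordinary maximum
MAX_WILLFUL = 150_000    # maximum for willful infringement

# Hypothetical dataset size, chosen only for illustration (an assumption).
HYPOTHETICAL_WORKS = 1_000_000


def exposure(num_works: int, per_work: int) -> int:
    """Total damages if every work drew the same per-work award."""
    return num_works * per_work


if __name__ == "__main__":
    rates = [
        ("minimum", MIN_PER_WORK),
        ("ordinary maximum", MAX_PER_WORK),
        ("willful maximum", MAX_WILLFUL),
    ]
    for label, rate in rates:
        print(f"{label}: ${exposure(HYPOTHETICAL_WORKS, rate):,}")
```

Under these assumptions, the willful maximum alone comes to $150 billion for one million works, which gives a sense of why even an unlikely worst case weighs on companies.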

I propose that the U.S. Copyright Office explicitly create a fair use exception pertaining to artificial intelligence for, and only for, text and data mining for LLM development. This would provide legal certainty to developers and protect authors by eliminating fair use ambiguities.

The EU Digital Single Market (DSM) Directive serves as a fairly successful model for this structure. [11, 12] It gives research organizations complete access to copyrighted materials for TDM but grants AI companies only conditional use, allowing authors to opt out. Creating a fair use exception for AI text and data mining, with or without an opt-out condition, would provide companies legal certainty without dramatically altering U.S. fair use principles.

Notably, I recommend that the United States adopt a complete exemption for all TDM, using a blanket provision rather than an opt-out provision to streamline the law and prevent enforcement challenges. Opt-out provisions would do more harm than good to both creatives and technology companies. [13] Companies would be forced to use only works covered by licensing agreements, so inputs would predominantly represent the mainstream media of large conglomerates able to navigate the licensing process. This disparity would exacerbate issues with AI biases, resulting in negative externalities more worrisome than any benefit to authors.

Because AI systems have already been developed and already include so many data points, an opt-out policy would be difficult to implement accurately. As illustrated by AI companies’ circumvention of such regulations in the EU, the black-box nature of AI means that an opt-out would only marginally benefit authors in the first place. [14] Authors who opt out would still end up being harmed, losing market share to the LLM.

This policy proposal has many strengths. AI companies would be able to maximize their algorithms’ productivity without having to worry about lawsuits. Explicitly limiting fair use protections for AI to TDM would strengthen profitability by removing concerns that the proprietary technology itself violates copyright law, while also encouraging companies to be more careful about copyright violations in AI outputs.

Additionally, the clarity of a blanket acceptance of TDM as fair use would make regulation easier. It is difficult to measure the effect that including a work in an LLM’s training dataset has on the model’s parameters, which undermines the feasibility of any law that requires licensing or tracking of LLM inputs. Without that requirement, this law would be much easier to implement than the European Directive’s opt-out procedure.

One might object that this law provides insufficient compensation for authors’ role in LLM development. Since written works form the basis of LLM development, authors argue that they should get a portion of the profits. With OpenAI reaching more than $12 billion in annualized revenue this year, it is understandable that they want a piece of the pie. [15] This argument is unconvincing for three reasons.

Authors argue against fair use of their copyrighted works in AI TDM because it results in a “loss of profit.” There is no doubt that AI fundamentally changes how we process information and how we respond to creative works, but that is not relevant to TDM itself. TDM inputs are used to shape AI’s language processing, not to create a permanent response bank; they are deleted after training is complete. Since any profit loss relates to LLM outputs, not the model itself, preventing authors from suing over inputs would not disadvantage them.

If anything, this law’s explicit “if and only if” principle would help authors by clarifying that companies are allowed to freely use copyrighted works only for training. TDM copies are non-expressive and not retained in a retrievable form. Because the exception covers only TDM and deep learning, which pertain to algorithm development, companies would no longer be able to argue that their use of a work in an output was fair use. This distinction strengthens authors’ legal position. Because copyright centers on expressed work, this proposal would support authors by preserving copyright protection for their works where it matters, giving them a stronger basis from which to sue over the inclusion of such works in AI products.

While a less creator-centric approach may be considered a weakness, the TDM fair use exception remains the best option available. This is because it is impossible to determine which inputs are used in an LLM and how. More author-focused options, such as licensing, would be nearly impossible to implement and would create prohibitive costs. [16] Licensing would jeopardize the U.S. AI market, creating a much larger negative externality than my proposal. As such, a TDM fair use exemption remains the best way to balance author and AI company incentives, providing the legal security necessary for industry innovation while also creating space for copyright protections pertaining to LLM outputs.

Edited by Taran Srikonda

Endnotes

[1] Adil S. Al-Busaidi et al., "Redefining boundaries in innovation and knowledge domains: Investigating the impact of generative artificial intelligence on copyright and intellectual property rights," Journal of Innovation & Knowledge Vol. 9, No. 4: 100630, December 2024, https://www.sciencedirect.com/science/article/pii/S2444569X24001690#:~:text=Abstract,in%20a%20GenAI%2Ddriven%20era.

[2] Enrico Bonadio et al., "Can artificial intelligence infringe copyright? Some reflections," Research handbook on intellectual property and artificial intelligence, pp. 245-257, December 13, 2022, https://eprints.lse.ac.uk/117745/1/McDonagh_can_artificial_intelligence_infringe_copyright_accepted.pdf.

[3] Melanie Mitchell, Artificial intelligence: A guide for thinking humans, Penguin UK, 2019.

[4] Rita Matulionyte, "Australian Copyright Law Impedes the Development of Artificial Intelligence: What Are the Options?" IIC-International Review of Intellectual Property and Competition Law vol 52, no. 4 (2021): 417-443, https://link.springer.com/article/10.1007/s40319-021-01039-9#citeas. Pgs. 419-20

[5] Joshua Love, “Geopolitics of AI: Text and data mining in U.S.,” Reed Smith. February 5, 2024, https://www.reedsmith.com/en/perspectives/ai-in-entertainment-and-media/2024/02/text-and-data-mining-in-us.

[6] “U.S. Copyright Office Fair Use Index,” U.S. Copyright Office, Last updated August 2025, https://www.copyright.gov/fair-use/.

[7] Joshua Love, “Geopolitics of AI: Text and data mining in U.S.,” Reed Smith. February 5, 2024, https://www.reedsmith.com/en/perspectives/ai-in-entertainment-and-media/2024/02/text-and-data-mining-in-us.

[8] Chloe Veltman, “In a first-of-its-kind decision, an AI company wins a copyright infringement lawsuit brought by authors,” NPR. June 25, 2025, https://www.npr.org/2025/06/25/nx-s1-5445242/federal-rules-in-ai-companys-favor-in-landmark-copyright-infringement-lawsuit-authors-bartz-graeber-wallace-johnson-anthropic.

[9] Chloe Veltman, “In a first-of-its-kind decision, an AI company wins a copyright infringement lawsuit brought by authors,” NPR. June 25, 2025, https://www.npr.org/2025/06/25/nx-s1-5445242/federal-rules-in-ai-companys-favor-in-landmark-copyright-infringement-lawsuit-authors-bartz-graeber-wallace-johnson-anthropic.

[10] Dave Hansen, “The Bartz v. Anthropic Settlement: Understanding America's Largest Copyright Settlement.” Kluwer Copyright Blog, November 10, 2025, https://legalblogs.wolterskluwer.com/copyright-blog/the-bartz-v-anthropic-settlement-understanding-americas-largest-copyright-settlement/.

[11] Serena Chu Lightstone, "Train or restrain? Using international perspectives to inform the American fair use analysis of copyright in generative artificial intelligence training," Northwestern Journal of International Law & Business Vol. 44, No. 3, Spring 2024, pgs 471-504, HeinOnline, Pg. 474.

[12] Saliltorn Thongmeensuk, "Rethinking copyright exceptions in the era of generative AI: Balancing innovation and intellectual property protection," The Journal of World Intellectual Property 27, no. 2 (2024): 278-295.

[13] Martin Senftleben, “The TDM Opt-Out in the EU – Five Problems, One Solution,” Kluwer Copyright Blog, April 22, 2025, https://legalblogs.wolterskluwer.com/copyright-blog/the-tdm-opt-out-in-the-eu-five-problems-one-solution/.

[14] Annica Ryng, “CMOs challenges in the AI era - part 1,” Society of Audiovisual Authors, August 11, 2025, https://www.saa-authors.eu/articles/cmos-challenges-in-the-ai-era#:~:text=AI%20companies%20bypass%20licensing%20obligations,is%20unfairly%20placed%20on%20them.

[15] “OpenAI hits $12 billion in annualized revenue, The Information reports,” Reuters, July 30, 2025, https://www.reuters.com/business/openai-hits-12-billion-annualized-revenue-information-reports-2025-07-31/.

[16] Rita Matulionyte, "Australian Copyright Law Impedes the Development of Artificial Intelligence: What Are the Options?" IIC-International Review of Intellectual Property and Competition Law vol 52, no. 4 (2021): 417-443, https://link.springer.com/article/10.1007/s40319-021-01039-9#citeas. Pgs. 429-34.

Margalit Salkin
