Notes: Ben Sobel is a post-doctoral fellow at Cornell Tech, and he will join us for the discussion. This is not a comprehensive overview of the infringement issues involved in AI training; instead, it focuses on one of the most important ones.
Questions:
How are (potentially) copyrighted works used in AI training?
Why would copyright law have a problem with this?
When should fair use protect AI training? If your answer is not “always” or “never,” how should courts draw the line?
This article was written before the recent explosion of interest in generative AI. Does anything in the analysis change when AI is used to create new creative works, rather than to make (non-expressive) decisions?
Additional Resources:
Amanda Levendowski, How Copyright Law Can Fix Artificial Intelligence’s Implicit Bias Problem, 93 Washington Law Review 579 (2018). Another early article about copyright and AI training, but one that framed the problem a little differently than Sobel did: copyright stands in the way not of AI innovation but of AI fairness. Compare Amanda Levendowski, Resisting Face Surveillance with Copyright Law, 100 North Carolina Law Review 1015 (2022), in which Levendowski argues that copyright should be used to limit some forms of AI training in the name of other important social values.
Mark A. Lemley and Bryan Casey, Fair Learning, 99 Texas Law Review 743 (2021). A particularly clear and well-written take on fair use and AI training.
James Grimmelmann, Copyright for Literate Robots, 101 Iowa Law Review 657 (2016). My own piece on bulk technological uses of copyrighted works applies to AI, but is not entirely about AI. It’s a little more polemical than the other readings here.
Benjamin L.W. Sobel, Elements of Style: Copyright, Similarity, and Generative AI, 38 Harvard Journal of Law and Technology (forthcoming). A more recent piece by Ben, which focuses on substantial similarity of AI outputs rather than on fair use for AI training.
Mark A. Lemley, How Generative AI Turns Copyright Upside Down, 25 Columbia Journal of Law and the Arts 21 (2024). Another Lemley piece, one that gets at the subtle and tricky relationship between AI prompts and AI outputs.
A. Feder Cooper and James Grimmelmann, The Files are in the Computer: On Copyright, Memorization, and Generative AI, Chicago-Kent Law Review (forthcoming). Is a model a “copy” of the works it was trained on? We think the answer can be “yes,” at least sometimes—when a generative model has memorized a work, in the sense that it can produce a substantially similar version of that work as an output.
Matthew Sag, Copyright Safety for Generative AI, 61 Houston Law Review 295 (2023). One of the most important articles on fair use and AI training from the modern era, i.e., after the launch of ChatGPT. Sag makes several excellent points about the different ways in which an AI system can learn from a copyrighted work, some of which may be infringing and some of which may not.