
ai matters: Copyright x Generative AI

First up in this series on intellectual property (IP) issues impacted by and evolving with generative AI (deep learning) is copyright. For simplicity, let's divide copyright issues into two categories. The first is "traditional" copyright, which covers what we will call artistic expression and includes creations such as photos, musical lyrics, literature, and other visual arts. We are calling this 'traditional' copyright here because these were the types of work originally considered when drafting various copyright laws in the US and elsewhere. The second category is software. While the law itself does not distinguish between copyright in more 'traditional' art and in software, the particular issues and the application of the law diverge enough to make the distinction worthwhile when exploring current developments as they intersect with AI.

In this installment of the series, let's start with traditional copyright. This is the area that has arguably attracted the most attention and for which there is a growing body of litigation and, more recently, some legislation. Let's first orient ourselves with some basics of copyright law, then walk through the current case clusters in the US and talk briefly about each.

Copyright in the U.S.

First, perhaps, it is helpful to understand the distinctions within the copyright issues. In the U.S., copyright attaches (i.e., is created) automatically when a work is "fixed in a tangible medium of expression" (Copyright Act, 17 U.S.C. §102(a)). Ideologically and practically, these requirements were intended to encourage the expression and communication of creative ideas. In some interpretations, copyright law aimed to "protect the artists" from others stealing their ideas. Whether this has been successful, and even whether this is in fact a primary goal of copyright law, is an open topic for discussion. However, for the current cases at hand, the issues have revolved around the rights of creators in 1) input, i.e., works used to train AI systems (e.g., training data), and 2) output, i.e., works created in whole or in part by AI systems.

Notably, in the U.S., a copyright does not need to be registered with the U.S. Copyright Office to exist, but registration is required before the copyright can be asserted in an infringement suit. Additional benefits may also attach to early registration, including the availability of statutory damages and legal fees in copyright infringement actions, so while copyright attaches at creation, it is often advisable to register as early as possible.

Photography and Artwork

Simply put, the claim at the core of much of the copyright litigation regarding visual arts is the question: does the training of AI systems infringe the copyright of the creators and/or owners of the images used in the training? The rationale in many of these suits is that training necessarily involves copying the copyrighted works.

In mid-to-late 2022 through 2023, a number of lawsuits were pending in the U.S. between visual arts creators and owners (licensees) and visual generative AI companies, including Stability AI, OpenAI, and Midjourney, to name a few. While seemingly ancient history at this point, these early cases indicated to IP attorneys, and to the public more broadly, that the initial wide-sweeping use of copyrighted works in training AI systems might not be as straightforward as hoped. These cases have been grouped here in an effort to provide a high-level overview; there are other excellent resources for comprehensive tracking of the ongoing litigation.

Image generator: One of the earliest suits was Getty Images v. Stability AI. Getty accused Stability AI of scraping and misusing over 12 million of its photos to train the models in its system. Stable Diffusion was trained on 5 billion image-text pairs (to be precise, the pairs were created by a third party, LAION, at Stability AI's direction). Getty also sued under trademark law (the Lanham Act and Delaware trademark law) and unfair competition laws, and filed a parallel suit in the UK. Notably, some images output by Stable Diffusion contained Getty's watermark. The case is currently pending.

Class action: Another early suit was initiated by three artists, Sarah Andersen, Kelly McKernan, and Karla Ortiz, against Stability AI, Midjourney, and DeviantArt, alleging direct and vicarious copyright infringement, DMCA violations, violations of the right of publicity, and other claims. They asserted that the companies have infringed the rights of "millions of artists" by training their tools on images scraped from the web without consent. Notably, the plaintiffs argue that the AI-generated images compete with the original images in the marketplace and therefore threaten the artists' career paths, an important element to keep in mind in the discussion of fair use below.

Literature and Writing

By this time, you have almost certainly heard of the case launched by the New York Times against Microsoft and OpenAI, based on the assertion that chatbots were trained using NYT content. While this case includes both an 'input' and an 'output' assertion, the input assertion aligns closely with the visual arts claims: the NYT claims its article content was scraped to train the systems. The output claim relates to the fact that ChatGPT and others produce text that is remarkably similar (in some cases identical) to NYT article text, including articles behind the NYT paywall. Whether the text is freely available or sequestered behind a paywall, however, has little to do with the copyright infringement issue. It does, however, likely factor into an unjust enrichment element of fair use, as discussed in the fair use section below.

Similarly, comedian Sarah Silverman and other authors sued OpenAI and Meta, claiming, among other things, that OpenAI directly infringed their copyrights by using their works as part of the training data for ChatGPT. Earlier this year, a judge dismissed the majority of their claims, but their claim of direct copyright infringement is still pending.

In mid-March 2024, three authors brought a class action suit against Nvidia, claiming copyright infringement for Nvidia's unauthorized use of their copyrighted books to train its AI model NeMo. This case differs in the source of the dataset. The claim asserts that Nvidia did not copy from the authors' works directly, but rather trained its models on a dataset ("The Pile") prepared by a third party and hosted on Hugging Face (an AI collaboration platform). The Pile, however, contained in large part a dataset called Books3, which contained the contents of nearly 200,000 books, some of which were written by the authors leading the suit.

Many have referenced the Google Books case, which, while not a generative AI case, has parallels here. The case, finally resolved in 2015 after nearly a decade of litigation, was a class action against Google for its Google Books project, in which Google and university libraries scanned books from the libraries' collections. The court concluded that "Google's unauthorized digitizing of copyright-protected works, creation of a search functionality, and display of snippets from those works are non-infringing fair uses."[1] The application of fair use, and thus the decision in Google's favor, hinged on an element of "transformativeness," discussed in more detail in the fair use section. Most relevant here, the Google Books case established that the use of copyrighted material as input does not necessarily mean the output will constitute an infringement.

Music and Lyrics

Copyright issues around music can become even more complex, as music often involves several distinct rights, even just within copyright law. For example, the creative works fixed in a tangible medium (i.e., the copyrightable works) can include the lyrics, the musical composition, and the sound recording itself. In the music industry, these rights are often owned by different parties, even for the same song or album. (If you followed the Taylor Swift saga, you may have gathered the layered rights at play, e.g., the recorded masters versus the musical compositions and lyrics themselves, which is the reason Taylor was able to re-record and re-release "Taylor's Version" of her entire discography to circumvent contractual limitations of her former label.)

AI-generated music, like other forms of creative expression, has been popping up everywhere. For now, we will set aside issues of copying an artist's voice, and deepfakes in general, which raise issues closer to rights of publicity and likeness; AI-generated covers "performed" by artists both living and seemingly raised from the dead are all over YouTube. These issues will likely be addressed more aggressively by criminal and/or tort law, such as fraud.

In October of 2023, Universal Music (and others, including ABKCO and Concord) sued generative AI company Anthropic for, inter alia, copyright infringement, specifically for the unauthorized use of copyrighted song lyrics in the training of its large language models, Claude and Claude 2.

Here, while the copyright issues may be similar to those raised by the aforementioned creative works, the music industry comes with a different context and a different history: notably, its history of sampling from other works as part of creative expression. In the '90s, however, a series of cases shifted the approach to sampling, which now requires sampling licenses to avoid copyright infringement when sampling another artist's work. It will be interesting to see if this line of thought is followed, perhaps creating a new type of license, at a different price point, for AI training.

In the meantime, campaigns and industry groups are sprouting up to join collective voices in an attempt to influence the future of this aspect of the industry. For example, the Human Artistry Campaign is one such group, outlining principles of AI applications “in support of human creativity and accomplishment” and sponsoring petitions.

Others, however, see the use of an AI system and its process of input and output as just another tool of music production. This comparison will come up throughout the series: as is often the case in the development of new case law and societal norms, practitioners in various areas of IP law attempt to compare and contrast the use of generative AI systems with other, already established products and processes.

Conclusion

All of this litigation, expensive as it may be, is helping to draw attention and bring some very bright minds to engage in creative and convincing argumentation that will help steer the direction of the interactions between generative AI systems and the public. But it is also creating one more obvious challenge: AI model developers are trending toward decreasing transparency about the datasets used to train their models. This is troubling, as data transparency is one of the primary elements required for the implementation of fair governance.[2]

Doctrine of fair use

Many of the broader questions, and much of the litigation, hinge on how far and to what extent we as a society believe copyright protections should extend. In the U.S., the doctrine of fair use is a judicially created legal construct that attempts to answer this question. In the cases above, many of the defendants are claiming it[3] and the plaintiffs are generally arguing against its application.

While fair use in its current form is a U.S.-originating doctrine, concepts that similarly attempt to answer the question of the reach of copyright (for example, in cases of free speech or educational purposes) exist globally. Article 9(2) of the Berne Convention provides a three-step test that has become an international standard for assessing the permissibility of copyright exceptions generally. UK copyright law codifies a list of exceptions, including "text and data mining for non-commercial research and criticism, review and reporting current events," among others. The European Commission's Copyright Directive, transposed by member states in 2021, also includes a variety of exceptions. That said, the exceptions themselves, as well as their interpretation and implementation, vary widely.[4] The U.S. doctrine of fair use likewise allows exceptions to copyright enforcement, intended to prevent copyright from stifling free speech and/or research, though the codified law is open-ended rather than an enumerated list of exceptions.

Thus, we will briefly dig into the current status of the fair use doctrine and its potential application to many of the generative AI cases we are seeing. First, though, a reminder that copyright law carries with it many complexities, and it bears repeating here that the below should not be used as legal advice if you are facing a copyright issue. However, as an exercise to aid understanding as cases and claims evolve, below is a brief analysis of the fair use doctrine, as its principles, if not the doctrine itself, are at the heart of both the litigation and the broader questions we, as societies, are wrestling with in these decisions. The four factors, which courts weigh as a balancing test, are found at §107 of the Copyright Act of 1976[5] and are briefly summarized and considered as follows:

1. The purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes

This is often called the "transformativeness" factor, and it considers predominantly the level of transformation between the original and the newly created work. In generative AI cases, an output may be highly transformative (prompting the system with "a photo of a Matisse cat in space" may produce a cat with the face of Henri Matisse, or a cat with many flourishes in its fur, floating through space), or it may be minimally transformative (prompting the system with "a photo of a pigeon in Rome" may produce a photo that looks as though you could have snapped it on your last visit there). Another relevant consideration in this factor is purpose, namely the purpose of the use of the copyrighted material, which will obviously vary depending on the use case (e.g., generated text could be used by a company for its website, or by a teacher to help introduce a new concept to a 3rd grade class).

As mentioned above, this factor weighed heavily in the Google Books case: Google had not simply scanned portions of the books; Google Books also included a search function and snippet display, which helped the public learn about the books without providing a substitute for them. It will be interesting to see how this case law is applied, if at all, as much of this litigation matures.

Another recent case of significant note in this area was decided in 2023 by the US Supreme Court, Andy Warhol Foundation for the Visual Arts, Inc. v. Goldsmith. The Court ruled against the Warhol Foundation, finding that Andy Warhol's changes were insufficiently transformative. Briefly, photographer Lynn Goldsmith photographed musician Prince in the early '80s. A few years later, Vanity Fair licensed the (unpublished) image for artist Andy Warhol to use as a reference photo for a silkscreen illustration. The license stipulated the photo was to be used only once. Warhol went on to use the image as the basis of his Prince Series, notably without asking or notifying Goldsmith. Condé Nast then used one of the Prince Series images as a cover image in 2016, with no attribution to Goldsmith. The controversial Supreme Court ruling seemed to draw a line in the sand limiting the application of fair use, specifically in the visual arts, when it concluded that Warhol's use lacked "transformativeness," considered together with the fact that both visual pieces were aimed toward similar commercial streams.

2. The nature of the copyrighted work

In some cases this factor weighs whether the work is more factual (e.g., scientific or technical) or more creative. The nature of the work can also include its status as published or unpublished. This factor, too, requires a case-by-case analysis for generative AI matters. One element to note, for example, in the NYT case is the publication status of the copyrighted work (i.e., published or unpublished, where the definition turns essentially on whether or not the work was readily available to the public). Without introducing further sub-factors or elements outside the scope of this piece, it is safe to assume that the use of article text that sat behind a paywall will weigh this factor heavily in favor of the plaintiff, as those works were not readily available to the public.

3. The amount and substantiality of the portion used in relation to the copyrighted work as a whole

This factor looks at the similarity between the original and the newly created work, as well as the relative amount of the work that was "copied." The Google Books case, for example, hinged on this element: the portion of each book displayed by Google Books was small relative to the length of the book. This factor will likely weigh heavily on the output side of the copyright claims. In the Getty case, for instance, the fact that some output images still displayed the Getty watermark could weigh quite heavily against a finding of fair use under this factor.

4. The effect of the use upon the potential market for or value of the copyrighted work

This factor considers the impact of the newly created work on the market for, or value of, the copyrighted work. The principle of unjust enrichment is at play here, namely whether one party receives a benefit from the other without a proper exchange. In many cases, the output of generative AI systems might very well compete with and/or replace the original works: an AI-generated image of, say, a sunset may very well compete with and replace a photograph of one.

In many of the cases touched on above, this is a primary assertion by the authors and artists, namely that the AI-generated work will replace their works in the market, and that the generated works are only possible due to an impermissible copyright infringement. It will certainly be interesting to see how the courts apply this factor.

Conclusion

So while the doctrine of fair use is anything but straightforward in its application, the broader ethic behind the doctrine seems fairly consistent with our understanding of fairness. However, with the advent of generative AI and its requirement for data at scale, new questions are being raised. It will be very interesting to see how countries and cultures choose to adapt their laws, and whether, perhaps too optimistically, we can hope for international harmonization of copyright laws around matters of generative AI, even if it requires the creation of a sui generis property right. For now, looking at the matter globally, and noting the vast divergence in what is or is not considered outside the reach of copyright law, as well as the diversity of views even within the US, one may conclude that "today, fair use is not just copyright policy; it is cultural policy, freedom of expression policy, and technology policy."[6]

 

[1] Additionally, "The purpose of the copying is highly transformative, the public display of the text is limited and the revelations do not provide a significant market substitute for the protected aspects of the originals." Authors Guild v. Google, Inc., No. 13-4829-cv (2d Cir. 2015).

Accessed here: https://law.justia.com/cases/federal/appellate-courts/ca2/13-4829/13-4829-2015-10-16.html

[2] For more on transparency and current trends, see Training Data Transparency in AI: Tools, Trends, and Policy Recommendations, https://huggingface.co/blog/yjernite/data-transparency, accessed 15 April 2024.

[3] For example, “We believe that the training of AI models qualifies as a fair use, falling squarely in line with established precedents recognizing that the use of copyrighted materials by technology innovators in transformative ways is entirely consistent with copyright law,” OpenAI wrote in a filing to the U.S. Copyright Office.

[4] While many efforts have been made to harmonize IP laws internationally, copyright law remains one that can still vary significantly depending on the territory.

[5] 17 U.S.C. § 107 (1)-(4)

[6] "Fairness and Fair Use in Generative AI," Fordham Law Review, Vol. 92, p. 1895 (2024).

 

All articles in this series