Recently, there has been extensive debate over whether artificial intelligence (AI) models can be developed without infringing copyright. A prevailing opinion within the tech community holds that training advanced generative models, such as those underlying modern chatbots or image generators, necessarily involves the use of copyrighted material. This practice, common among technology companies and other influential stakeholders, has resulted in numerous lawsuits alleging intellectual property violations.
A key factor placing artificial intelligence at the forefront of public discussion has been the release of ChatGPT, an AI system widely accessible to the general public that can hold coherent conversations and solve complex tasks.
Nevertheless, initiatives have emerged to support the ethical and legal training of such models. For example, a team of researchers recently released Common Corpus, an extensive dataset consisting entirely of public domain text. It is comparable in scale to datasets traditionally used for training advanced generative models and has been made available through Hugging Face, a prominent open-source AI platform.
Despite this promising development, Common Corpus is considerably limited because much of its content is dated. This is a direct consequence of how long copyright terms last. In the United States, for instance, works typically enter the public domain only seventy years after the author’s death, which severely limits the availability of contemporary material.
Separately, the United States Copyright Office recently declared that certain AI-generated images cannot be considered original works created by human authors, which excludes them from copyright protection. Among the office’s primary arguments is that image generators produce unpredictable results that cannot clearly be attributed to a specific human or legal entity.
In conclusion, although many believe that developing advanced AI models without copyrighted material is unfeasible, recent efforts demonstrate that it is possible, at least in part, to develop such models ethically and legally using carefully curated datasets composed exclusively of public domain content.