
OpenAI claimed in 2023 that it would be impossible to train its models without using copyrighted material, then turned that unsubstantiated claim into a fair use argument.
Well, they were lying.
In March 2024, Fairly Trained certified its first large language model, KL3M, which was trained without using copyrighted material.
This month, EleutherAI released a training dataset of licensed and open-domain text, along with two AI models trained using that data:
EleutherAI, an AI research organization, has released what it claims is one of the largest collections of licensed and open-domain text for training AI models.
The dataset, called the Common Pile v0.1, took around two years to complete in collaboration with AI startups Poolside, Hugging Face, and others, along with several academic institutions. Weighing in at 8 terabytes in size, the Common Pile v0.1 was used to train two new AI models from EleutherAI, Comma v0.1-1T and Comma v0.1-2T, that EleutherAI claims perform on par with models developed using unlicensed, copyrighted data.
The performance is good considering how little of the data was used, and it should improve once models are trained on the full set:
According to EleutherAI, the models, both of which are 7 billion parameters in size and were trained on only a fraction of the Common Pile v0.1, rival models like Meta’s first Llama AI model on benchmarks for coding, image understanding, and math.
Now that the lie is exposed and the fair use angle has been shown to be a non-argument, the genAI firms have no more excuses. A reckoning is at hand…