The Introduction of the First ‘Fairly Trained’ Large Language Model

In a submission to the House of Lords earlier this year, OpenAI stated that it would be impossible to train today’s leading AI models without using copyrighted materials. The claim received considerable coverage online at the time.

This argument is at the core of OpenAI’s public and legal defense of its controversial mass data scraping practices used to train AI models, including the GPT-3.5/4 large language models (LLMs) that power ChatGPT, as well as those of competitors such as Google, Mistral, Meta, Anthropic and Cohere. Critics contend OpenAI should have sought affirmative express consent or paid licensing fees for its use of copyrighted data, but the company maintains its practices are fair transformative use and fall within longstanding norms of the internet, where content has long been scraped to power search engine indexes or provide useful features without widespread opposition. Numerous lawsuits over this issue are ongoing.

But a new model is challenging that assumption – at least, the notion that building useful models requires copyrighted data.

KL3M (Kelvin Legal Large Language Model, pronounced “Clem”) was developed by 273 Ventures, a two-year-old startup co-founded by Daniel Martin Katz, a member of the law faculty at Illinois Institute of Technology who serves as its Chief Strategy Officer (CSO), and legal technology entrepreneur Michael Bommarito, its Chief Executive Officer (CEO). The two previously co-founded LexPredict, an older AI legal startup that was later sold to global law company Elevate.

KL3M was released in late February 2024, and today it became the first LLM to receive “Licensed Model (L) Certification” from Fairly Trained, a non-profit organization founded and led by Ed Newton-Rex. Wired magazine was first to report the news.

Fairly Trained (L) certification is awarded only to companies that can demonstrate, through an application and review process, that their AI model training data was legally obtained or is public domain or open license. Certification costs range from $150 upfront and $500 annually to $500 upfront and $6,000 annually. KL3M met these criteria.

“Today we are very excited to share that the Kelvin Legal Large Language Model (KL3M) has received certification as being Fairly Trained,” Katz wrote on his account on the social network X. KL3M is now the first LLM (in any category) to hold such certification.

Fairly Trained recently published a blog post announcing its certification of KL3M and four other entities: Voicemod, which offers AI speech and singing models; the music companies Infinite Album and Lemonaide; and the AI-powered heavy metal band Frostbite Orckings.

How Was KL3M Trained?

According to Katz, who spoke with VentureBeat by telephone today, 273 Ventures has, since its formation, “painstakingly collected data that would not be considered harmful,” from sources including U.S. government document releases and legal filings, all of which are publicly available.

“Initially we weren’t sure whether such an endeavor [training an AI model] was possible without using massive quantities of copyrighted information,” Katz said. However, “we thought there may be at least some success in certain niches such as legal, financial and regulatory arenas where there exists significant amounts of material without intellectual property rights restrictions.”

Katz noted that not all of these industries offer uniformly public domain documents, and that availability varies significantly by country. In the UK, for instance, some government entities or agencies can exercise Crown Copyright over documents they produce.