273 Ventures – the legal technology product studio and consultancy co-founded in 2022 by Daniel Katz and Michael Bommarito (see below for the full founders list) – has launched a dataset containing over 150B tokens of foundational legal, regulatory, and financial text that can be used to build or customize compliant AI models, we can reveal.
Kelvin Legal DataPack’s mixture of curated legal data sources can be used by organizations to train embedding models or as part of training or tuning extractive or generative models. It is the first large-scale legal dataset with clear provenance and commercial use rights. It also includes enrichment and annotation to support a wide variety of use cases.
Bommarito and Katz – who famously tested successive GPT models on the bar exam – envisage that law firms will be able to use Kelvin and their own data to build their own large language models.
The launch comes at a time when large tech companies such as OpenAI are under fire for allegedly violating IP and copyright laws in scraping the internet to train their LLMs. OpenAI is facing lawsuits from the likes of comedian Sarah Silverman, who is suing the company as part of a class action, claiming that it breached copyright law by ingesting her 2010 book.
Speaking to Legal IT Insider, Bommarito, who is a professor of law at Michigan State University College of Law and CEO of 273, said: “One of the things that firms need to know for sure is that the training data isn’t going to bite them. What if Sarah Silverman’s case succeeds? In creating this dataset, we said there can be no breach of contract or IP in using the data.”
The dataset includes only relevant material from the likes of EDGAR and PACER. 273 Ventures’ CSO Katz, who previously co-founded LexPredict and is a professor of law at Illinois Tech – Chicago Kent College of Law, said: “It is much better and more focused and relevant than a data set trained on the likes of Reddit.”
Katz and Bommarito have been working on the dataset in stealth mode for a year. It comes to market five months after the launch of BloombergGPT, a new large-scale generative AI model that has been trained on a wide range of financial data to support natural language processing (NLP) tasks within the financial industry.
Katz said: “Kelvin can’t write you a sonnet like GPT-4, but it has had a law diet and it will soon approach the scale of something like BloombergGPT.”
The Kelvin Legal DataPack also contains several separately available collections, including the Kelvin Contract DataPack with nearly 20B tokens. This Contract DataPack is designed to support common customer use cases, such as the development of playbook automation and market comparison analytics.
One firm that has already been given a demo of Kelvin Legal DataPack is UK top 50 law firm Travers Smith, where chief technology officer Oliver Bethell told Legal IT Insider: “What Dan and Mike have done is super exciting and we are interested in being one of the first to take the datapack and either fine-tune existing models or create our own model from the ground up.”
The other co-founders are CRO Jill Bommarito, a previous co-founder of LexPredict, and VP of training Jessica Mefford Katz, who holds a Ph.D. in Analytic Philosophy. You can read more about the background to the Kelvin Legal DataPack from Jill Bommarito here: https://kelvin.legal/why-kelvin-legal-datapack/