“A boost for AI”: Harvard Law School puts 6.5m cases online

Litigation analytics just got even more interesting in the US. The Library Innovation Lab at the Harvard Law School Library last week officially launched its Caselaw Access Project API, which puts the full corpus of published US case law online to help developers build tools to analyse it. Between 2013 and 2018, the Library, working in partnership with Ravel Law (acquired by LexisNexis in 2017), digitised over 40 million pages of US court decisions, transforming them into a data set covering 6.5 million individual cases, going back to 1658. Crucially, these records have been made available in a machine-readable format for anyone to use for free.
“Having access to a database that someone else has built and having the entirety of case law available to download for processing with your own programmes is a big change,” Jack Cushman, lecturer and senior developer on the Caselaw Access Project, told Legal IT Insider. “We are not creating another LexisNexis database. We are creating a data set that people can use to find new ways to explore case law.”
For LexisNexis, which through Ravel and Lex Machina is leveraging its own database to provide in-depth analytics, this represents significant competition.
The project could prove vital for the development of artificial intelligence for legal applications, which has been held back by a lack of access to data with which to train software, although companies such as Casetext, ROSS Intelligence and Lex Machina have built their own databases by scraping PACER, the electronic public access service for United States federal court documents.
The site has launched with four tongue-in-cheek example applications, including Caselaw Limericks, which randomly generates limericks from the case law, and, neatly timed for the end-of-October launch, a Witchcraft in Law app that totals cases citing witchcraft by state. But it is now up to the legal and legal-tech industry to see how it can capitalise on this resource to push the boundaries of the insight that can be drawn from case law.
“Some of the applications we have developed are useful. Some are fanciful,” said Cushman. “But the point is that case law can now be treated as data and analysed by computers. We have 6.5 million cases in our collection. No lawyer has ever been able to read even one per cent of that. But if you write a computer programme it can process across the entire data set in a matter of hours, providing new levels of insight from analysing case law. This could provide a real boost for the use of AI in the legal industry and we are excited to see what people will come up with.”
Cushman added that the project has provided inspiration for others looking to digitise case law in their own countries, suggesting that the available case law data set could ultimately become global. “We are definitely being asked by people elsewhere in the world about the process so that it can be replicated,” he said. The Library Innovation Lab’s next step will be creating a case law browser to help people explore the data. “But this is really about other people picking up what we have done and building tools that can work for them.”
Currently, litigation analytics capability in the UK is hampered by the fact that only a small percentage of case law is freely and publicly available. Vizlegal is making progress in its ambitious bid to create a global repository of indexed and structured legal judgments and filings, with UK Upper Tribunal (Tax & Chancery) decisions now converted from PDFs to searchable and easy-to-read HTML. However, in a post for Legal IT Insider, Vizlegal founder Gavin Sheridan said: “Unfortunately for us, and indeed for the public, enormous amounts of what arguably is ‘public’ information is published in poorly formatted documents that are difficult or impossible to search or parse properly. Or worse – it’s simply not available at all.”
By Amy Carroll