Clifford Chance’s data science lab has published what we understand to be the first scientific paper on data science for the legal sector. The paper looks at the supervised classification of long legal documents, the limits that current models impose on the length of the input text, and how to improve results.
The paper is authored by head of data science Mirko Bernardoni, data scientist George Papageorgiou, senior machine learning engineer consultant Michael Seddon, and data scientist Lulu Wan.
Speaking to Legal IT Insider, Bernardoni said: “When you do a LIBOR exercise or Brexit or banking you always end up as a lawyer working with hundreds of thousands of documents. Often, it’s a nightmare. You end up with so many different files. For example, in a syndicated loan it might be running for 10 years and you might have 10 different files. Probably the first problem you’re trying to solve is having the relevant documents and files grouped so that you can use machine learning tools for entity extraction.
“In this paper we focus on that initial phase: I have a huge number of documents; I have no idea where they are; what I want to identify is a syndicated loan, which is a different type of loan. I’m not going to look in the file name but within the content and we need to categorise that.”
Specifically, the paper shows that dividing the text into two segments and combining the results with a BiLSTM architecture to form a single document embedding can improve results.
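As a rough illustration of the segmentation step only, the sketch below splits a document into contiguous, near-equal segments. The function name `split_into_segments`, the whitespace tokenisation and the sample sentence are assumptions made for this example and are not taken from the paper.

```python
from typing import List


def split_into_segments(text: str, n_segments: int = 2) -> List[str]:
    """Split text into n_segments contiguous runs of whitespace-separated tokens.

    Hypothetical helper for illustration; the paper's own preprocessing may differ.
    """
    if n_segments < 1:
        raise ValueError("n_segments must be at least 1")
    tokens = text.split()
    # Segments differ in length by at most one token.
    base, extra = divmod(len(tokens), n_segments)
    segments = []
    start = 0
    for i in range(n_segments):
        # The first `extra` segments take one additional token each.
        end = start + base + (1 if i < extra else 0)
        segments.append(" ".join(tokens[start:end]))
        start = end
    return segments


# Made-up sample text standing in for a long legal document.
document = "This facility agreement is made between the borrower and the lenders"
first_half, second_half = split_into_segments(document, n_segments=2)
```

In the approach the article describes, each segment would then be embedded separately and the sequence of segment embeddings combined by a BiLSTM into a single document embedding. Those steps need a trained model and are left out of this sketch.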
Scientific papers are vetted and must either present entirely new information or offer new solutions to existing problems. Bernardoni said: “We decided that it would be nice giving something back with a little more open concept and of course what you gain is a little bit of authority and it demonstrates when you talk about Clifford Chance that we know what we’re doing.” In an observation that will resonate with many across the legal industry, he added: “One thing we notice in the AI world is that there are a lot of companies that really fuel the rumours that they are doing the ‘real stuff.’ Often they have a nice UI but there is nothing behind the scenes.”
You can read the paper here: https://arxiv.org/abs/1912.06905