Real world reflections on Gen AI hallucination and risk

In a year of big advances for legal Gen AI tools, it is nonetheless clear that Stanford University’s controversial paper on hallucination continues to cast a long shadow over product updates and new releases. Hallucination – or, put very simply, making stuff up – is not new to us in this fast-moving post-Gen AI world, but buyers and prospective buyers of new tools are in many cases struggling with how to put the risk in context.

To recap just briefly, Stanford published research in January – Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools – which found that the AI research tools made by LexisNexis (Lexis+ AI) and Thomson Reuters (Westlaw AI-Assisted Research and Ask Practical Law AI) each hallucinate more than 17% of the time.

Heavily rebutted by those vendors and criticised by many for its methodology (Stanford initially used the practical guidance tool Practical Law, not Thomson Reuters’ legal research tool Westlaw, for example), Stanford’s research (including its updated report) is nonetheless an important contribution to an evolving dialogue – one often made difficult by an onslaught of new vendor releases and little in the way of benchmarking or standardised testing.

The problem with the way it has been received is twofold: a) Stanford’s tests related to legal research but are being treated by some as applicable to all Gen AI tools; and b) the results are likely to be interpreted through the existing lens of the receiver – Gen AI haters treat the findings as validation to avoid the technology altogether, whereas hypists latch onto the criticism of what Stanford got wrong. As always, the better answer probably lies somewhere in the common ground.

Appropriate use cases?  

From speaking to people who use and test Gen AI tools on a regular basis, it is clear that one of the practical challenges for organisations is getting users to understand how to use Gen AI tools and what their limitations are. Legal research was always going to be one of the most challenging nuts to crack, although that doesn’t take away from the fact that the progress being made in that area is significant.

At UK law firm Addleshaw Goddard, one of the earliest firms to run proofs of concept around – and adopt – Gen AI tools, partner and innovation group head Kerry Westland told us that there are two reasons the Stanford experiment found such high degrees of hallucination. “The first is that they were asking questions that lawyers probably wouldn’t in practice,” she says. “The other piece is that law is built of things on top of each other. If you have a case based on another five or six, Gen AI tools are not magic and are not currently going to work through that line of cases to give you the exact answer. If you ask obscure questions, it will get it wrong. It’s about understanding the strengths of the tool.”

On the point about asking questions that lawyers wouldn’t ask in practice, we would encourage you to look at Legal IT Insider’s own fascinating research in our new Generative AI and the Practice of Law report, out now.

What concerns Westland most is that Stanford appeared to imply that these tools are poor. She says: “I say they aren’t necessarily that great at legal research yet. We’ve got the tools we use to be quite accurate at contract review but don’t expect 100% in something it’s not good at.

“We see this as part of your toolkit. You could find something that is not a complete answer but part of your research. It’s a good starting point and then you go back and check it. In Lexis and Thomson Reuters you can look at the case itself.”  

Training people on how to use the tools is key, and Addleshaw’s director of innovation Elliot White says: “You still have to put the right input in: there’s a tendency to want to take shortcuts and just trust what comes out, and we continue to emphasise internally that you can’t just trust it. You don’t just chuck all your ingredients in a microwave and expect a great meal.”

At Clifford Chance, which has been one of the firms at the forefront of rolling out Gen AI tools including Microsoft Copilot, training on the firm’s Gen AI policy is mandatory. In an interview with Legal IT Insider, out shortly, we hear about the extensive work that Clifford Chance has done in preparing its Gen AI policy and ensuring adherence to it, working out what the right tools are for the firm, and training supervisors, with CIO Paul Greenwood commenting: “Supervisors need to understand that this type of tool may hallucinate.”

Someone else who agrees that you have to do the work to evaluate appropriate use cases and train your people is Casey Flaherty, co-founder and chief strategy officer at legal technology collective LexFusion, which advises clients on innovation and new tech adoption. Flaherty told Legal IT Insider: “One annoying aspect of Gen AI is that you have to do the work to evaluate the tools and people don’t have time for that. Right now, it shouldn’t be a categorical ‘yes we’re using it,’ or ‘no, we’re not using it.’ Where it’s shown to be additive, use it, whether that is enhancing quality, increasing speed, or bringing down cost. A lot of what we have seen achieves lower-level associate levels of accuracy infinitely faster and less expensive.”  

Flaherty has long been a proponent of better training in the legal sector. In 2014 he co-founded Procertas, a competency-based technology training program to improve lawyers’ use of Word, Excel, PDF and PowerPoint.   

“I’ve been saying this forever, but we’re really bad at training. We can’t even train people on Microsoft Word,” he says.

In Flaherty’s recent in-depth article GenAI in Legal: Progress, Promise, Pain and Peril, he observes that Gen AI conversations seem trapped at the 101 level. Low-information participants occupy a spectrum but seem primarily to fall into three distinct categories (you will recognise two of these from above):

  • Hypists are convinced that GenAI changes everything, everywhere, immediately; we should welcome our new robot overlords and the utopia they will inevitably usher in; resistance is futile. 
  • Haters believe this is just another empty distraction, full of sound and fury, signifying nothing; we should all return to our regularly scheduled programming. 
  • Headlinists know something is in the air and, from the chyrons they’ve skimmed, understand it—whatever “it” is—seems dangerous; we should hit the pause button and retreat to our status-quo panic room until some authority figure lets us know we are completely safe.

Unsurprisingly, none of these are particularly helpful.

Comparing what with what? 

While hallucination presents a particular kind of unfamiliar risk, Flaherty says that one of the prevailing issues underlying how we treat any new technology is the perception that it is competing with human perfection.

“The biggest question that people are not asking is accurate relative to what?” he told Legal IT Insider. “If relative to perfection, that is an impossible standard and not one to which we hold ourselves. We have studies showing the error rate in briefs. Showing the error rate from editors at Bloomberg and Westlaw and LexisNexis when seeing if a case is overturned. There is absolutely no evidence of lawyer infallibility and a mountain of evidence to the contrary.” 

The infallible human rears its head at key milestones, and Flaherty says: “We’ve been having this fight since 2005 in eDiscovery with the superiority of TAR to manual review. Academic papers looked at the myth of the human as the gold standard. Lawyers don’t want to accept it but that’s the only basis on which we can have a conversation.”

The reality we accept in our daily lives is that technology can go wrong, yet we still rely on it without question. Flaherty says: “I rely on Google Maps all the time. I can’t tell you how many times it has led me slightly astray. I was in Turkey and Google Maps was trying to get me to walk through a military base with a sign of a gun. People drive through lakes and there are tragic instances of them driving off a cliff. There is a definite error rate in GPS but we all rely on it and expect people to use caution and judgment. That is going to be true of any technology, including this technology.”

At ILTACON 2024 in Nashville, at a debate hosted by eDiscovery vendor Level Legal, Rob Robinson, managing director of ComplexDiscovery OÜ and a former aviator, made the observation that the use of autopilot in planes goes unchallenged, but that ultimately humans have to understand the readings and data around it and be prepared to take control at any time.

None of this is to say that Gen AI tools should get a free pass. But Flaherty says: “It’s completely valid to test a tool and conclude that it’s not sufficiently performant. But your standard should be based on ‘is it better’ not ‘is it perfect.’”

caroline@legaltechnology.com

 
