Gen AI startup Harvey has released a benchmarking framework for quantitatively evaluating the performance of large language models on real-world legal tasks. The framework, it says, supplements prior work that measures LLM legal reasoning in more structured settings.
One of the challenges of evaluating LLMs on the tasks that lawyers do is that those tasks are too complex to be graded by multiple-choice questions or ‘one-size-fits-all’ criteria. Harvey has instead converted time entries into model-based tasks, which are then categorised along relevant taxonomies such as practice area, type of work (transactional, litigation, etc.) and the portion of a matter that the task would account for.
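To make that structure concrete, here is a minimal sketch in Python of how a task derived from a time entry might be represented; the class and field names are illustrative assumptions, not Harvey's actual schema.

```python
from dataclasses import dataclass

# Illustrative only: these field names are assumptions, not Harvey's published schema.
@dataclass
class BenchmarkTask:
    task_id: str
    prompt: str            # instruction derived from an underlying time entry
    practice_area: str     # e.g. "M&A" or "Commercial Litigation"
    work_type: str         # e.g. "transactional" or "litigation"
    matter_portion: float  # rough share of a matter the task represents (0.0-1.0)

example = BenchmarkTask(
    task_id="lit-001",
    prompt="Draft a chronology of key events from the deposition transcript.",
    practice_area="Commercial Litigation",
    work_type="litigation",
    matter_portion=0.05,
)
```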
The following tables outline the major categories and distribution of BigLaw Bench core tasks:
Harvey then developed rubrics to evaluate each task – the objective criteria needed to accomplish a given task. Penalties are applied for incorrect tone, inappropriate length, irrelevant material, toxicity, and hallucinations. The rubrics must capture everything a model must do and everything it must avoid.
To convert rubric results into a benchmark score, negative criteria were assigned negative points and positive criteria positive ones. The overall score is computed by summing the positive and negative points a model earns and dividing by the total number of points available for the task.
Harvey also benchmarks the reliability of a model’s sourcing of its answers, awarding a high score for correct sourcing and a low score where answers lack traceability and validation.
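As a rough illustration of that scoring approach, the sketch below assumes each positive criterion carries equal weight and each triggered penalty deducts one point; the function names and the source-score calculation are assumptions for illustration, not Harvey's published implementation.

```python
def answer_score(positive_criteria: dict[str, bool], penalties: dict[str, bool]) -> float:
    """Positive criteria met add points; triggered penalties (e.g. hallucination,
    wrong tone, irrelevant material) subtract them. The net total is divided by
    the positive points available for the task."""
    available = len(positive_criteria)        # one point per positive criterion (assumed)
    earned = sum(positive_criteria.values())  # criteria the answer satisfies
    deducted = sum(penalties.values())        # one point off per penalty triggered (assumed)
    return (earned - deducted) / available

def source_score(citations_checked: int, citations_verifiable: int) -> float:
    """Rewards answers whose sources can be traced and validated (illustrative)."""
    if citations_checked == 0:
        return 0.0
    return citations_verifiable / citations_checked

# Example: 8 of 10 rubric criteria met, one hallucination penalty -> 0.7
print(answer_score(
    positive_criteria={f"criterion_{i}": i < 8 for i in range(10)},
    penalties={"hallucination": True, "wrong_tone": False},
))
```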
Harvey says its own system outperforms each of the leading foundation models on the benchmark.
Harvey provides an example task from each of the litigation and transactional categories for reference. A more comprehensive description of exemplar tasks sampled from BigLaw Bench, including associated rubrics, can be found here.
This release follows efforts from Stanford University and the UK Legal IT Innovators Group, led by CMS’ head of innovation John Craske, to drive standardisation and benchmarking so that buyers of Gen AI tools can better compare their output.
Commenting on BigLaw Bench from Harvey, Legal IT Insider’s lead analyst Neil Cameron, author of our Gen AI and the Practice of Law Report, said: “As we noted in the LITI Gen AI in law report, law firms badly need an industry-wide, non-partisan Gen AI LLM benchmarking model. This will probably only truly emerge after a period of emergence of competing benchmark models followed by reconciliation, rationalisation and consolidation. So, the first phase of the Gen AI LLM benchmark ‘war’ starts here.
“We can only applaud any well-considered approach towards a methodology that can work towards being adopted as (or as part of) some future industry-standard benchmarking for measuring the usefulness and accuracy of Gen AI legal advice. The Harvey approach established two clear fundamental criteria – time entry comparison and source confirmation. The fundamental importance of source confirmation is surely uncontroversial and vital to any benchmark. The comparative time entry concept is more problematic.
“As anyone who has tried to analyse raw time entry data for useful ‘information’ (whether manually or using GenAI) will attest, it is fraught with its own baked-in subjectivities. While attempts to normalise such data, such as ‘area of law’, ‘work types’, UTBMS and J-Codes, can provide some objective basis for comparison and analysis, these are often treated with slapdash accuracy by careless attorneys, and by more imaginative attorneys as much as a challenge to exploit for commercial purposes, as they are for support and enlightenment. Whilst they may be useful in comparing, as Harvey states, the extent to which “models can contribute to real, high-value legal work”, the degree to which it can be of any additional assistance is then reliant on the level of standardisation of, and compliance to, the “relevant taxonomies” to which Harvey then refers. As useful as that approach may be, we will still need some additional objective criteria for measuring the accuracy and usefulness of the LLM model response. That is the Holy Grail, and for that we eagerly await the next steps.”
You can download Legal IT Insider’s Gen AI report – which compares the output of multiple LLMs – HERE and below.
You can read the key takeaways of the first benchmarking meeting led by Craske HERE and below.
Gen AI and the Practice of Law Report – Download it here now