Lawyers Replaced in Contract Review?

John Bliss 2/14/2024.

A new study finds that large language models (“LLMs”) perform contract review at near human-level accuracy while dramatically cutting the required time and cost. This could suggest, as the authors conclude, that junior lawyers and legal process outsourcers (“LPOs”) are on the verge of radical disruption and even some degree of replacement. However, the article reports very few statistical details, which casts considerable doubt on the study’s implications.

Key Findings

Regarding accuracy, the researchers found that GPT-4 scored slightly ahead of junior lawyers and LPOs on one metric (determining contract issues) and behind the humans on another metric (pinpointing the location of contract issues). Not surprisingly, the LLM completed this work in much less time than the humans—with GPT-4 conducting the review in under 5 minutes, while senior attorneys took 43 minutes, junior attorneys took 56 minutes, and LPOs took 201 minutes. Furthermore, the authors note that the scalability of AI could yield even greater efficiency, because the AI can review multiple contracts at once. By eliminating this human labor, the authors calculate that the price of contract review drops by 99.97%.
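To put the reported times in perspective, the arithmetic can be sketched as follows. This is an illustrative calculation, not the authors' method: because the study reports only that GPT-4 finished in "under 5 minutes," the sketch assumes 5 minutes as an upper bound, so the resulting speedups are lower bounds. The variable and function names are hypothetical.

```python
# Review times in minutes, as reported in the summary above.
# ASSUMPTION: GPT-4 is given as "under 5 minutes"; 5.0 is used as an upper
# bound, so every speedup computed here is a lower bound.
GPT4_MINUTES_UPPER_BOUND = 5.0

human_minutes = {
    "senior attorneys": 43.0,
    "junior attorneys": 56.0,
    "LPOs": 201.0,
}


def min_speedup(human: float, llm_upper: float = GPT4_MINUTES_UPPER_BOUND) -> float:
    """Lower bound on how many times faster the LLM is than a human reviewer."""
    return human / llm_upper


def cost_ratio_from_reduction(reduction_percent: float) -> float:
    """Fraction of the original cost that remains after a percentage reduction."""
    return 1.0 - reduction_percent / 100.0


speedups = {role: min_speedup(minutes) for role, minutes in human_minutes.items()}

for role, factor in speedups.items():
    print(f"GPT-4 is at least {factor:.1f}x faster than {role}")

remaining = cost_ratio_from_reduction(99.97)
print(f"Remaining cost fraction after a 99.97% reduction: {remaining:.4f}")
```

Under that assumption, GPT-4 is at least 8.6 times faster than senior attorneys, 11.2 times faster than junior attorneys, and 40.2 times faster than LPOs. A 99.97% price reduction leaves about 0.03% of the original cost, or roughly one part in 3,300.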

Methods

The study appears to be well designed, drawing from anonymized real-world contracts and benchmarking the AI’s performance against the work of senior attorneys. The researchers seem to have prompted the AI well, giving it the instructions that one would give a lawyer doing similar work, including “the target audience for the contract, pertinent background information regarding the contracting parties, and the specific scenario under which the contract was being negotiated.”

Limitations

Despite these intriguing findings, the implications of the study are limited by the lack of statistical reporting. The authors provide no detail about the sample size. The findings relating to speed and cost are so stark that they are likely statistically significant. But the accuracy differences between the LLM and the human reviewers are small, and without the sample size or other methodological detail it is difficult to know whether they reflect real differences or chance.
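To illustrate why the missing sample size matters, the sketch below computes a 95% Wilson score confidence interval for the same observed accuracy at several sample sizes. The counts are purely hypothetical and are not taken from the study. They only show how much uncertainty surrounds an accuracy figure when the number of reviewed items is small or unknown.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a binomial proportion.

    z = 1.96 corresponds to an approximately 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p_hat = successes / n
    denom = 1.0 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# HYPOTHETICAL counts: the same 80% observed accuracy at three sample sizes.
for correct, total in [(8, 10), (80, 100), (800, 1000)]:
    low, high = wilson_interval(correct, total)
    print(f"{correct}/{total} correct: 95% interval {low:.3f} to {high:.3f}")
```

The interval narrows considerably as the sample grows. With only a handful of contracts, a small accuracy gap between GPT-4 and junior lawyers could easily fall within the margin of error, which is why the study's accuracy results are hard to interpret without this detail.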

Moreover, as with most other studies of LLM legal capabilities, the authors use general-purpose (“horizontal”) chatbots rather than law-specific (“vertical”) applications, including those built specifically for contract law. It is plausible that such specialized applications would outperform the LLMs tested in this study.

The authors may be right that we are entering an era of “LLM dominance in legal contract review.” But, in order to say this with a high degree of confidence, we would need more detailed empirical analyses as well as observations of how this technology is being adopted in practice.