Johanna Schandera and John Bliss 8/25/2024
The debate over generative AI’s reliability in legal practice has focused on “hallucinations”: instances where an LLM produces text containing incorrect legal facts. A study of last year’s general-purpose chatbots found hallucinations in 69 to 88% of legal responses.[1] This was an alarming finding, though the study did not examine the best legal AI tools available (see our earlier post). A new study by the same researchers provides an updated assessment of the AI applications that are going mainstream in the legal profession. They offer the first empirical evaluation of AI-driven legal research tools built on retrieval-augmented generation (RAG), which enhances AI responses by retrieving relevant information from a curated database before generating an answer.[2]
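To make the RAG pattern concrete, here is a minimal Python sketch of the general approach: retrieve the most relevant passages from a curated store, then ground the model’s answer in those passages. This is an illustrative sketch only; the corpus, keyword-overlap scoring, and prompt template below are hypothetical placeholders and do not reflect the pipelines of any of the commercial tools the study evaluates.

```python
# Illustrative sketch of the retrieval-augmented generation (RAG) pattern.
# The corpus, scoring function, and prompt template are placeholders, not
# the implementation used by any tool discussed in the study.

# A toy "curated database" of legal source snippets (placeholder content).
CORPUS = [
    {"id": "case-001", "text": "Summary judgment is appropriate when there is no genuine dispute of material fact."},
    {"id": "stat-042", "text": "A claim must be filed within the limitations period set by statute."},
    {"id": "case-317", "text": "Hearsay is generally inadmissible unless an exception applies."},
]

def retrieve(query: str, k: int = 2) -> list[dict]:
    """Rank documents by naive keyword overlap with the query and return the top k."""
    query_terms = set(query.lower().split())
    scored = [
        (len(query_terms & set(doc["text"].lower().split())), doc)
        for doc in CORPUS
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for score, doc in scored[:k] if score > 0]

def build_prompt(query: str, documents: list[dict]) -> str:
    """Assemble a prompt that grounds the model's answer in the retrieved sources."""
    context = "\n".join(f"[{doc['id']}] {doc['text']}" for doc in documents)
    return (
        "Answer the legal question using ONLY the sources below, citing their IDs.\n"
        f"Sources:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

if __name__ == "__main__":
    question = "When can a court grant summary judgment?"
    docs = retrieve(question)
    prompt = build_prompt(question, docs)
    # In a production tool, this prompt would be sent to an LLM; grounding the
    # generation in retrieved sources is what aims to reduce hallucinations.
    print(prompt)
```

Production systems typically replace the keyword scoring above with semantic (embedding-based) search over a much larger curated database, but the retrieve-then-generate structure is the same.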
Analyzing Lexis+ AI, Thomson Reuters’s Ask Practical Law AI, and Westlaw’s AI-Assisted Research, the researchers found a dramatically reduced hallucination rate, falling in the 17 to 33% range.[3]
While this improvement is significant, the remaining rate of hallucination is still concerning. Claims by legal research companies that they provide “hallucination-free” technology are not supported by this research. Moreover, the study revealed variability across providers: Lexis+ AI performed best, followed by Westlaw’s AI-Assisted Research and then Thomson Reuters’s Ask Practical Law.[4]
Another important contribution of the study was to establish benchmarking standards for evaluating RAG-based legal AI tools. The researchers used open-ended legal queries requiring nuanced analysis, modeled after real-life legal use cases. This open-ended approach contrasts with prior work that used standard question-answer settings to assess LLMs’ legal knowledge (Dahl et al., 2024) and capacity for legal reasoning (Guha et al., 2023).[5]
While the reduced hallucination rates are promising, the errors that remain underscore the continued need for human oversight and verification in AI-assisted legal research. The variability in performance across tools also suggests that legal professionals should carefully evaluate and compare different AI solutions.
As the product landscape expands, empirical studies like this are vital for tracking improvements in AI reliability and refining our methods for assessing this technology.
[1] Isabel Gottlieb & Isaiah Poritz, Popular AI Chatbots Found to Give Error-Ridden Legal Answers, Bloomberg (Jan. 12, 2024), https://shorturl.at/acovG; Matthew Dahl, Varun Magesh, Mirac Suzgun & Daniel E. Ho, Large Legal Fictions: Profiling Legal Hallucinations in Large Language Models, https://arxiv.org/abs/2401.01301.
[2] Matthew Dahl, Daniel E. Ho, Varun Magesh, Christopher D. Manning, Faiz Surani & Mirac Suzgun, Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, https://arxiv.org/abs/2405.20362.
[3] Id.
[4] Id. at 13.
[5] Id. at 7.