
John Bliss, January 3, 2024
At various points in the past year, it has been widely reported that generative AI systems have aced the bar exam at the 90th percentile, floundered on law school exams, and failed to show utility in real-world legal tasks. Yet, the latest empirical findings push back against each of these claims, highlighting the importance of updating our views as the technology advances and new research findings are released.
The Bar Exam
When researchers found that GPT-4 passed the bar exam, it was initially (and very widely) reported that the model had scored in the 90th percentile.[1] This seemed to suggest a new world where a machine could outperform 9 out of 10 lawyers on an established (if deeply flawed) metric of legal competency.
But Eric Martínez, a PhD candidate at MIT, notes that when this percentile is calculated against first-time test-takers, GPT-4 scored in only the 62nd percentile on the UBE and only the 42nd percentile on essays.[2] There is a vast difference between a world where AI can produce 90th percentile legal writing—perhaps inducing lawyers to defer to AI’s expertise—and a world where it can produce only mediocre and likely very flawed 42nd percentile legal writing.
Still, a 62nd percentile performance on the UBE marks an extraordinary moment in the history of the legal profession. GPT-4 appears to be the first non-human to pass a bar exam. Moreover, the leap from GPT-3.5 (released November 2022), which scored below the 1st percentile, to GPT-4 (released four months later in March 2023), which scored at roughly the 62nd percentile, suggests a remarkable arc of progress in legal AI. It is plausible that new AI applications will soon reach the 90th percentile on the bar exam, or much higher.
Law School Exams
Similar to the bar exam study, the research on law school exams shows dramatic improvement over the past year. GPT-3.5 was barely achieving passing scores. But GPT-4 can earn grades as high as an A- or an A with the best-performing prompts, in which the AI is “grounded” with teaching notes as a reference while taking the exam. With this prompting strategy, GPT-4 scored 100% on some multiple-choice exams and above the median on essays.[3] The same study found that students completed their work significantly faster when aided by AI. It also found that when otherwise low-performing students used GPT-4 on an exam, they jumped an astounding average of 45 to 50 percentile points, from the bottom to the middle of the class.
This research suggests impressive AI capabilities while also revealing limitations. Most strikingly, the students who would otherwise be expected to achieve the highest grades received no benefit, or even scored significantly lower, when using AI on their exams.
When the same researchers ran a study in which students used GPT-4 for real-world legal tasks, they found very similar results: greater efficiency, high quality with the best prompts, and a strong benefit to otherwise low-scoring students.[4] This might suggest that generative AI can speed up lawyers’ work and help improve low-quality legal writing. But the apparent lack of benefit to high-performing students may suggest that this technology falters at more advanced legal tasks (or that these students need more experience discerning when the AI is helping them and when it is not).
Other Exams
GPT-4 also seems to have scored fairly well on the LSAT (at roughly the 88th percentile) and on the MPRE legal ethics exam.[6] On the MPRE, it answered 74% of the questions correctly, exceeding both the average human test-taker, who answers roughly 68% correctly, and the passing score in every jurisdiction (which ranges from 56% to 64%). Still, missing roughly one out of every four legal ethics questions suggests that GPT-4 might not, on its own, be a reliable source of advice on the content of the rules of professional responsibility.
What’s Coming Next?
These exams have never been perfect proxies for legal competence. Moreover, it is plausible that a next-token-predicting AI is very good at “playing the game” of law exams, such that its score says little about this technology’s utility in the real world of legal practice. Nevertheless, continued research on AI’s exam performances can provide a helpful benchmark. As the technology advances, these studies can demonstrate progress over prior AI systems based on relatively clear metrics of quality (e.g., grading rubrics).
[1] Daniel Martin Katz, Michael James Bommarito, Shang Gao & Pablo Arredondo, GPT-4 Passes the Bar Exam (Working Paper, Mar. 15, 2023), https://ssrn.com/abstract=4389233.
[2] Eric Martínez, Re-Evaluating GPT-4’s Bar Exam Performance (Working Paper, May 8, 2023), https://ssrn.com/abstract=4441311.
[3] Jonathan Choi & Daniel Schwarcz, AI Assistance in Legal Analysis: An Empirical Study (Working Paper, Aug. 16, 2023), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4539836 (finding that GPT-4, when prompted well, passed law exams above the median, achieving an A- and an A under the best-performing prompts); Jonathan H. Choi et al., ChatGPT Goes to Law School, 71 J. Legal Educ. 387 (2022) (finding that GPT-3.5 passed law exams with a grade of roughly C+).
[4] Jonathan H. Choi, Amy Monahan & Daniel Schwarcz, Lawyering in the Age of Artificial Intelligence (Working Paper, Nov. 9, 2023), https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4626276.
[5] Craig S. Smith, Google Unveils Gemini, Claiming It’s More Powerful Than OpenAI’s GPT-4, Forbes (Dec. 6, 2023).
[6] See https://openai.com/research/gpt-4; https://www.legalontech.com/resources/generative-ai-passes-the-legal-ethics-exam.