From B to A+ in Two Years: Charting the Legal-Reasoning Revolution

John Bliss, 7/9/2025

Last month OpenAI’s Sam Altman boasted that his team had “cracked reasoning.” He was referring to advances in “reasoning models” (like o1, o3, and o3-pro), which excel in step-by-step problem solving. These models are trained with “process-supervised reinforcement learning,” meaning they are rewarded for the quality of each reasoning step rather than just the final answer. According to the system cards for o1 and o3, the improvements are dramatic on math, science, and programming benchmarks. But do these gains extend to legal reasoning?
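As a quick aside for the technically curious, the difference between outcome and process supervision can be sketched in a few lines of Python. This is a toy illustration of the reward signal only; the function names, the step labels, and the stand-in verifier are my own, not OpenAI’s actual training pipeline:

```python
from typing import Callable, List

def outcome_supervised_reward(final_answer: str, correct_answer: str) -> float:
    """Reward depends only on whether the final answer is correct."""
    return 1.0 if final_answer == correct_answer else 0.0

def process_supervised_reward(steps: List[str],
                              step_is_sound: Callable[[str], bool]) -> float:
    """Reward credits each reasoning step judged sound by a grader.

    `step_is_sound` stands in for a human or model-based verifier that
    labels individual steps -- the core idea behind process supervision.
    """
    if not steps:
        return 0.0
    return sum(1.0 for s in steps if step_is_sound(s)) / len(steps)

# Illustrative run: partial credit for sound intermediate reasoning,
# even if the final conclusion turns out to be wrong.
steps = [
    "Issue: does the statute cover the plaintiff?",
    "Rule: the statute protects 'employees.'",
    "Application: the plaintiff is an independent contractor.",
]
print(process_supervised_reward(steps, step_is_sound=lambda s: not s.startswith("Application")))
```

The point of the contrast is that a model rewarded step by step has an incentive to keep each link in the chain sound, rather than to gamble on a lucky final answer.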

Early empirical evidence suggests they do, although the precise scale of the improvement is still undetermined. At the University of Maryland, Blair-Stanek et al. have released their third study of AI performance on law exams. Back in spring 2023, they found GPT-4 could score as high as a solid B (mirroring findings from Minnesota with GPT-3.5). The following year, GPT-4 Turbo nudged that up to a B+ as the new high score. Now, these researchers have tested their first reasoning model (o3) on eight final exams in spring 2025. The result was pretty astounding. On three exams, o3 earned an A+, matching the top student in one class and surpassing every student in the other two. The other grades were also strong: A, A-, B+, B+, and B. The authors note that the lone B was largely explained by a key court decision (Loper Bright) that the AI didn’t have access to, as it was too recent to appear in the training data. Moreover, one of the B+ grades could be discounted because the researchers “accidentally did not include the fundamental instructions about what is expected” (namely, that points are awarded for reproducing relevant rules).

These results, sensational as they are, may even understate what these models can do. When the same researchers published last year’s “B+” paper, I stressed that they had only assessed AI in “auto-pilot” mode: the exam questions were fed in once and the AI was given a single response. A tech-savvy student (an “AI Jedi”) can iterate, spot obvious errors, refine prompts, and collaborate on the final product. They can even develop their own AI workflow, as my seminar students did (discussed here), building custom GPTs for legal research, drafting, revision, and feedback while working with multiple AI apps (e.g., some prefer Lexis+AI for legal sources, NotebookLM for digesting large document sets, OpenAI’s reasoning models for developing arguments, and Claude for polished prose).

Yet, it’s also plausible that the Maryland results overstate the leap in legal reasoning. The study is small-scale, and the grading was not blinded (the authors assumed anonymity was impossible because, unlike student writing, “AI prose is often perfect”). Without blinding, graders are susceptible to the “halo” and “horns” effects—positive or negative bias based on how one perceives a student (or, here, the AI). The researchers acknowledge these limits and plan to conduct randomized controlled trials (RCTs) with blind grading, which they believe can be accomplished by instructing the AI to insert human-student-like spelling and grammar errors.

Meanwhile, Dan Schwarcz at Minnesota Law has pioneered RCTs in this space, measuring both raw AI performance and student-AI collaboration on law exams and real-world legal tasks. In their latest study, Schwarcz and colleagues evaluated how students performed on entry-level associate work (such as motions and client emails) when using a reasoning model. The study was conducted before the April 2025 release of the cutting-edge o3, so the authors were confined to a scaled-back preview version of o1. But the finding was stark. Working with the reasoning model boosted not only speed but also quality, with the largest gains in “analytical depth”: arguments grew stronger, and students surfaced counterarguments they had missed when working without AI (the same students also completed tasks in a control condition without AI assistance).

These improvements were inconsistent and some legal citations were hallucinated, but the authors suggest these issues can be mitigated by combining reasoning models with RAG (retrieval-augmented generation, as found in vLex’s Vincent AI, Lexis+AI, and Westlaw’s CoCounsel), which grounds the analysis in actual legal sources. They suggest a multi-tool workflow pairing well-sourced RAG outputs with the most advanced reasoning models (or simply using platforms, like Harvey, that bring these capabilities together).
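To make the RAG idea concrete, here is a minimal sketch of the retrieval step: pull the most relevant source passages first, then build a prompt that instructs the model to answer only from those passages. The keyword-overlap ranking, the toy corpus of paraphrased holdings, and the function names are placeholders of my own, not the internals of Vincent AI, Lexis+AI, CoCounsel, or Harvey:

```python
from typing import Dict, List

def retrieve(query: str, corpus: Dict[str, str], k: int = 2) -> List[str]:
    """Rank sources by naive keyword overlap with the query (a stand-in
    for the vector search a production legal RAG system would use)."""
    def overlap(text: str) -> int:
        return len(set(query.lower().split()) & set(text.lower().split()))
    ranked = sorted(corpus.items(), key=lambda kv: overlap(kv[1]), reverse=True)
    return [f"{name}: {text}" for name, text in ranked[:k]]

def build_prompt(question: str, sources: List[str]) -> str:
    """Ground the model's answer in the retrieved legal text and ask for citations."""
    cited = "\n".join(f"- {s}" for s in sources)
    return (
        "Answer using only the sources below and cite them.\n"
        f"Sources:\n{cited}\n\nQuestion: {question}"
    )

# Toy corpus of paraphrased holdings, for illustration only.
corpus = {
    "Loper Bright (2024)": "Courts must exercise independent judgment when "
                           "interpreting statutes; Chevron deference is overruled.",
    "Chevron (1984)": "Courts defer to reasonable agency readings of ambiguous statutes.",
}

prompt = build_prompt(
    "Does the agency's interpretation receive deference?",
    retrieve("agency deference ambiguous statute", corpus),
)
print(prompt)  # This grounded prompt would then go to a reasoning model.
```

The design payoff is that every citation in the model’s output can be checked against the retrieved passages, which makes hallucinated authorities much easier to catch.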

Taken together, the evidence for a “reasoning revolution” in law is compelling but still provisional. In just two academic cycles, GPT-based models have vaulted from “solid B student” to valedictorian-level on some law exams. These models are still fallible and aren’t ready to ace every test or master every legal task, but the trajectory over the past two years is unmistakably upward. I hope the Maryland and Minnesota scholars continue their year‑over‑year studies, which reveal how steep this upward slope really is—and how far and fast the legal profession may need to adapt.