John Bliss, 1/26/26
As reviewed in prior posts, there is a growing body of research on how LLMs perform as law students. Recent studies turn the question around, asking how LLMs can function as graders.
In a recent working paper, a group of law professors had AI grade more than 200 essay exams across Civil Procedure, Contracts, Torts, and Corporations. They found a strong correlation between human and AI grading, especially when the LLM was given a rubric, which raised Pearson correlations from the .66-.80 range to the .78-.93 range. On this evidence, the authors conclude that LLMs “have the capacity to replicate human grading of law school exams with a high degree of accuracy.”
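For readers unfamiliar with the statistic, Pearson correlation measures how closely two graders’ scores move together, on a scale from -1 to 1. Here is a minimal sketch of the kind of comparison the study ran; the scores below are invented for illustration, not data from the paper.

```python
# Hypothetical human and AI scores for the same eight exams
# (invented numbers, for illustration only).
from scipy.stats import pearsonr

human_scores = [88, 92, 75, 81, 95, 70, 84, 89]
ai_scores    = [85, 94, 72, 80, 93, 74, 86, 90]

r, p_value = pearsonr(human_scores, ai_scores)
print(f"Pearson r = {r:.2f}")  # values near 1 mean the graders track closely
```

A correlation in the .78-.93 range, as reported for rubric-guided grading, means the AI’s rank ordering of exams closely tracks the human’s, though individual scores can still differ.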
Another recent (non-law) experiment from NYU’s business school took a different path to a similar conclusion. The instructors used an AI voice agent (ElevenLabs) to administer oral exams, asking students to explain their final projects and demonstrate course knowledge through a case discussion. The transcripts were graded by both the human instructors and a “council” of LLMs (from Anthropic, Google, and OpenAI). According to the lead instructor, the human grades “mostly agreed with the LLM’s grades,” and the AI council outperformed the humans in some respects: it was “more consistent across students,” it “graded more strictly but more fairly,” and its “feedback was better than any human would produce.”
These results should be read as early, tentative evidence in a complex field. But they may point to some practical, limited uses of AI in the grading process. An LLM can serve as a second reader, flagging where the instructor’s scores deviate from their own rubric (a minimal sketch follows below). It can also help generate feedback. For instructors contemplating oral exams as a way to limit student AI use during assessments, the NYU study suggests a way to scale the approach: have the AI administer the exam and the human grade it (perhaps with some AI assistance).
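To make the second-reader idea concrete, here is a minimal sketch assuming the OpenAI Python SDK; the prompt wording, model choice, and five-point disagreement threshold are my illustrative assumptions, not the protocol from either study.

```python
# A sketch of an LLM "second reader" that grades against a rubric and
# flags disagreement with the human grader. The prompt, model name, and
# five-point threshold are assumptions for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def second_read(essay: str, rubric: str, human_score: int) -> str:
    """Ask the model for a rubric-based score and an explanation of any gap."""
    prompt = (
        "Grade the exam essay below against the rubric on a 0-100 scale.\n\n"
        f"Rubric:\n{rubric}\n\nEssay:\n{essay}\n\n"
        f"The human grader gave {human_score}. If your score differs by more "
        "than 5 points, identify the rubric elements where you diverge."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any capable model would do
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```

Note that any such workflow sends student work to a third-party API, which puts the privacy and confidentiality questions discussed below front and center.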
This all raises deep questions about the grading process: explainability, accountability, due process, bias, privacy, confidentiality, fairness. What the early empirical results tell us is that the technology is becoming capable enough that those questions are no longer hypothetical.