John Bliss 2/16/24.
A new study from a group of law faculty at the University of Maryland provides an update on LLM performance on law school exams. The lead author conducted a similar study last spring (2023), finding that GPT-4 scored as high as a B. The new study finds that GPT-4 Turbo scored as high as a B+ on Fall 2023 exams (in civil procedure, torts, and environmental law), although it scored lower in other classes, achieving a mean and median GPA of around 2.7 across the exams. The study thus suggests that GPT-4 Turbo represents an improvement over GPT-4 (pre-Turbo). This is an illuminating finding, but there is a risk that readers of the study (or of headlines and social media posts about the study) will misinterpret it as suggesting a ceiling on current capabilities. Instead, it should be read as an assessment of the LLM’s unassisted first draft.
For example, consider the LLM weaknesses identified by the authors, including that it analyzes issues in a conclusory fashion without an IRAC format and that it cites some outdated legal standards. A sophisticated user can address these weaknesses by requiring the LLM to use IRAC, providing model work product as examples (few-shot prompting), and uploading class notes or other legal materials to prevent reliance on outdated legal standards (grounded prompting). Indeed, these prompting strategies were used by Choi and Schwarcz in their recent study, in which GPT-4 scored as high as an A- and an A on law school exams. Moreover, none of the exam studies have looked at law-specific LLM applications, such as Lexis+ AI and CoCounsel within Westlaw, which might help address these limitations. Students can even create their own custom GPTs tailored to essay writing for law exams.
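To make this concrete, here is a minimal sketch of what few-shot and grounded prompting might look like using the OpenAI Python SDK. The model name, file names, and prompt wording are illustrative assumptions on my part, not details taken from either study:

```python
# A minimal sketch of few-shot + grounded prompting for a law exam answer.
# Model name, file names, and prompts are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Grounding material: current rule statements from class notes (grounded prompting)
class_notes = open("civpro_notes.txt").read()

# Model work product supplied as a prior exchange (few-shot prompting)
example_question = open("sample_question.txt").read()
model_answer = open("sample_model_answer.txt").read()

messages = [
    {"role": "system", "content": (
        "You are answering a law school exam. For every issue you spot, "
        "use strict IRAC structure: Issue, Rule, Application, Conclusion. "
        "State legal rules only as given in the class notes below; do not "
        "rely on potentially outdated standards from your training data.\n\n"
        f"CLASS NOTES:\n{class_notes}"
    )},
    # One demonstration of the desired work product
    {"role": "user", "content": example_question},
    {"role": "assistant", "content": model_answer},
    # The actual exam question
    {"role": "user", "content": open("exam_question.txt").read()},
]

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",  # assumed model name; any GPT-4-class model works
    messages=messages,
    temperature=0.2,  # keep the analysis focused rather than creative
)
print(response.choices[0].message.content)
```

The same pattern carries over to a custom GPT or a law-specific tool: the key moves are supplying a model answer as an example and pinning the rule statements to current course materials rather than to the model's training data.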
When thinking about real-world applications, it is important to note that a sophisticated user would not stop with the LLM’s first draft. They would converse with the LLM through the process of issue spotting, developing arguments and counterarguments, outlining, drafting, and revision. Choi and Schwarcz showed that some, but not all, students in their study used LLMs very effectively to improve their exam performance. We may now have a growing number of students who are highly sophisticated in their use of LLMs and who are producing even better legal AI outputs than we have seen in the research to date (for example, the students currently taking classes that I and many other law faculty around the country are teaching on the use of emerging AI in legal practice).
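For readers who want a sense of what this iterative workflow looks like in practice, here is a minimal sketch of a multi-turn drafting conversation, again using the OpenAI Python SDK with an assumed model name and illustrative prompts and file names:

```python
# A minimal sketch of an iterative issue-spot / outline / draft / revise loop.
# Model name, prompts, and file name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = [{"role": "system",
            "content": "You are assisting a law student in drafting an exam essay."}]

def turn(prompt: str) -> str:
    """Send one user turn and keep the full conversation in the running history."""
    history.append({"role": "user", "content": prompt})
    reply = client.chat.completions.create(
        model="gpt-4-turbo-preview",  # assumed model name
        messages=history,
    ).choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply

fact_pattern = open("exam_question.txt").read()  # illustrative file name
turn("List every legal issue raised by this fact pattern:\n\n" + fact_pattern)
turn("For each issue, develop the strongest argument and counterargument.")
turn("Organize the issues and arguments into an essay outline.")
turn("Draft the full essay in IRAC format, following the outline.")
print(turn("Revise the draft: tighten the analysis and flag any rule statements "
           "that should be checked against current course materials."))
```

Because each turn builds on the full conversation history, the outline reflects the issue list, the draft reflects the outline, and the revision reflects the draft, which is much closer to how a sophisticated user actually works than a single one-shot prompt.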