John Bliss, 8/28/24
In the spring semester, I led a seminar at Denver Law titled “AI and the Future of the Legal Profession.” For their final projects, three students conducted experiments evaluating AI’s performance on law school exams, with one project extending into the summer. They drew inspiration from research by legal scholars showing that unassisted LLMs can score as high as B+, well-prompted LLMs can score as high as the A range, and some students using these tools can dramatically improve their exam performances (by as much as 45 percentile points). These student projects bring a crucial new perspective to this empirical inquiry: unlike previous research where participants had minimal LLM expertise, these students are AI Jedis who had spent a full semester honing their skills with cutting-edge legal AI tools. Their aim was to explore the ceiling of legal AI capabilities in the context of law school exams, while dissecting the challenges that arise along the way.
These students have graciously permitted me to share their methodologies and findings, as well as their names—I extend my gratitude to Omar Ochoa, Leila Roberts, and Shay Schulz.
Methodology
Each student created custom GPTs (via ChatGPT+) tailored for legal analysis and exam-taking. They provided these GPTs with a range of instructions, including: generating outlines before writing, structuring essays in IRAC form, and demonstrating their reasoning by providing step-by-step analysis. In two studies, the GPTs were “grounded,” meaning they were provided with relevant legal reference materials such as class notes, readings, or outlines. All three students engaged in a process of iterative refinement during the exam, extensively collaborating with the AI rather than simply copying and pasting its initial output.
One student developed a particularly sophisticated workflow involving multiple GPTs designed for different purposes in the exam-writing process: drafting, fact-checking, resolving inconsistencies, identifying errors, and even grading each draft (the grading GPT accurately predicted that the final version of the essay would receive a B+).
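The students built these assistants entirely inside the ChatGPT interface rather than in code, but for readers curious what such a multi-stage workflow amounts to, here is a rough sketch of the same idea expressed with the OpenAI Python API. The model name, prompts, and helper function are my own illustrative assumptions, not the students' actual configurations.

    from openai import OpenAI

    client = OpenAI()  # assumes an OPENAI_API_KEY is set in the environment

    def run_stage(instructions, content, model="gpt-4o"):
        # Each "GPT" is essentially a system prompt defining a role;
        # the user message carries the exam materials and the work so far.
        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": instructions},
                {"role": "user", "content": content},
            ],
        )
        return response.choices[0].message.content

    exam = "..."  # exam question plus any grounding materials (outline, notes, readings)

    draft = run_stage(
        "You are taking a law school exam. Outline first, then answer in IRAC form, "
        "showing your reasoning step by step.",
        exam,
    )
    critique = run_stage(
        "Fact-check this draft against the exam facts; list errors and inconsistencies.",
        exam + "\n\nDRAFT:\n" + draft,
    )
    revised = run_stage(
        "Revise the draft to address the critique, keeping IRAC structure and the word limit.",
        exam + "\n\nDRAFT:\n" + draft + "\n\nCRITIQUE:\n" + critique,
    )
    grade = run_stage(
        "You are a strict law professor. Assign a letter grade to this answer and explain why.",
        exam + "\n\nANSWER:\n" + revised,
    )

In the actual experiments, of course, the students themselves played the routing role, copying material between their custom GPTs and revising iteratively rather than running an automated pipeline.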
Results
The results of these experiments were mixed. Two showed impressive AI performance. On a Civil Procedure exam (for a course the student had taken from a different instructor the year prior), the GPT-assisted essay ranked 4th in a class of 70 students. The student also scored 9/10 on the multiple-choice section, well above the class average of 6.5. Another strong performance came in an Accounting for Lawyers class, where an AI-assisted student achieved a B+ on material entirely unfamiliar to them.
However, the other two experiments revealed setbacks and unexpected declines in student performance. On a Construction Law exam, a student enrolled in the course who had scored an A- without using AI saw their GPT-assisted grade drop to a B-. Similarly, a pair of students who took my spring 2024 1L Property exam (both had taken the course with me the year prior) saw their grades drop from an A- when they were enrolled in the course to a GPT-assisted B a year later.
Interpretation
The grades produced by these student experiments are roughly in the range found in the existing non-Jedi literature. The students noted several challenges. All three emphasized practical hurdles during the exams. In two studies, students reported they had to manually re-type GPT responses because the exam-taking software would not allow pasting. In one case, a student spent the first thirty minutes of a 2.25-hour exam reformatting spreadsheets from the exam prompt (for the accounting class) to make them comprehensible to the AI. Across the board, students struggled with word limits, lengthy LLM outputs, and time management.
Consistent with prior research, the LLMs seemed adept at producing decent but not excellent legal exams. This observation aligns with prior findings that using ChatGPT dramatically helped otherwise low-performing students but tended to hurt the performance of those who would otherwise be at the top of the class. The student researchers also noted that their findings are consistent with prior research indicating that LLMs excel at objective tasks and straightforward rule-based analyses, but often falter at more nuanced legal reasoning.
On the two property exams I graded, some of the AI-assisted writing was truly fantastic. Nearly all of the points lost on each exam stemmed from the same two errors: both essays overlooked a key doctrine (prescriptive easements) and failed to raise several relatively obvious counterarguments. The rest of the work was top-notch. If these two issues had been addressed, I would have given both exams an A, and I believe they would have been competitive for the best in the class.
Conclusion
The AI Jedis may be approaching mastery, but they’re not there yet. While a tailored GPT in the hands of a highly skilled user can produce impressive results at times, it still falls short of consistently yielding top grades. My experience grading the property exams suggests that we may be on the cusp of AI-assisted top-of-the-class exam performances. The next wave of advancements will come not only from technological progress but also from students honing their AI-interaction techniques. While law schools today may house only a handful of experienced AI users, tomorrow’s cohort—composed of current college and high-school students notoriously well-versed in generative AI—could arrive as full-fledged Jedis.
These student studies, while small-scale and not broadly generalizable, suggest new directions for future research, considering not only improvements in legal AI applications but also improvements in students’ proficiency with these tools.