
COASE-SANDOR INSTITUTE FOR LAW AND ECONOMICS RESEARCH PAPER NO. 25-03
Judge AI: Assessing Large Language Models in
Judicial Decision-Making
Eric A. Posner & Shivam Saran
THE UNIVERSITY OF CHICAGO LAW SCHOOL
January 2025
Acknowledgment: This working paper series is generously supported by the Coase-Sandor
Institute at the University of Chicago Law School.
Judge AI: Assessing Large Language Models in Judicial Decision-Making
Eric A. Posner and Shivam Saran1
January 17, 2025
Abstract. Can large language models (LLMs) replace human judges? By replicating a prior 2 × 2 factorial experiment conducted on 31 U.S. federal judges, we evaluate the legal reasoning of OpenAI’s GPT-4o. The experiment involves a simulated appeal in an international war crimes case, with two manipulated variables: the degree to which the defendant is sympathetically portrayed and the consistency of the lower court’s decision with precedent. We find that GPT-4o is strongly affected by precedent but not by sympathy—similar to the students who served as subjects in the same experiment, but the opposite of the professional judges, who were influenced by sympathy. We try prompt engineering techniques to spur the LLM to act more like human judges, but with little success. “Judge AI” is a formalist judge, not a human judge.
“I predict that human judges will be around for a while.” – Chief Justice John G. Roberts, Jr. (2025)
Introduction
Will AI ever replace human judges? Chief Justice Roberts suggests not. We explore the question by instructing a popular LLM to decide cases under experimental conditions nearly identical to those in which a group of human judges decided the same cases. We find that the LLM performed differently from the judges: it followed the law more faithfully than they did. But does that make the LLM a better judge—or a worse one? The LLM’s formalism closely matched that of a group of student subjects who, most would agree, were not qualified to be judges. In this paper, we describe our method and results, and conclude that the answer to our question may depend less on AI’s progress than on jurisprudential questions that have stumped scholars for centuries.