In previous years, the absolute favourite topic of my Postgraduate Law class on Online Dispute Resolution was the fascinating realm of new blockchain-based platforms, which always captured the attention and imagination of students. The allure of vast wealth in cryptocurrencies during the bull market, combined with intellectual possibilities like adapting the existing legal framework for international commercial arbitration to accommodate newcomers such as Kleros and Aragon, made it an exciting topic. Students eagerly delved into recent cases; some even considered purchasing tokens from these protocols to experience the system first-hand.
On the other hand, our exploration of Artificial Intelligence (AI) in Online Dispute Resolution seemed a bit lacklustre. While some students were aware of the COMPAS system’s racial biases in predicting recidivism rates in Florida’s Broward County criminal justice system and the scandal surrounding Amazon’s biased hiring algorithm, these issues seemed disconnected from civil disputes. Furthermore, few had encountered AI systems in action or personally felt their impact. Cultural references to sci-fi novels by William Gibson and Robert A. Heinlein also somehow failed to pique interest in the subject.
Everything changed in the Spring of 2023 when I belatedly discovered ChatGPT. Sharing my findings with the class and projecting it on the big screen during a tutorial, I realised my students were already well-versed in its capabilities. Together, we engaged with the free version of this popular AI tool, exploring various prompts and discussing the outputs.
For instance, we tasked ChatGPT with writing a short essay assigned to the same cohort only weeks prior. Although not brilliant, the resulting essay proved acceptable. We witnessed the AI’s refusal to write a poem admiring Donald Trump while readily composing one for Joe Biden (suggesting a potential political leaning). We posed simple legal problems, such as non-payment for goods or failure to provide prepaid services. Superficially, the AI’s responses seemed sensible, but the legal professionals in the room quickly recognised its disregard for consumer protection laws, often favouring the offending party over the injured one. Lastly, we asked ChatGPT to resolve the Shamima Begum case, and it comfortably aligned with the UK Supreme Court’s judgment. This practical exploration surpassed our prior theoretical discussions on hypothetical algorithms and their capabilities.
The lesson was clear: the future was already upon us, staring us in the face. How long would it take conservative Britain to introduce the first AI judge, starting perhaps with small claims? With the persistent pressures on the civil justice systems of England and Wales, and of Scotland, such a development may be imminent. After all, the world has already seen a report of a judge in Colombia using ChatGPT to issue a court ruling.
However, before taking such a radical step, it would be prudent to thoroughly examine the inner workings of ChatGPT and similar Large Language Models (LLMs). How does the algorithm generate its textual outputs? What transpires in the digital space between a prompt and a response?
The best explanation I’ve found thus far, accessible and relatively non-technical, is Stephen Wolfram’s acclaimed article, “What is ChatGPT doing… and Why Does It Work?” published in February 2023. I highly recommend reading this article to anyone interested in the topic. I will summarise some of its salient points below, acknowledging any inaccuracies as my own.
Essentially, ChatGPT and other LLMs do not “think” or “analyse” in the conventional human sense. Instead, they construct sentences word by word. For example, if an LLM needs to complete the sentence, “The best country in the world is…,” it does not consult opinion polls, geography books, or happiness indices. Instead, it draws on the statistical patterns learned from its training dataset to estimate how frequently each candidate next word appears in similar sentences. If the training dataset represents the entire Internet, the LLM is in effect evaluating how often the next word appears in such sentences on the Internet. In a hypothetical scenario, it might discover that “the UK” is mentioned in 5% of cases, “the US” in 6% of cases, and “Italy” leads the chart with 7%. Conversely, a country like Tuvalu might have a significantly lower index, such as 0.001%.
In this example, I made up the numbers to illustrate the principle. Please disregard the specific figures and focus on the underlying concept. When I actually attempted to have ChatGPT select “the best country in the world”, it declined, citing “individual perspectives and priorities” as an excuse. However, with some persuasion, I managed to coax it into providing a list of the top 10 best countries in the world, where none of the countries in my hypothetical example appeared.
Returning to the original example, how does an LLM determine the word to conclude the sentence if no policy layers or controls restrict its choice? According to Wolfram, the LLM does not always opt for the most frequent word, which would produce bland and identical outputs for every user. Instead, it occasionally selects lower-probability words, governed by a parameter known as “temperature”. This approach adds creative flair and ensures each user receives a unique response. With the appropriate “temperature”, Italy, the UK, and the US will frequently feature in different users’ outputs. On the other hand, Tuvalu is unlikely to appear at the end of the sentence unless an exceedingly unusual “temperature” is set; even then, the probability remains small.
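The mechanism described above can be sketched in a few lines of code. This is a toy illustration only: the frequencies are the invented figures from my hypothetical example, and real LLMs apply temperature to scores over tens of thousands of tokens, not four countries. The effect, however, is the same: raising each probability to the power of 1/temperature sharpens or flattens the distribution before a word is drawn.

```python
import math
import random

# Invented next-word frequencies from the article's hypothetical
# "The best country in the world is..." example; not real statistics.
frequencies = {
    "Italy": 0.07,
    "the US": 0.06,
    "the UK": 0.05,
    "Tuvalu": 0.00001,
}

def sample_next_word(freqs, temperature=1.0):
    """Draw the next word at random, after reshaping the distribution
    with 'temperature': low values favour the most frequent word,
    high values give rarer words more of a chance."""
    # f ** (1 / T): sharpen (T < 1) or flatten (T > 1) the distribution.
    weights = {w: math.exp(math.log(f) / temperature) for w, f in freqs.items()}
    total = sum(weights.values())
    r = random.random() * total
    for word, weight in weights.items():
        r -= weight
        if r <= 0:
            return word
    return word  # fallback for floating-point rounding

# Low temperature: Italy dominates; high temperature: Tuvalu becomes
# merely unlikely rather than vanishingly rare.
print(sample_next_word(frequencies, temperature=0.5))
print(sample_next_word(frequencies, temperature=2.0))
```

Even at a high temperature, Tuvalu remains a long shot; it simply stops being impossible, which matches Wolfram’s account of why two users rarely receive the same answer.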
That’s the process: adding one word at a time and repeating until the response is complete. Does truth ever come into this process? Not at all. An amusing Reddit discussion highlights ChatGPT’s claims of authorship for various passages it never actually wrote. Does the LLM fact-check its outputs? Not in the least, hence the problem of “AI hallucinations.”
Presumably, a technological solution will emerge to address these concerns. Building something like “TruthGPT,” a layer within an LLM that cross-references factual correctness against a well-defined database, seems feasible to my non-technical eye. Such a layer would disallow incorrect statements, prompting the AI to regenerate the response. Should TruthGPT become a reality, a significant objection to AI acting as a judge would be assuaged. Although the problem is serious, it appears surmountable.
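To make the “TruthGPT” idea concrete, here is a purely hypothetical sketch of such a verification layer: draft an answer, check it against a trusted database of facts, and regenerate until the draft passes or attempts run out. Everything here is invented for illustration; `generate`, `verify`, and the fact database are stand-ins, not real APIs.

```python
# Hypothetical trusted database the verification layer consults.
FACT_DATABASE = {
    "capital of France": "Paris",
}

def generate(prompt, attempt):
    """Stand-in for an LLM call; returns a different draft each attempt."""
    drafts = [
        "The capital of France is Lyon.",
        "The capital of France is Paris.",
    ]
    return drafts[min(attempt, len(drafts) - 1)]

def verify(draft):
    """Naive check: any known fact the draft touches must appear correctly."""
    for topic, truth in FACT_DATABASE.items():
        subject = topic.split(" of ")[-1]  # e.g. "France"
        if subject in draft and truth not in draft:
            return False
    return True

def truth_gpt(prompt, max_attempts=5):
    """Regenerate until a draft survives the fact-check."""
    for attempt in range(max_attempts):
        draft = generate(prompt, attempt)
        if verify(draft):
            return draft
    return "No verified answer could be produced."

print(truth_gpt("What is the capital of France?"))
```

A real system would of course need far subtler checking than string matching, and deciding what belongs in the trusted database is itself a contested legal and political question.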
But there is also another potential concern regarding the operation of an LLM in the context of digital justice. All AI responses, in essence, answer one question: what people on the Internet might say about X. In other words, we cannot expect our AI judge, if built upon an LLM, to provide brilliant and unique legal analysis. Instead, it will furnish a mediocre approximation of what other people have previously expressed. Do not anticipate imaginative leaps or remarkable insights from our hypothetical AI judge, “their Lordship Chief Justice, AI.” Should we ever succeed in obtaining legal decisions from an LLM, those decisions will be… average.
Somewhat unexpectedly, this strongly resembles the voting mechanism utilised by successful blockchain-based dispute resolution protocols. Take Kleros, for instance, which involves selecting anonymous jurors from holders of Kleros blockchain tokens. The system incentivises jurors to reach unanimity on the case submitted electronically. Jurors who align with the majority receive extra tokens, while dissenting jurors lose some of theirs. This approach draws on the work of game theorist Thomas Schelling and his concept of Focal Points, the solutions around which people tend to coordinate in the absence of communication. Each juror ponders, “What might the other jurors (whom I never met) say about this case?” That choice determines whether the juror gains or loses money. The winning solution will not be the fairest, most insightful or brilliant – it will be the most average one.
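The payout rule described above can be sketched as a small simulation. This is not Kleros’s actual code, and the stake size is invented; it simply shows the incentive the text describes: jurors who side with the majority split the stakes forfeited by dissenters, so the rational juror votes for the answer she expects others to give.

```python
STAKE = 10  # tokens each juror puts at risk; figure invented for illustration

def settle_round(votes):
    """votes: dict of juror name -> chosen outcome.
    Returns dict of juror name -> token gain (+) or loss (-)."""
    # Count the votes for each outcome and find the majority choice.
    tally = {}
    for choice in votes.values():
        tally[choice] = tally.get(choice, 0) + 1
    majority = max(tally, key=tally.get)

    winners = [j for j, v in votes.items() if v == majority]
    losers = [j for j, v in votes.items() if v != majority]

    # Dissenters forfeit their stakes; the pot is split among the majority.
    pot = STAKE * len(losers)
    reward = pot / len(winners) if winners else 0

    return {j: (reward if votes[j] == majority else -STAKE) for j in votes}

votes = {"A": "claimant", "B": "claimant", "C": "respondent"}
print(settle_round(votes))  # C forfeits the stake; A and B split it
```

Nothing in the rule rewards being right; it rewards predicting the crowd, which is precisely why the outcome gravitates towards the Focal Point rather than the most insightful analysis.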
Thus, a parallel emerges between the voting mechanism in Kleros and the hypothetical AI judge built on an LLM. Both technological innovations in Dispute Resolution transport the parties and the judges (or arbitrators) to the Land of the Mediocre.
Is that necessarily a bad thing? The approach is not entirely foreign to the legal realm. For example, the standard of a reasonable man (“a man… guided upon those considerations which ordinarily regulate the conduct of human affairs”) has long been a central concept in tort law and delict. Similarly, commercial law features the standard of reasonableness, which, according to one commentary, describes a “fictional businessman possessing and exercising those qualities of attention, knowledge, intelligence, and judgment that international business requires of its members for the protection of its own interests and the interests of others.” There is a line between ‘average’ and ‘reasonable’, but this line might be rather thin.
The literature on judicial decision-making, likewise, suggests that the judiciary is, to some extent, a representative institution, and human judges are influenced by the majority or public opinion. The advent of new technology will introduce some fresh dimensions to this debate.
Whether we like it or not, the question will soon no longer be whether “judges”, at least at the lower levels of the Dispute Resolution hierarchy, rely on public opinion, but rather what constitutes suitable public opinion and what safeguards should be in place to counteract any excesses stemming from this approach.