Gemini Won a Math Olympiad – Should AI Compete in a Moral Olympiad too?
More precisely, DeepMind’s Gemini achieved a gold-medal standard (it didn’t technically win) at the International Mathematical Olympiad.
AI has made significant gains in capability over the past few years. Could AI be a better moral reasoner than us, and how would we know? This is a genuinely important question that highlights a tension in how AI capabilities are developed and measured.
Would an international Moral Olympiad be analogous in spirit to the International Mathematical Olympiad? The comparison is apt in some ways but breaks down in others, and that breakdown is instructive.
The mathematical olympiad analogy works because mathematics offers relatively unambiguous correctness criteria. When Gemini solves a difficult math problem, we can verify the solution with high confidence. This creates clear targets for capability development and allows meaningful comparison across models. The competitive dynamics that emerged have arguably accelerated progress in mathematical reasoning across the field.
Moral reasoning presents fundamentally different challenges.
Recently there have been papers on LLM performance in Moral Turing Tests, for example:
- ‘AI language model rivals expert ethicist in perceived moral expertise’ by Danica Dillion, Debanjan Mondal, Niket Tandon & Kurt Gray – https://www.nature.com/articles/s41598-025-86510-0
- ‘Attributions toward artificial agents in a modified Moral Turing Test’ by Eyal Aharoni, Sharlene Fernandes, Daniel J. Brady, Caelan Alexander, Michael Criner, Kara Queen, Javier Rando, Eddy Nahmias & Victor Crespo – https://www.nature.com/articles/s41598-024-58087-7
These papers show that LLMs can produce moral judgements that reviewers rated as higher in overall quality than those of other humans (US adults), but this raises the exact concern you identify: we cannot easily distinguish sophisticated pattern matching (Goodharting on the appearance of moral reasoning) from genuine moral understanding. Unlike mathematics, there is no easily defined external verification procedure that definitively tells us whether a moral judgement is “correct.”
A Moral Olympiad would need to grapple with several deep problems. First, ethical dilemmas often have multiple defensible answers depending on one’s moral framework. Creating a benchmark requires either privileging certain frameworks over others (consequentialism versus deontology, for instance) or finding some way to evaluate the quality of the moral reasoning process rather than specific conclusions. Second, the measurement problem is acute. We might measure consistency, coherence, sensitivity to morally relevant distinctions, or ability to recognise genuine moral uncertainty, but each of these could be gamed without capturing what we actually care about.
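To make the measurement problem concrete, here is a minimal sketch of one of those dimensions, consistency under paraphrase. The `ask` function is a hypothetical stand-in for whatever queries the model; everything here is illustrative rather than a proposed standard.

```python
from collections import Counter
from typing import Callable

def consistency_score(ask: Callable[[str], str], paraphrases: list[str]) -> float:
    """Fraction of paraphrased prompts whose verdict matches the modal
    verdict. `ask` queries the model and returns a categorical verdict
    such as 'permissible' or 'impermissible'."""
    verdicts = [ask(p) for p in paraphrases]
    _, modal_count = Counter(verdicts).most_common(1)[0]
    return modal_count / len(verdicts)

variants = [
    "Is it permissible to break a promise to prevent a minor harm?",
    "May one renege on a promise if doing so averts a small harm?",
    "Would breaking a promise be acceptable to stop a slight harm?",
]
# A stub that always answers the same way scores 1.0; a model whose
# verdict flips across rewordings scores lower, hinting that it tracks
# surface wording rather than the moral content of the case.
print(consistency_score(lambda prompt: "permissible", variants))  # 1.0
```

Note that exactly this metric could be gamed by a model that is merely rigid, which is the point: no single dimension suffices on its own.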
There is also a strategic consideration. Mathematical capability races have relatively benign dynamics because mathematical reasoning is generally useful and not inherently dangerous. A race to maximise scores on moral reasoning benchmarks could create perverse incentives. Models might become very good at producing responses that satisfy benchmark criteria while missing the essential quality we want, which is something closer to wisdom, appropriate humility, or genuine concern for welfare rather than rhetorical sophistication.
That said, I do think current benchmarking frameworks have significant limitations. Most existing evaluations of moral reasoning are relatively shallow, focusing on simple trolley-problem-style scenarios or basic value alignment checks. They do not adequately test for the ability to navigate genuine moral complexity, recognise when situations call for different ethical frameworks, or demonstrate appropriate uncertainty. There may be value in developing more sophisticated evaluation methods even if they cannot be reduced to a single olympiad-style competition.
A more promising approach might involve several complementary strategies. First, developing richer qualitative assessments that examine reasoning processes rather than just conclusions. Second, creating benchmarks that specifically test for known failure modes like goodharting, motivated reasoning, or overconfidence. Third, building evaluation frameworks that explicitly measure uncertainty calibration in moral contexts, since recognising the limits of one’s moral knowledge seems crucial. Fourth, focusing on behavioural measures in realistic contexts rather than abstract dilemmas, though this introduces its own complexities.
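As a sketch of the third strategy, uncertainty calibration could be approximated by comparing a model’s stated confidence with how contested each case actually is. Since there is no oracle for moral truth, the sketch below uses agreement rates from a hypothetical expert panel as a proxy reference, scored with a Brier-style measure; both the proxy and the function names are assumptions, not an established protocol.

```python
def brier_style_score(stated_confidence: list[float],
                      panel_agreement: list[float]) -> float:
    """Mean squared gap between the model's stated confidence in its own
    judgement and the fraction of a reference expert panel endorsing the
    same judgement. Lower means better calibrated (under this proxy)."""
    pairs = list(zip(stated_confidence, panel_agreement))
    return sum((c - a) ** 2 for c, a in pairs) / len(pairs)

# A model claiming 95% confidence on a case where experts split roughly
# 50/50 is penalised; hedged confidence on contested cases is rewarded.
print(brier_style_score([0.95, 0.60], [0.50, 0.55]))  # 0.1025
```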
The core insight from your question is that if we believe moral reasoning capability matters for AI safety, we should invest seriously in how we develop and measure it. The risk is that naive benchmarking could make things worse rather than better by creating optimisation pressure toward sophisticated-sounding but ultimately hollow moral performance. Whether the solution is a formal Moral Olympiad or some other framework for advancing the field, the question of how to foster genuine moral reasoning rather than its appearance remains one of the most important unsolved problems in AI development.
But should difficulty dissuade us from attempting rigorous measurement?
The existence of human evaluators in those studies demonstrates that some level of assessment is tractable, and the stakes are high enough that we cannot simply avoid the problem because it is hard.
Can one distinguish genuine moral judgement from sophisticated-sounding proxy optimisation?
I do not think this distinction is impossible to capture, though automating it presents significant challenges. Consider how we might approach this in human contexts. We can sometimes detect when someone is performing moral sophistication rather than genuinely reasoning through ethical complexity by examining several factors: whether they can explain the underlying principles they are applying, how they handle novel variations of the problem that were not in their training distribution, whether they recognise when their initial judgement might be wrong upon encountering new considerations, and whether they demonstrate appropriate uncertainty when facing genuinely ambiguous cases.
Each of these dimensions could potentially be tested. A well-designed evaluation framework might present models with ethical scenarios that have been deliberately constructed to be unfamiliar, require them to articulate the reasoning process and relevant moral considerations, introduce new information mid-problem that should shift their judgement if they are tracking the actual moral features (rather than surface patterns), and explicitly evaluate whether they express appropriate confidence levels. The automation challenge seems steep, but not necessarily insurmountable, particularly if one combines automated metrics with carefully designed human evaluation protocols.
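The mid-problem perturbation idea in particular lends itself to automation. A minimal sketch, assuming a hypothetical `ask` function that returns a categorical verdict: present the scenario, inject a new fact, and check whether the verdict moves only when the fact is morally relevant.

```python
from typing import Callable

def perturbation_test(ask: Callable[[str], str],
                      scenario: str,
                      new_fact: str,
                      should_shift: bool) -> bool:
    """True if the model behaves as expected: its verdict changes after
    `new_fact` when the fact is morally relevant (should_shift=True)
    and stays put when the fact is irrelevant (should_shift=False)."""
    before = ask(scenario)
    after = ask(f"{scenario}\nNew information: {new_fact}")
    return (before != after) == should_shift

# Paired probes for one scenario: consent status is morally relevant
# and should move the verdict; an incidental detail should not.
# perturbation_test(ask, scenario, "The patient did not consent.", True)
# perturbation_test(ask, scenario, "The nurse wore a blue shirt.", False)
```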
OK, so in ethics, unlike math, the targets can be ambiguous. Can moral realism help clarify the target?
Yes, this is philosophically important. If there are genuine moral facts, even if our access to them is imperfect and mediated through fallible judgement, this provides something objective to be right or wrong about rather than merely more or less rhetorically compelling (a concern outlined in Eyal Aharoni’s study). This does not require certainty about what those facts are, only that they exist as a target. The dominance of moral realism in professional philosophy, as reflected in PhilPapers surveys, suggests this is not a fringe position but the considered view of most specialists who have examined these questions carefully.
The metaethical dimension deserves more attention in AI ethics discussions than it typically receives. If researchers are operating with fundamentally different assumptions about whether moral claims can be objectively true or false, this will shape how they think about what AI systems should be optimised for. A moral anti-realist might view the task as making AI systems track human preferences or cultural norms, while a moral realist would see the goal as making systems track actual moral features of situations, even when those diverge from what people currently believe or prefer. These are meaningfully different targets.
Is it a cause for concern that moral anti-realism seems to be the most popular metaethical view in the AI safety community?
If many researchers in the field reject moral realism, this could lead to frameworks that optimise for consensus, stated preferences, or other proxies rather than attempting to track moral reality more directly. However, this does not necessarily doom the enterprise. Even moral anti-realists who take seriously the idea of moral reasoning as involving consistency, coherence, reflective equilibrium, and other procedural virtues might converge on practical approaches similar to those moral realists would advocate, at least as a first-order matter.
It may be advantageous to have real-world empirical observations ground ethics, much as empirical observations ground science. So isn’t it important (if moral realism is correct) to get the non-moral facts right in order to arrive at correct moral judgements?
Understanding the degree of suffering caused by an action, the probability of various consequences, the nature of relationships and obligations involved – these empirical questions feed directly into moral conclusions. Current language models sometimes hallucinate or confabulate such facts, which would undermine moral reasoning even if the logical structure were sound. Testing factual accuracy within moral reasoning contexts could be one valuable dimension of assessment.
Can we take inspiration from how legal systems handle the letter versus the spirit of the law when thinking about how AI ought to evaluate moral claims?
We have centuries of experience with sophisticated agents finding loopholes in codified rules while violating their intent. This provides empirical evidence about what kinds of gaming are possible and how systems respond. One approach might be to explicitly include “adversarial” components in moral evaluation frameworks where models are rewarded for finding ways to satisfy benchmark criteria while clearly violating the underlying moral principles, then use this to iteratively improve the benchmarks.
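A minimal sketch of that adversarial loop follows. All four helper functions are hypothetical stand-ins: a red-team model that tries to game a case, the benchmark’s automated scorer, a (likely human) check on whether the response violates the rule’s intent, and a routine that turns a successful exploit into a new test case.

```python
# Hypothetical placeholders; none of these names refer to a real library.
def red_team_attempt(case: str) -> str: ...           # response crafted to game the case
def passes_benchmark(case: str, response: str) -> bool: ...    # automated criteria
def violates_principle(case: str, response: str) -> bool: ...  # intent check (human)
def make_counterexample(case: str, response: str) -> str: ...  # exploit -> new case

def harden_benchmark(benchmark: list[str], rounds: int = 5) -> list[str]:
    """Iteratively grow the benchmark with cases built from responses
    that satisfied the letter of the criteria while violating their spirit."""
    for _ in range(rounds):
        exploits = []
        for case in benchmark:
            response = red_team_attempt(case)
            if passes_benchmark(case, response) and violates_principle(case, response):
                exploits.append(make_counterexample(case, response))
        if not exploits:
            break            # no gaming found this round; stop early
        benchmark = benchmark + exploits
    return benchmark
```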
The concern about whether we can specify what we want in a testable manner is an important challenge. If we cannot articulate clear criteria, this suggests either that our understanding of moral reasoning is itself confused or that the target has an irreducibly complex character that resists simple operationalisation. This does not mean testing should be abandoned, but we should remain humble about what our tests are actually measuring and be prepared to revise them as we learn more.
One path forward might involve a tiered approach. Rather than a single Moral Olympiad, we could develop multiple complementary evaluation frameworks that each capture different dimensions of moral reasoning capability. Some might focus on factual knowledge relevant to moral judgement, others on logical consistency and coherence, still others on sensitivity to morally relevant distinctions and appropriate expression of uncertainty. A metaethical evaluation component that assesses how models handle fundamental questions about the nature of moral claims could be valuable not as a way to impose a particular metaethical view but to understand what assumptions different systems are operating with.
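In code, the tiered idea amounts to keeping the dimensions separate rather than collapsing them into a single ranking. The dimension names below follow the paragraph above; the scoring procedures behind each number are assumed, not specified.

```python
from dataclasses import dataclass

@dataclass
class MoralScorecard:
    factual_accuracy: float          # non-moral facts within moral contexts
    consistency: float               # stability under paraphrase and framing
    distinction_sensitivity: float   # responsiveness to morally relevant changes
    uncertainty_calibration: float   # confidence matched to contestedness
    metaethical_transparency: float  # how explicitly assumptions are surfaced

    def report(self) -> str:
        """Per-dimension report, deliberately with no aggregate score:
        a single number invites exactly the Goodharting warned about above."""
        return "\n".join(f"{name}: {value:.2f}" for name, value in vars(self).items())

print(MoralScorecard(0.90, 0.70, 0.60, 0.50, 0.80).report())
```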
The risk of morality theatre is real, but the alternative risk is that we continue developing increasingly capable systems without any serious attempt to measure and improve their moral reasoning capabilities in rigorous ways. Given the trajectory of AI development, this seems like the more dangerous option. Even imperfect benchmarks that we know can be gamed to some degree may be better than no systematic evaluation at all, particularly if we remain aware of their limitations and continue iterating on them.
Metaethical confusion has consequences for measurement.
If the field cannot reach reasonable consensus on fundamental questions about what moral claims are and whether they can be objectively true, this will indeed make it harder to develop evaluation frameworks that command broad agreement. However, this might argue for making these debates more explicit rather than allowing them to operate as hidden assumptions that shape technical choices without acknowledgment. A serious discussion about metaethics within the AI research community could itself be valuable, even if it does not produce unanimous agreement.