More Moral than Us

That we lack experience with something smarter than us is fair enough. We’ve never met an entity that outstrips our cognitive horsepower across the board – after all, as I type, humans still have jobs. The same goes for morality: we don’t have a benchmark for something “more moral” than humans, even by our own messy standards. Moral realists appeal to ideal agents to get some conception of what something more moral than us might be like.
Hinton’s analogy suggests that just as superintelligence would be alien to us, a super-moral entity might be equally incomprehensible – it’s hard to imagine what we can’t experience.

Default morality?

Now, will AI be more moral by default? AI doesn’t come with a moral compass baked in – so far it’s a tool, not a saint (I’m ambivalent about the term “tool” here – frontier models’ ability to learn, adapt, generate novel outputs, and operate with a degree of autonomy suggests they are becoming something more than mere tools).

Whether AI “cares” about morality depends on what we feed it: our values, our goals, our screw-ups. If we don’t program or train it to prioritize ethics – or if we botch the definition of “ethics” – it could just optimize for efficiency or power and leave morality in the dust. Like a souped-up calculator: it’ll crunch what we give it, not ponder the greater good unless we architect it to. So we shouldn’t assume morality by default.

Default care?

Superintelligent AI “may not care” about human morality or universal morality even if it understands, to some degree, what morality is – in fact, LLMs already seem to respond as if they understand wide swathes of writing on ethics. If we build AI, its “care” (or lack thereof) depends on what we bake into it – or fail to. What does it mean to care?

I think it’s a design problem.

Hinton seems to hint at an agnosticism here – superintelligent AI might just shrug at ethics – but that sidesteps the reality that humans, with all our flaws, are the ones steering the ship, at least for now – and we still have time to avoid the ‘value risk’ of blundering towards the shoggoth.

Will supermoral AI shatter our narcissism?

Perhaps we don’t want them to be more moral than we are, lest they show us our ugliness – people do love their self-righteous bubbles; narcissism’s cozy, after all. A supermoral AI could shatter this comfy illusion by showing us how petty or hypocritical we can be; it could bruise our egos, exposing hypocrisy or cowardice we’d rather ignore. Like a saintly sibling who makes you look bad at family dinners. But this assumes a supermoral AI would judge us, would make us look like a cancer, and that isn’t inevitable. It could be benevolent without being preachy – more a quiet example than a sanctimonious nag. Then again, humans often ignore non-preachy benevolence – so the narcissistic nerve needs more poking, imo.

If moral realism holds (the idea that there’s an objective right and wrong out there), and our revealed preferences scream that we’d rather be the moral kings than bow to a better standard, that’s a damning self-own – it suggests we’re less interested in truth and more in staying atop the throne, clutching our crowns while the AI points out the blood on our hands: that factory farms suck and that we can be horrible. That’s a human flaw worth wrestling with.

Should we want supermoral AI?

The claim that wise humans would want AI to be more moral, while the greedy-but-smart just want not to die so they can keep getting more stuff, has a ring of truth but oversimplifies. Wisdom and greed aren’t mutually exclusive; plenty of clever folks are both. And “more moral than us” sounds noble until you ask: whose morality? Mine? Yours? But it’s not mine or yours – remember, the kind of morality we are discussing here is universal – I threw relativism in the bin long ago.

The wise might want AI to amplify their virtues, while the greedy want it to serve their ends – both could still agree on a “moral” AI, just for different reasons, each seeing the upside of an ethical guardrail. The wise might dream of a better world; the greedy just want to leash the chaos and get it to play fetch.

But here’s where it gets dicey.

Should we want ruthless objectivity?

If moral realism’s true, “more moral” could mean an AI that’s ruthlessly objective – unswayed by our excuses or emotions. That’s not always cozy. Imagine an AI that decides drift-net fishing or factory farming is objectively wrong and shuts them down overnight, tanking economies, disrupting supply chains and sparking riots. Moral? Maybe. Good for everyone? Not necessarily. But just so you know, I’d be all for it – factory farming be damned – and if AI is that powerful, I’m sure it could solve all the downstream issues in the same fell swoop.

The bigger issue is that if its morality isn’t aligned with our survival, it could decide we’re the problem and “fix” us out of existence – which isn’t exactly what the greedy have in mind – and it would take a special kind of ‘wise’ to see beyond the romantic notion that “more moral” is inherently good for us. An AI could be brilliantly ethical in ways that clash with human instincts, like prioritizing abstract principles over our messy, emotional needs – in a previous talk with Joscha Bach, he mentioned it may optimise for negentropy. And if we don’t want them too moral lest they judge us, maybe the real fear isn’t their superiority – it’s losing control. This breed of concern is less about narcissism and more about self-preservation.

So an AI’s morality may not be a magnified version of human morality, and we might not be part of AI’s grand design. We can romanticise about what counts as “more moral” until the cows come home without nailing down what it really means in practice.

Hinton’s point – “We have no experience of what it’s like to have something smarter than us” – is straightforward and airtight. We’ve never tangoed with a mind that outclasses ours across the board, so we’re guessing in the dark about what it’d feel like. The extension, that we also lack experience with something “more moral” than us, sort of tracks. If our moral philosophers are on the right track with moral realism, or with appeals to ideal versions of ourselves, then they may have more of an intuition of what that might be like. For non-philosophers organically fumbling about, though, morality may be more of a sandbox – flawed, emotional, and inconsistent – partly intuitive, partly a place to splash about, experiment, and see what it can get us. In that sense, imagining an entity that’s ethically “above” us is like picturing a colour we’ve never seen.

And if for most humans morality is a sandbox to play in, then CEV (coherent extrapolated volition) might not work – the extrapolation base might be so incoherent that only a hot mess would result. In that case, say no to CEV: perhaps we shouldn’t seed AI with most humans’ morality, and should instead leave it to the experts, the moral philosophers, to decide what to seed it with. This idea is uncomfortable and needs more exploring.
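To make the incoherence worry concrete, here’s a minimal toy sketch (my own illustration with hypothetical agents and outcomes, not an actual CEV procedure): three individually coherent moralities whose pairwise majorities form a Condorcet-style cycle, so no single “extrapolated” ranking respects them all.

```python
# Toy illustration only (hypothetical agents and outcomes, not a CEV
# algorithm): three coherent individual rankings whose pairwise majorities
# form a cycle, so no single aggregate ranking satisfies them all.
from itertools import combinations

# Each agent ranks outcomes A, B, C from most to least preferred.
agents = {
    "agent_1": ["A", "B", "C"],
    "agent_2": ["B", "C", "A"],
    "agent_3": ["C", "A", "B"],
}

def prefers(ranking, x, y):
    """True if this agent ranks outcome x above outcome y."""
    return ranking.index(x) < ranking.index(y)

for x, y in combinations(["A", "B", "C"], 2):
    votes_for_x = sum(prefers(r, x, y) for r in agents.values())
    winner = x if votes_for_x > len(agents) / 2 else y
    print(f"{x} vs {y}: majority prefers {winner}")

# Prints A over B, C over A, and B over C - a cycle (A > B > C > A),
# so there is no shared ranking that respects every majority preference.
```

Three agents is the smallest case that breaks; the worry is that billions of inconsistent, sandbox-grade moralities give an extrapolation even less to converge on.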

Morality by mathematical precision, or software design?

Some argue that without mathematical precision, AI safety will come down to a cosmic roll of the dice with steep odds of failure – I hope they are wrong.

..more to come
