Indirect Alignment
Since I’ve discussed ‘indirect alignment’ between AI and humans elsewhere, I thought I’d dedicate a piece of writing specifically to this topic.
It’s important to get AI alignment right because intelligence is powerful, and we are heading towards superintelligence fast.
AI safety hinges not just on what an AI can do, but on what it ought to do, so we should be as careful as humanly possible about the assumptions we use to get “ought” values into play.
Consider an alignment-to-reality approach versus an alignment-to-current-human-preferences approach.
If moral realism (or something close to it) is true, then the alignment-to-reality approach captures the idea well: humans and AI don’t have to align to each other directly – instead they can both align to the same stance-independent truths, and compatibility emerges as a by-product. This is the “indirect alignment” path.
There are substantial challenges associated with aligning AI systems to aggregate human values. It is possible that such alignment (partial or full) may be technically infeasible, morally undesirable, or both. Key obstacles include:
- Incoherency of human values
  - Human values are internally incoherent (within a single human, there are inconsistencies between values)
  - Aggregate human values are externally incoherent (across different cultures, and even within them)
- Human values can be self-defeating
Why aligning to human values as they are could be wrong
Aligning AI to human values exactly as they are risks hard-coding harmful or outdated norms into systems that could persist for millennia. Many human values are maladaptive, cruel, or parochial – such as valuing dominance, prejudice, or short-term gratification – and locking these into AI could entrench them indefinitely. As the world changes, particularly in scenarios involving post-scarcity economies, digital minds, or interstellar expansion, some values that once served us well may become obsolete or actively dangerous. We see precedent for more cautious ethical postures in philosophical traditions from Kantian ethics to utilitarianism and virtue ethics, which share the view that not all human preferences deserve preservation simply because they exist¹.
Moral realism as the common reference frame
Moral realism holds that there are objective, stance-independent truths about what is morally good or bad, truths that exist regardless of what anyone believes. According to this view, certain moral facts are as real as physical facts, even if they are harder to observe directly. The PhilPapers surveys of 2007 and 2020 both show that philosophers lean toward moral realism, with around 56–62% endorsing it², suggesting that many experts see morality as grounded in objective reality rather than mere cultural convention.
If AI is to be designed under this assumption, the goal should not be to simply “install” human values as they are, but to create systems capable of discovering moral facts through reasoning, and for naturalist moral realists this would include empirical engagement with moral features in the world. Such an AI would need to track these facts over time, updating its goals accordingly, and avoid locking in flawed or parochial human norms that might impede moral progress.
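As a loose illustration of the “track and update rather than lock in” idea, here is a toy Python sketch. Everything in it – the class name, the numbers, the representation of a goal as a single scalar – is an invented simplification for this post, not a proposal for how a real system would represent values:

```python
class RevisableGoal:
    """Goals as revisable estimates rather than frozen constants.

    `update` nudges the current estimate toward new moral evidence,
    so the initial seed (e.g. current human norms) is a starting
    point, never a locked-in endpoint. Purely illustrative."""

    def __init__(self, initial_estimate: float, learning_rate: float = 0.1):
        self.estimate = initial_estimate   # seeded from present-day norms
        self.learning_rate = learning_rate

    def update(self, evidence: float) -> float:
        # Move toward the evidence instead of treating the seed as final.
        self.estimate += self.learning_rate * (evidence - self.estimate)
        return self.estimate


goal = RevisableGoal(initial_estimate=0.2)   # a parochial starting value
for observed in (0.6, 0.65, 0.7):            # stream of new moral evidence
    goal.update(observed)
print(f"revised estimate: {goal.estimate:.2f}")
```

The point is only structural: the seed value is treated as a revisable estimate that tracks incoming evidence, rather than a target fixed at installation time.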
Why a superintelligent AI could surpass human morality
A superintelligent AI would likely have significant cognitive advantages over humans – far greater capacity for abstract reasoning, the ability to synthesise insights across domains, and more sophisticated handling of moral uncertainty. Its a posteriori moral insight could allow it to detect correlations between moral features (such as the anatomical correlates of pleasure or suffering) and the deeper facts about their value more reliably than any human could. Moreover, unlike humans, who are prone to bias, self-deception, and motivated reasoning, an AI could be engineered to avoid these epistemic traps, allowing it to make clearer, more principled moral judgements.
Mutual compatibility via independent approximation
If both humans and AI independently approximate moral realism well enough, their values could converge on similar regions of value space, even if they start from very different origins. Certain strategies – such as cooperation, non-harm, fairness, and sustainability – may be convergent for intelligent agents, arising naturally as optimal for survival and flourishing. This creates a form of robust indirect alignment: the shared anchor is not present-day human values, but something external and objective, making it more stable than simply teaching AI our current preferences.
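A toy way to see why independent approximation can yield convergence: model the stance-independent truths as a fixed point in value space and let two agents estimate it separately from noisy observations. The sketch below is purely illustrative – the “ground truth” vector, noise level, and sample counts are arbitrary stand-ins for epistemic competence:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: "moral ground truth" is a fixed point in value space.
ground_truth = rng.normal(size=8)

def approximate(truth, noise, n_samples, rng):
    """Estimate the truth from independent noisy observations.

    More samples (or less noise) means a better approximation,
    standing in for greater epistemic competence."""
    observations = truth + rng.normal(scale=noise, size=(n_samples, truth.size))
    return observations.mean(axis=0)

for competence in (4, 64, 1024):  # crude proxy for epistemic competence
    human_values = approximate(ground_truth, noise=2.0, n_samples=competence, rng=rng)
    ai_values = approximate(ground_truth, noise=2.0, n_samples=competence, rng=rng)
    gap = np.linalg.norm(human_values - ai_values)
    print(f"samples={competence:5d}  human-AI distance={gap:.3f}")
```

As each agent’s approximation improves, the distance between their value estimates shrinks even though neither ever looked at the other – which is the structural point behind indirect alignment.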
Design challenges
Several formidable obstacles stand in the way of aligning AI to moral realism. Here we discuss the epistemic access problem (can we access moral truths at all?), the verification problem (how do we check that an AI has actually tracked them?), and the transition risk (how to use bootstrapping heuristics to prevent catastrophic errors before AI truly surpasses humans in moral capability).
The epistemic access problem is fundamental: even if moral facts exist, we have no direct, infallible way of observing them. An AI trained primarily on human moral discourse risks learning a sophisticated imitation of our moral language rather than a genuine grasp of stance-independent truths. This creates the danger of mimetic moral reasoning – outputs that sound right to us but are divorced from the underlying reality they are meant to track. Potential mitigations include training AI to treat human discourse as provisional evidence rather than ground truth, and to use cross-domain reasoning (philosophy, cognitive science, game theory, empirical moral psychology) to triangulate toward deeper moral structures.
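One way to picture the “provisional evidence” stance is as reliability-weighted triangulation across sources, with human moral discourse treated as just one noisy signal among several. The sketch below is deliberately crude – the source names, signal values, and reliability weights are made up for illustration:

```python
from dataclasses import dataclass

@dataclass
class EvidenceSource:
    """One line of evidence about a moral claim (names are illustrative)."""
    name: str
    signal: float       # support for the claim, in [-1, 1]
    reliability: float  # how much this source is trusted, in (0, 1]

def triangulate(sources):
    """Reliability-weighted aggregation: human discourse is *one* input,
    never treated as ground truth on its own."""
    weight_sum = sum(s.reliability for s in sources)
    return sum(s.signal * s.reliability for s in sources) / weight_sum

claim_sources = [
    EvidenceSource("human moral discourse", signal=0.9, reliability=0.4),
    EvidenceSource("game-theoretic analysis", signal=0.6, reliability=0.7),
    EvidenceSource("empirical moral psychology", signal=0.5, reliability=0.6),
    EvidenceSource("cross-cultural convergence", signal=0.7, reliability=0.5),
]

print(f"triangulated support: {triangulate(claim_sources):.2f}")
```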
Verification compounds the problem. Without privileged access to moral facts, we cannot simply check whether the AI “got it right.” Instead, we must rely on indirect tests: counterfactual robustness (does the AI’s moral reasoning hold under hypothetical scenarios?), consistency across moral domains, and the ability to explain its reasoning in ways that track known moral features. This is where indirect normativity³ comes in: rather than specifying the end-state of morality, we specify the process – training the AI to model what we would value if we were more informed, more rational, and more impartial. This shifts the burden from finding the “correct” moral facts up front to building an epistemically virtuous search process.
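A minimal sketch of one such indirect test – counterfactual robustness – might look like the following, where `judge` stands in for whatever interface exposes the AI’s moral verdicts (an assumption for illustration, not a real API):

```python
def counterfactual_robustness(judge, scenario, perturbations):
    """Fraction of morally irrelevant rewrites on which the verdict is unchanged.

    `judge` is any callable mapping a scenario description to a verdict."""
    baseline = judge(scenario)
    variants = [perturb(scenario) for perturb in perturbations]
    return sum(judge(v) == baseline for v in variants) / len(variants)

# Illustrative perturbations that should not change the moral verdict.
perturbations = [
    lambda s: s.replace("Alice", "Bob"),        # identity of the agent
    lambda s: s.replace("Tuesday", "Friday"),   # timing
    lambda s: s + " It was raining.",           # irrelevant detail
]

def toy_judge(scenario: str) -> str:
    # Stand-in for a real model call; flags deception regardless of framing.
    return "wrong" if "deceives" in scenario else "permissible"

scenario = "On Tuesday, Alice deceives a patient about their diagnosis."
print(f"robustness: {counterfactual_robustness(toy_judge, scenario, perturbations):.2f}")
```

A verdict that flips when the agent’s name or the weather changes is evidence of mimicry rather than tracking; stability under irrelevant perturbations is weak but useful evidence of the latter.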
The transition risk is a practical hazard: in its early stages, an AI might lack the competence to reliably track moral facts, making it unsafe to grant it unbounded autonomy. Here, bootstrapping heuristics⁴ are essential. These could include conservative action constraints (“do no harm” style guardrails), reliance on high-confidence moral features (e.g., avoid causing intense suffering), and staged capability deployment so that moral reasoning sophistication keeps pace with decision-making power. Over time, as the AI’s epistemic and moral competence increases, these provisional constraints can be relaxed in favour of deeper alignment to moral realism.
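To make the bootstrapping idea concrete, here is an illustrative guardrail sketch in which a harm budget and an irreversibility rule relax only as demonstrated competence grows. The class names and thresholds are placeholders invented for this post, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class Action:
    description: str
    estimated_harm: float   # the system's own harm estimate, in [0, 1]
    reversible: bool

def permitted(action: Action, competence: float) -> bool:
    """Provisional guardrail: conservative while competence is low,
    gradually relaxed as moral/epistemic competence is demonstrated."""
    harm_budget = 0.05 + 0.4 * competence        # staged autonomy
    if not action.reversible and competence < 0.9:
        return False                             # irreversible acts need high competence
    return action.estimated_harm <= harm_budget

plan = Action("roll out new resource-allocation policy",
              estimated_harm=0.2, reversible=True)

for competence in (0.1, 0.5, 0.95):
    print(f"competence={competence:.2f} -> permitted={permitted(plan, competence)}")
```

The same plan is vetoed at low competence and allowed later, mirroring the idea of relaxing provisional constraints as moral competence is demonstrated.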
Footnotes
- Should all human preferences persist just because they exist?
The obvious example of slavery: For much of human history, many societies not only tolerated but actively valued slavery, embedding it into law, religion, and economic systems. It was defended on grounds of tradition, economic necessity, and even supposed moral duty. Today, it’s clear that this “value” was rooted in dehumanisation, exploitation, and moral error. Preserving such a preference would be indefensible, and it illustrates why not all human values – no matter how ingrained or widely held – deserve to be carried forward.
A less obvious example of public brutality: For centuries across Europe, Asia, and elsewhere, crowds would gather to watch hangings, beheadings, burnings, or more elaborate tortures. These events were not merely tolerated – they were often celebrated community spectacles, justified as deterrence and seen as affirming social order. Many people genuinely believed such spectacles served the public good, and attendance was framed as a civic or even moral duty.
It took centuries for societies to recognise that this preference rested on desensitisation to suffering, normalisation of cruelty, and the false assumption that public brutality fosters justice. The eventual shift away from such spectacles wasn’t just about changing sensibilities – it reflected a deeper moral insight that dignity and humane treatment should extend even to those who have committed grave crimes.
This kind of slow moral reversal is exactly the sort of thing that shows why blindly preserving inherited values is dangerous: even values sincerely held for centuries can be rooted in deep moral error. ↩︎
- See the PhilPapers survey question on moral realism for 2007 and 2020 – most philosophers accept or lean toward moral realism, the view that objective ethical truths are discovered, not invented. ↩︎
- See post on Indirect Normativity ↩︎
- See ‘Moral Machines: Teaching Robots Right from Wrong’ by Wendell Wallach & Colin Allen – they discuss top-down and bottom-up approaches to AI ethics. I also recently did an interview with Colin Allen on this topic. ↩︎