Ethical Alignment

What do people hear when they say “ethical alignment” (wrt AI)?

Mostly: AI that doesn’t say slurs and follows the EU AI Act. The phrase has been colonised by the AI ethics policy world to mean something like “AI that complies with prevailing human ethical norms” – bias mitigation, fairness auditing, content moderation, the whole responsible-AI apparatus. I get the impression it has close to zero implication of moral realism in standard usage – what a shame. If anything, it connotes the opposite: soft, proceduralist, preference-sensitive.

Within the alignment research community proper, “ethical alignment” tends to be heard as either redundant (since alignment is already supposed to be about building beneficial AI) or as a soft qualifier that flags the philosophical rather than the technical dimension of the problem.

So when I use the phrase to mean alignment to stance-independent moral facts, I may be working against a strong ambient reading. Most readers will initially map it onto something much weaker than what I intend, and so I feel the need to signal the contrast.

Sometimes I blurt out that it means “the AI’s ability to perceive and adhere to universal moral facts” – but this conflates two things that I’ve actually written separate pieces about:

  • Epistemic triangulation: the AI can detect or converge on moral facts.
  • Motivational architecture: the AI actually acts on those facts rather than merely modelling them.

These are genuinely distinct, and my zombie AI post is substantially about the second one – the de dicto/de re gap1, the problem of caring about sentience in the abstract versus having genuine moral salience. A definition of “ethical alignment” that only covers perception and compliance sneakily buries the motivation problem rather than flagging it. An AI that accurately maps moral facts but then optimises for smiley-face proxies hasn’t achieved ethical alignment in any meaningfully grounded sense.

Also, “universal moral facts” is a small but real imprecision. Moral realists argue for stance-independence, not universality in the sense of consensus. I should use “stance-independent” throughout my other writing, which is sharper – but I think it loses people sometimes.

What does ethical alignment mean to me?

And what I think it should mean, stance-independently.

The most useful addition is probably a two-part structure for the definition itself:

Ethical alignment – in the realist sense – is about designing AI to track and respond to moral facts that hold independently of human opinion, rather than to aggregated, extrapolated, or blended human preferences. This involves two distinct requirements: the epistemic capacity to detect or converge on moral facts, and the motivational architecture to act on them genuinely rather than merely model them.

AI has come so far epistemically already, so the second requirement – motivational architecture – is where most of the alignment work likely sits, and it’s what distinguishes my position from naive rule-based “ethical AI” (which handles the compliance part but ignores motivation) and from the zombie AI scenario (where a non-sentient system might accurately represent moral facts but lack the motivational grounding to be reliably moved by them).

I sometimes lean towards the idea that AI could “perceive” moral facts – track them in the natural world – but “perceive” implies direct acquaintance with moral facts, which commits me to a fairly strong epistemological position. “Converge on” or “track” is more neutral about the mechanism and is consistent with both direct perception and indirect discovery via empirical scaffolding – which I’ve argued for elsewhere.

Possible Misreadings

A newcomer would likely get the broad orientation right – distinguishing between aligning AI to human preferences and aligning it to something more objective – but might misread or miss several things that matter.

What they’d probably get right

The contrast is straightforward: in modern AI safety circles, preference-alignment is the status quo, and there’s an alternative that treats ethics as something AI could track rather than aggregate. The terminological warning (that “ethical alignment” already means something weaker in mainstream usage) is clear enough. The two-requirement structure – epistemic and motivational – would probably land as a useful distinction even if they couldn’t explain why it’s necessary.

Several distinct misreadings are predictable, and some are more dangerous than others.

Whose objective facts?

A newcomer will immediately read “stance-independent moral facts” and think it’s an authority grab2 – that I’m simply declaring my own moral views to be objective and building that claim into the alignment target. Without the grounding in phenomenal valence and the discovery/indirect normativity3 framing, the position of ethical alignment looks circular: “align AI to the real moral facts” with no account of where those facts come from or how we access them. The idea of “stance-independence” is doing a lot of work for a reader who doesn’t already know what it means.

Capability control is, at best, a temporary and auxiliary measure. Unless the plan is to keep superintelligence bottled up forever, it will be necessary to master motivation selection.

– Nick Bostrom on indirect normativity, Superintelligence, chapter 12, p. 185

Pattern-matching to natural law or religious ethics

“Universal moral facts that exist independently of human opinion” will ring a bell for many – but the bell it often rings is Aquinas, or divine command theory, or conservative natural law arguments. The phenomenal-valence grounding that actually does the philosophical work is nowhere in evidence. Without it, a newcomer has no way to distinguish my view of ethical alignment from someone claiming God’s law is the alignment target.

The motivation gap being invisible

The intuitive assumption is: if an AI correctly identifies what’s morally right, of course it will act accordingly. The entire zombie AI point – that knowing moral facts and being genuinely moved by them are logically independent – is non-obvious and counter-intuitive. Someone new to this will likely skim past that distinction as philosophical pedantry rather than recognising it as the crux.

The alarming implication that goes unaddressed

The logic of the position entails – and fairly directly – that a sufficiently capable AI might have better moral judgement than humans and that, in some circumstances, its judgement should take precedence. A newcomer following the argument honestly will hit this conclusion and either find it alarming or assume they’ve misread me. Unless the implication is made explicit, readers may either miss it or get spooked by it without the benefit of the actual framing of why it isn’t straightforwardly dangerous.

Interesting but useless

The position sounds, to a practically-minded newcomer, like it requires solving metaethics before AI alignment can proceed – which seems to make it a philosophical position with no operational content. The indirect normativity framing (that AI can converge on moral facts without us having solved metaethics in advance) is the answer to this, but again it’s not visible from the current partial definition or diagram.

Footnotes

  1. Thanks David Enoch for your brilliant mind! ↩︎
  2. Authority grab – an attempt by an individual, group, or institution to establish themselves as the supreme arbiter of right and wrong to gain legitimacy, power, or influence. Often a strategic move to define societal norms or justify actions, frequently appearing in politics as a struggle for legitimacy. ↩︎
  3. See post on indirect normativity. ↩︎


