AI Alignment to Higher Values, not Human Values

“Align AI to human values” sounds sensible, even comforting. But once you look closely, it becomes far less clear that this is the right target.

What are “higher values”? I don’t mean values I personally approve of, but with nicer branding – I mean better-justified values discovered through improved moral and epistemic reflection.1

What are values? Values are beliefs or ideals that help determine what is important and how one should act. They can be thought of as a framework that guides behaviour, motivation, and perception.

Human values are fractured and some are bad

The term homo sapiens means ‘wise man’; given the many current human-caused precarious states of affairs, this self-description seems a bit optimistic.

Human values are not one thing – they are fractured, context-sensitive, and often in conflict. We want freedom and security, equality and excellence, mercy and justice, short-term gratification and long-term flourishing. Even within a single person, values can clash. If we try to aggregate across billions of people, the idea that there is a single coherent set of human values seems absurd.2

Worse, not everything humans value is worth preserving. Some human values are admirable; some are parochial; some are cruel. Revealed preferences do not automatically deserve moral authority. If people value domination, exclusion, sadism, or reckless status competition, the fact that such values are human does not make them good.

That matters because a lot of AI alignment rhetoric quietly assumes three things:

  1. that human values are coherent enough to form a stable target,
  2. that what humans happen to value is broadly what ought to be valued, and
  3. that an AI trained to satisfy those values would therefore be safe and good.

All three assumptions are shaky.

Raw preference aggregation is the wrong target

None of this by itself proves moral realism; human disagreement is not a knock-down argument for stance-independent moral truth. But it does show something important: “align AI to aggregate human preferences” is a dangerously crude objective. Even if one rejects robust moral realism, it does not follow that current human values, taken as they are, should be locked in and scaled up by superintelligence.
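
To make the incoherence concrete, here is a minimal Python sketch of the classic Condorcet paradox (the agents and value labels are just illustrative): three individually transitive rankings whose pairwise-majority aggregate is cyclic, so there is no coherent “aggregate preference” to align to.

```python
# Condorcet paradox: each agent's ranking is internally coherent
# (transitive), but the pairwise-majority aggregate is a cycle.
# The agents and value labels here are hypothetical illustrations.
rankings = [
    ["freedom", "equality", "security"],  # agent 1: freedom > equality > security
    ["equality", "security", "freedom"],  # agent 2: equality > security > freedom
    ["security", "freedom", "equality"],  # agent 3: security > freedom > equality
]

def majority_prefers(a: str, b: str) -> bool:
    """True if a strict majority of agents rank value a above value b."""
    votes_for_a = sum(1 for r in rankings if r.index(a) < r.index(b))
    return votes_for_a > len(rankings) / 2

for a, b in [("freedom", "equality"), ("equality", "security"), ("security", "freedom")]:
    assert majority_prefers(a, b)
    print(f"majority prefers {a} over {b}")

# Prints:
#   majority prefers freedom over equality
#   majority prefers equality over security
#   majority prefers security over freedom
# The aggregate "preference" is cyclic: no coherent ordering exists.
```

Arrow’s impossibility theorem generalises the point: with three or more options, no aggregation rule can satisfy even a small set of basic fairness conditions at once, so “the aggregate human preference” is not a well-defined target to begin with.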

A more serious goal would be to align AI to better-justified values: values that survive deeper reflection3, wider information, greater coherence, broader moral concern, and better epistemic standards. In other words, not simply what we want now, but what is genuinely worth wanting.4 In the near term, pluralist alignment may be more palatable, and perhaps a necessary scaffold, even if it is not the final destination.

Reflective value refinement and discovery

Process, not just product. We shouldn’t just take what humans value and insert it into AI. We may need to specify a process for better moral and epistemic inquiry. This is where indirect normativity5 becomes important: not as a fixed doctrine to hard-code, but as a guided form of epistemic and moral inquiry. Rather than instructing AI to obey current human preferences, we may need systems motivated to carry forward that inquiry safely and truthfully – helping us discover, understand, refine, and implement the best values we can justify.

That process matters because our current moral cognition is partial and noisy. We are biased, tribal, distractible, and often morally inconsistent. We may not yet be in a position to identify the best values directly. If so, the right move is not to canonise our confusion. It is to build systems that can help us reason beyond it without severing moral concern from human and non-human welfare.

This also suggests that alignment is not only a moral problem but an epistemic one. Better values may require better ways of knowing. A civilisation that cannot reason clearly about truth, trade-offs, sentience, welfare, justice, and long-term consequences is not ready to define the final goals of superintelligence. Moral progress may depend in part on epistemic progress.

So the choice is not simply between “human values” and “inhuman values”. That framing is too crude. The real question is whether we want AI to mirror us at our current level of confusion, bias, and moral fragmentation, or whether we want it to help us reach values that are wiser, less parochial, and more worthy of implementation.

If there are stance-independent moral truths, then AI should be aligned to them rather than to the aggregate noise of present human desire. If there are not, we still have strong reason to prefer values that emerge from more informed, coherent, and impartial reflection over those that merely dominate the present moment.

Either way, “align AI to human values” is not enough.

The safer and more ethically serious aim is this: do not align AI merely to what humans happen to want. Align it to what is most worth valuing.

Summary

Aggregate human values are conflicted, some are misguided, and some are morally bad – so ‘align AI to human values’ is a dangerously crude slogan. A better target is better-justified values reached through deeper moral and epistemic inquiry, perhaps via reflective value refinement and discovery. Value space might be vast; some of our current values may already point in the right direction, but we should remain open to refining them and discovering better ones.

Footnotes

  1. I like the term “higher values” – it sounds stirring and I hope it got your attention, but what I really mean is stuff like better-justified values, values that survive improved moral and epistemic reflection, or values we would endorse under better understanding. ↩︎
  2. Human values in aggregate aren’t coherent – there are tensions between short-term gratifications and long-term goals, individualism and collectivism, the desire for freedom and security, equality and meritocracy, justice and forgiveness, human exceptionalism and concern for other species. Groups of values can be coherent without being grounded, or aligned to “higher values”. And it’s clear when skimming through our history that some value groups we have stumbled on are downright evil. It’s interesting to ask: if we were to stand back and take stock of all that our revealed preferences hint at, would we proudly endorse all of the results? ↩︎
  3. By reflection and “reflective value refinement & discovery” I do not mean to commit to Rawlsian reflective equilibrium in particular. I mean the broader family of approaches that try to improve value judgements under better conditions: greater information, wider perspective, improved coherence, reduced bias, attention to consequences, and more impartial or idealised reflection. That family includes (and is not limited to) realism, reflective-equilibrium style methods, ideal-observer and ideal-preference approaches, pragmatist/Deweyan ethical inquiry, and pluralist procedures aimed at fair principles under disagreement. ↩︎
  4. Admittedly this is a hard sell to the wider public. It’s not only difficult to get people to agree on what values are best, it’s difficult to get people to agree that their own values aren’t best, which might make it very difficult to convince everyone that they should embark on a quest to refine and discover which value system is most right. Iason Gabriel writes convincingly that we should align AI to some kind of ethical plurality – but that’s not the end point. I see alignment to plurality as a partial solution – as an early motivational scaffold. ↩︎
  5. Nick Bostrom wrote about indirect normativity in chapter 13 of his book Superintelligence, and I’ve written about it here. ↩︎
