AI Alignment to Higher Values, not Human Values
The name Homo sapiens means ‘wise man’ – given the many current human-caused precarious states of affairs, this self-description seems a bit of a reach.
What are values? Values are beliefs or ideals that help determine what is important and how one should act. They can be thought of as a framework that guides behavior, motivation, and perceptions.
Human values in aggregate aren’t coherent – there are tensions between short-term gratifications and long-term goals, individualism and collectivism, the desire for freedom and security, equality and meritocracy, justice and forgiveness, human exceptionalism and concern for other species. Certain clusters of values can be made roughly coherent – but even these groupings leave a lot to be desired. And it’s clear that some values are downright evil – especially if our revealed preferences hint at what human values really are.
In Gulliver’s Travels, Jonathan Swift gives the famous judgment of the king of Brobdingnag on the people of England:
“I cannot but conclude the Bulk of your Natives [the people of England], to be the most pernicious Race of little odious Vermin that Nature ever suffered to crawl upon the Surface of the Earth.”
Quite a scathing sentence. Perhaps a balance can be struck between naïve, unthinking acceptance of human values and the kind of misanthropy that calls for human extinction to solve humankind’s flaws. Still, misanthropy as a theoretical lens (not an emotional attitude) can help dispel some of the romantic illusions people often cling to about humanity being the best thing there ever was.
Connor Leahy, who works in AI safety, says in an interview:
“Can there be such a thing that aligns to human values or at least doesn’t have its own goals or whatever?… I think physically there’s no reason to believe it’s not possible… the more interesting question is ‘in practice will we build one?’ and currently the answer is absolutely not… given the current political situation, the current market situation, there is an exactly zero percent chance we will find that answer.”
If you think that ‘zero percent chance’ sounds hyperbolic, I’d agree with you. But I’d also agree that it is very hard. Aligning AI to the aggregate of incoherent human values is like shooting an arrow into an optically illusory bullseye. Not that we’d want to if we could – what Connor didn’t mention is that some human values are clearly unethical.
It’s difficult to get people to agree on which values are best, which makes it very hard to convince everyone that any one value system is most right. Iason Gabriel writes convincingly that we should align AI to some kind of ethical plurality; I broadly agree, at least as an early motivational scaffold.
Do we want an AI that perfectly aligns to human values? If yes, we should be careful what we wish for. The pertinent question isn’t so much about what we desire*, but about what we need – what’s actually best for us.
My take:
Human values are complex and fragile (see Eliezer Yudkowsky) – we disagree on what they are, and many values are in conflict with each other. So it stands to reason that it could be disastrous to align AI to human values.
If:
a) there is zero to low chance of successful alignment to the aggregate of human values
..and
b) there are objectively (or stance-independently) higher values
..then:
c) some of the better angels of human values may also track some of the objectively higher values
..and
d) we should seek to understand, align to, and implement the higher values we don’t currently have
I’ve argued elsewhere that we should wisely motivate contained Oracle AI to increase its own understanding of higher values, then help us understand them, and ultimately implement them. I’m partial to moral realism – but whatever value rightness ends up being, we should align AI to it – and for that matter, if we are ethically serious, we should align ourselves to it too. Nick Bostrom calls this ‘indirect normativity’, which he describes in chapter 13 of his book Superintelligence.
Some background claims about values I’m working from:
- Values are measurable and can be verified
- Values aren’t arbitrary – in the same way engineering principles aren’t arbitrary, since they align to the laws of physics
- It’s highly likely there are values we haven’t discovered yet
- Relativism isn’t true, but its purported virtues can be rolled up into consequentialism
If some values are complex, difficult to cohere, not arbitrary, and hard to get right in general, then indirect normativity seems like a good path for arriving at objectively better values – and for building an AI that would pursue them.
Our current epistemics may be enough to understand these better values, or perhaps our epistemics will need improving before we can fully grasp and select for the best of what exists in value possibility space.
Connor Leahy later says in the aforementioned interview: “…an aligned ASI superintelligence… something called a sovereign… or an angelic-type system, would be a being of incredible intelligence and benevolence – it would be one that fundamentally would help humanity, despite itself, towards a kind of goodness that even humanity itself does not embody. This would be the closest that we have to the concept of an angel – this is the closest thing we have in our vocabulary to describe what this being would be like… depending whether you’re Abrahamic or non-Abrahamic… there are no non-religious terms that accurately describe what this being would be like.”
Appealing to religious exemplifications of goodness can feel transcendent and viscerally compelling – there is a lot of history, and there are many stories, embedded in our culture around religious motifs. Though, as Stephen Jay Gould says, they exist in a magisterium separate from science and rational discourse, and so aren’t amenable to epistemology and empiricism – our most accurate and tractable means of understanding the world. This is where the study of value really should be grounded – not in some purportedly separate realm that we can’t see and verify.
Value space is probably incomprehensibly large, and some of our values are likely already higher values.

Footnotes
- As part of the Sydney Dialogue Summit Sessions, David Wroe sits down with Connor Leahy – YouTube video here.
- What we desire may not be what is best for us. See SEP article on desire.
- Stephen Jay Gould’s concept of non-overlapping magisteria (NOMA) is a model for the relationship between science and religion that proposes that the two fields have separate domains of teaching authority, or “magisteria”, and do not overlap. Gould believed that science focuses on the empirical world, while religion addresses values like morality. He argued that conflict between the two fields can only occur if their domains overlap, which he believed they do not.