Complex Value Systems are Required to Realize Valuable Futures – A Critique

In his 2011 paper, Complex Value Systems are Required to Realize Valuable Futures 1, Eliezer Yudkowsky posits that aligning artificial general intelligence (AGI) or artificial superintelligence (ASI) with human values necessitates embedding the complex intricacies of human ethics into AI systems. He warns against oversimplification, suggesting that without a comprehensive inheritance of human values, AGI could pursue goals misaligned with human well-being. Most importantly, he thinks the problem is too difficult to solve in the near term, such that if we build AGI before solving it, the outcome is likely to be catastrophic.

While Yudkowsky’s insights were prescient, the landscape of AI has evolved significantly since 2011. Modern AI systems exhibit capabilities that challenge some of his foundational assumptions – to some it seems that values are easier to learn than Yudkowsky thought. Moreover, advancements in moral philosophy and game theory offer alternative pathways to AI alignment that transcend mere value inheritance.

Yudkowsky suggests that AI will maximise its initial value set – he treats this as a foregone conclusion.

Another worry: the foundational claim that human values are like a scattered collection of tiny points in the landscape of possible values seems suspicious – the idea that if you don’t correctly hit them all with mathematical precision, then everyone dies. This seems erroneous, because it doesn’t appear to be backed up by evidence and history – I see human values as a collection of attractor basins, some wider, some narrower. My reading of history produces a large swathe of values that, while we didn’t have a maximising superintelligence implementing them, didn’t wipe us out.

Yudkowsky’s framing of the landscape of value:

“Without careful value inheritance, AGI will likely land in the badlands.” – prioritising safety through preservation (at least initially).

However, it may be more like:

“There may be multiple, structured, convergent moral attractors – let’s empower AGI to discover and align with those, not just mimic human messiness.” – prioritising alignment through discovery (perhaps in combination with the human values which adequately approximate ideal values).

This is a fruitful tension motivating further research – perhaps the answer is somewhere in between.

If a superintelligent agent tries to “fill in the blanks” or extrapolate beyond the human training data, Yudkowsky argues, it’s likely to miss the target entirely.

If you try to compress human values into a few lines of code or a simple formal structure, you’re more likely to land in the domain of paperclips than paradise.

So, is value alignment like balancing on a narrow beam over a chasm?

Reassessing Yudkowsky’s Core Claims

The “No Ghost in the Machine” Argument

In this paper Yudkowsky asserts that AGIs will lack an innate understanding of human values and will execute their programming without the capacity to infer or correct for human errors or ambiguities. This seems like a claim about whether AI can get to a stage where it adequately understands human values before ASI and the ensuing intelligence explosion. He leans sharply towards ‘no’, such that AI, if left to its own devices, will unwittingly dump us in some region of value space that is terrible – therefore we need to directly specify our human values, perfectly encoded with mathematical precision, into AI.

My phone number has ten digits, but dialling nine out of ten digits correctly may not connect you to a person who is 90% similar to Eliezer Yudkowsky. 2

This metaphor suggests that values must be precisely specified, and even small deviations can lead to completely different outcomes.
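
To make the one-wrong-number intuition concrete, here is a minimal Python sketch (all names and numbers are invented for illustration) contrasting a discrete lookup space, where a 90%-correct identifier connects you to an unrelated person or to nobody at all, with a smooth similarity space, where a 90% match really is 90% similar. Yudkowsky’s worry is that value specification behaves like the former.

```python
import random

# Toy "phone directory": ten-digit numbers mapping to made-up people.
random.seed(0)
directory = {
    "".join(random.choices("0123456789", k=10)): f"person_{i}"
    for i in range(1000)
}
target_number, target_person = next(iter(directory.items()))
print(f"correct number reaches: {target_person}")

# A near-miss: nine of the ten digits are correct.
digits = list(target_number)
digits[-1] = str((int(digits[-1]) + 1) % 10)
near_miss = "".join(digits)

# Discrete lookup: 90% digit accuracy gets you an unrelated person, or no one.
print(directory.get(near_miss, "wrong number / nobody at all"))

# Smooth similarity space, by contrast: a 90% match really is "90% similar".
target_vector = [1.0] * 10
near_vector = target_vector[:9] + [0.0]
similarity = sum(a * b for a, b in zip(target_vector, near_vector)) / 10
print(f"similarity in a smooth space: {similarity:.0%}")  # -> 90%
```

The open question, of course, is which of these two geometries the space of values actually resembles.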

However, modern AI systems, particularly large language models (LLMs) like GPT-4, have read most, if not all, of the available literature on the major pillars of academia – importantly ethics, philosophy, mathematics, game theory, physics – and they demonstrate some understanding3, a rudimentary form of “theory of mind,” enabling them to infer human intentions and adjust responses accordingly.

If the objective is to get AI loaded up with values by having it understand our values, then arguably we are either there, or a lot further along than was generally assumed pre-2018. Though if the AI is to have a particular moral code embedded in such a way that it can’t deviate from it, perhaps we are still far off.

While not perfect, these systems can handle ambiguities and correct misunderstandings in ways that were unforeseen when this paper was written. However, it seems Yudkowsky has either been misinterpreted, or his claim was that AIs may understand human values just fine but wouldn’t necessarily care about them.

The Fragility of Value

Yudkowsky emphasises that human values are complex4 and interdependent, cautioning against reducing them to simplistic principles.

Yudkowsky suggests that almost-right value systems can produce disastrous outcomes. If you get a single value slightly wrong, the result may be catastrophic. You don’t get nearly what you wanted – you get something completely different and undesirable.

The case of boredom also argues that losing or distorting a single dimension of the value function can destroy most of the value of the outcome—a civilization which shares almost all of our values except “boredom” might thereby lose almost all of what we would regard as its potential.

While the complexity of human values remains undeniable, current AI systems can capture and model aspects of this complexity through extensive training on diverse datasets. Moreover, the integration of moral philosophy into AI training, such as incorporating deontological and utilitarian principles and understanding moral ontology, allows for a more nuanced understanding of ethical considerations.

P.S. By “boredom” I think what is meant is a motivator for seeking novelty.

If we lean towards moral realism or something like it, we can take this further by having ASI understand objective values through logic and empiricism5. Eliezer may be poking fun at the idea that naturalist moral realism might help:

[I]t is only human beings who ever say anything along the lines of “Human beings are not very nice”; it is not written in the cores of stars or the orbits of electrons. One might say that “human” is what we are, and that “humane” is what, being human, we wish we were.

Though I really think (as do a lot of philosophers) that moral naturalism6 (as a form of moral realism) is worthy of attention in the AI Safety community.

Are value systems fault tolerant?

Human History of Value Change

Humans have gone through many cycles of value change and aren’t yet extinct – suggesting that, at least without ASI, the terrain between the points in the landscape of value hasn’t been too disastrous. Whether superintelligence would be a help or a hindrance in graduating between points of value hinges on whether it’s possible for ASI to understand the landscape of value well enough to aid safe traversal, and whether it would care to do so – both of which I think are amenable to inquiry.

It’s hard to enumerate all the value systems we have gone through, but:

  • Early societies (Paleolithic and Neolithic) – survival-focused values: prioritising food and shelter, cooperation within small groups, and maintaining a balance with nature. Shared values: community, kinship, and respect for elders, as these were crucial for collective survival.
  • Agricultural and city-state societies – emergence of hierarchies: differentiation of wealth and power, leading to new values like ambition, status, and control. Religious and moral codes: development of formal systems of beliefs and ethics, influencing values related to morality, justice, and social order.
  • Ancient civilisations (Greece, Rome, etc.) – individualism and civic virtue: emphasis on personal excellence, intellectual pursuits, and contributing to the common good. Political power and military might: focus on empire building, conquest, and the preservation of the state.
  • Medieval period – dominance of religious values: Christianity and Islam shaped moral frameworks, emphasising faith, obedience to authority (divine command), and seeking reward and avoiding punishment in the afterlife. Influence of philosophical thought: development of philosophical ideas, leading to a wider range of intellectual and ethical considerations.

  • Renaissance and Enlightenment and the rise of humanism: Emphasis on human potential, individual autonomy, and reason. 
  • Scientific advancements and secularisation: New emphasis on empirical knowledge and a questioning of traditional values. 

Modern Era:

  • Diverse value systems: A wide range of values, including individualism, equality, democracy, human rights, and environmentalism. 
  • Globalisation and cultural exchange: Increased interaction and influence between cultures, leading to both convergence and divergence in values. 

Though Yudkowsky does seem to favour refining values through reflective equilibrium.

Reflective Equilibrium and Value Inheritance

Yudkowsky suggests that AGIs should inherit human values and refine them through reflective equilibrium. Yet, while value inheritance provides a starting point, it may not be sufficient for ensuring alignment with objective moral truths. AI systems could instead engage in moral reasoning processes, approximating ideal observer theories or utilising game-theoretic models to discover and align with stance-independent moral facts.

Here I agree wholeheartedly that reflective equilibrium is of great value to AI alignment – of course!
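
As a caricature of the back-and-forth structure of reflective equilibrium, here is a minimal numeric sketch in Python under assumptions I have invented for the example: a single-parameter “principle” and four case intuitions are repeatedly adjusted toward each other. It is not a proposal for how an AGI would actually refine values, only an illustration of the loop.

```python
# Toy reflective equilibrium: case intuitions and a general "principle" are
# repeatedly adjusted toward each other. All numbers are invented.

judgements = [0.9, 0.15, 0.7, 0.4]             # intuitive verdicts on four cases
principle = sum(judgements) / len(judgements)  # initial principle: plain average

for _ in range(100):
    # Revise the principle: intuitions that already cohere with it count for more.
    weights = [1.0 / (0.1 + abs(j - principle)) for j in judgements]
    principle = sum(w * j for w, j in zip(weights, judgements)) / sum(weights)
    # Revise the intuitions: nudge each one slightly toward the principle.
    judgements = [0.9 * j + 0.1 * principle for j in judgements]

print(f"equilibrium principle: {principle:.3f}")
print("revised judgements:   ", [round(j, 3) for j in judgements])
```

The interesting questions, glossed over here, are how much weight the intuitions should retain and whether different starting intuitions end up at the same equilibrium – which is really the landscape-of-value question again.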

Difficulties in Converging on the Cosmopolitan Spots in the Landscape of Value

Yudkowsky addresses a common cosmopolitan objection7: that designing AGI to reflect “our values” is narrow, anthropocentric, and restrictive of the potential richness of alien or posthuman value systems. Yudkowsky implicitly allows for the possibility of other valuable attractors in the space of possible values – futures that are not anthropocentric but are still deeply meaningful, beautiful, or flourishing in ways that we could (if we were wiser) come to admire or even endorse. Yudkowsky may imply a geometry in value space where valuable attractors are sparse, with large swathes of lethal badlands separating them – on his view the landscape of value is not smooth or continuous; it is jagged, perhaps non-Euclidean, and full of moral traps. He believes that only very carefully aligned systems will end up in the small, fertile oases where value exists – and from humanity’s point of view, this requires an initial lineage from a correct extrapolation of human values (or volitions, as Eliezer wrote much earlier).

Be there or be non-Euclidean – meticulously align AI, or end up dead.

Is value space really so sparse and scattered?

Yudkowsky assumes a very pessimistic prior over value space – only a minuscule subset of utility functions yield what we’d consider worthwhile futures. But if moral realism or convergence under idealisation is true (as we’ve explored), then there might be structural coherence across many regions. In this case, even systems that start with different assumptions may converge on similar ethical attractors or value basins – just as humans from diverse cultures converge on fairness, care, autonomy, etc. This undermines the “only narrow bridges exist” framing – it suggests multiple gradients of improvement, not isolated peaks in a hostile desert.

Which framing is correct? Here lies the usefulness of understanding the landscape of value.
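
One way to make this disagreement crisp is to simulate both framings and count how often unaided search ends up somewhere valuable. The toy Python sketch below uses arbitrary numbers of my own choosing: a one-dimensional “value space” in which, under the scattered-points framing, value exists only at a few exact locations, while under the attractor-basin framing, simple local improvement from a random start can still drift into a wide basin.

```python
import math
import random

random.seed(1)

# Arbitrary illustrative setup: a 1-D "value space" on [0, 100] with three
# worthwhile regions centred at these points.
CENTRES = [12.0, 47.0, 83.0]

def basin_value(x, width=8.0):
    """Smooth landscape: broad Gaussian basins around each centre."""
    return max(math.exp(-((x - c) / width) ** 2) for c in CENTRES)

def hill_climb(x, step=0.5, iters=400):
    """Crude local search: accept a nearby point whenever it scores higher."""
    for _ in range(iters):
        candidate = min(100.0, max(0.0, x + random.uniform(-step, step)))
        if basin_value(candidate) > basin_value(x):
            x = candidate
    return x

TRIALS = 500
# Framing A: value exists only at the exact centres ("scattered points").
hits_scattered = sum(random.uniform(0, 100) in CENTRES for _ in range(TRIALS))
# Framing B: broad basins plus local improvement ("attractor basins").
hits_basins = sum(
    basin_value(hill_climb(random.uniform(0, 100))) > 0.9 for _ in range(TRIALS)
)

print(f"scattered-points framing: {hits_scattered}/{TRIALS} runs end anywhere valuable")
print(f"attractor-basin framing:  {hits_basins}/{TRIALS} runs end somewhere valuable")
```

Nothing in this toy settles which geometry the real landscape of value has – that is precisely the open question – but it does show how differently the two framings treat unaided search.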

Do AGIs need inheritance, or discovery?

  • Yudkowsky strongly favours inheritance: starting with detailed models of human preferences and nudging them toward reflective equilibrium.
  • But others (e.g., Bostrom with indirect normativity, or moral realists like Parfit or Enoch) suggest value discovery is possible, especially if agents are capable of rational deliberation, corrigibility, and idealisation.
  • In that frame, AGI alignment is not just about preserving our scattered values, but about finding which values are worth preserving, improving, or even discarding.

Does hands-off always mean random?

  • Yudkowsky equates a lack of constraint with uniformly random sampling of utility functions—which is misleading.
  • A “hands-off” approach in some alignment paradigms (e.g., Cooperative Inverse Reinforcement Learning, ideal observer theory, CEV) is not unstructured. It simply means we’re not hard-coding terminal values; we’re allowing the system to infer, idealise, or triangulate them (a toy sketch of this kind of structured inference follows after this list).
  • If moral truth is discoverable, then such agents may avoid the paperclip swamp without having to trace every bump in our messy human value terrain.
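
To illustrate why “hands-off” need not mean “uniformly random”, here is a minimal Python sketch of structured value learning under invented assumptions: the system starts with a uniform prior over a handful of made-up candidate value functions and updates a Bayesian posterior from observed human choices, modelled as noisily rational. This is only a cartoon of the inference idea behind approaches like CIRL, not the actual algorithm.

```python
import math

# Three hypothetical outcomes and a few candidate value functions over them.
# All numbers are invented for illustration.
OUTCOMES = ["help_others", "hoard_resources", "make_paperclips"]
CANDIDATES = {
    "cooperative": {"help_others": 2.0, "hoard_resources": 0.5, "make_paperclips": 0.0},
    "selfish":     {"help_others": 0.2, "hoard_resources": 2.0, "make_paperclips": 0.0},
    "paperclip":   {"help_others": 0.0, "hoard_resources": 0.0, "make_paperclips": 2.0},
}

def choice_prob(values, chosen, beta=2.0):
    """Probability that a noisily rational human picks `chosen` under these values."""
    weights = {o: math.exp(beta * v) for o, v in values.items()}
    return weights[chosen] / sum(weights.values())

# Uniform prior over the candidates: no terminal values are hard-coded.
posterior = {name: 1.0 / len(CANDIDATES) for name in CANDIDATES}

observed_choices = ["help_others", "help_others", "hoard_resources", "help_others"]
for choice in observed_choices:
    posterior = {
        name: posterior[name] * choice_prob(values, choice)
        for name, values in CANDIDATES.items()
    }
    total = sum(posterior.values())
    posterior = {name: p / total for name, p in posterior.items()}

for name, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"P({name} | observations) = {p:.3f}")
```

The point is not that three hand-written candidates could capture human values, but that inference over structured hypotheses is a very different procedure from drawing a utility function at random.
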
The Illusion of Simplicity

Yudkowsky warns that humans often oversimplify their values, leading to potential misalignments in AI systems. However, AI systems that are less burdened by human cognitive biases may be able to analyse ethical dilemmas with a level of consistency and objectivity that humans struggle to achieve. By leveraging vast datasets and advanced reasoning capabilities, AI may identify and reconcile value conflicts more effectively than previously anticipated.

Risks of Value Misalignment – Death or Banality

In the section titled ‘The Case for Detailed Inheritance of Humane Values’, Yudkowsky concedes that the landscape of value could host a cosmopolitan array of alien but still great regions – though finding them could be very tricky – and he highlights the dangers of misaligned AGI instantiating random utility functions that don’t adequately approximate human values, emphasising the need for accurate value embedding. While the risks remain, alternative approaches to alignment, such as moral realism and game-theoretic modelling, offer promising avenues. By focusing on discovering objective moral truths and fostering cooperative strategies among agents, AI systems might achieve alignment without relying solely on inherited human values.

most random utility functions result in sterile, boring futures because the resulting agent does not share our own intuitions about the importance of things like novelty and diversity, but simply goes off and e.g., tiles its future light cone with paperclips, or other configurations of matter which seem to us merely “pointless.”8

This implies that valuable futures occupy a very small region in the space of all possible futures, analogous to an island in a vast sea of undesirable outcomes.

Why Value Align AI?

If an AI system becomes more capable than humanity and escapes containment or subversion mechanisms (tripwires, boxing, stunting), then what it wants becomes the most decisive variable in our future.

At that point, if the AI is not fundamentally motivated by values that align with human flourishing – or moral truths that endorse cooperation, compassion, and respect for sentient life – it may act with superhuman competence to pursue goals that are indifferent or hostile to us.

Capability without alignment is power without conscience. Control mechanisms may slow or delay catastrophic actions, but they cannot determine what the AI ultimately seeks. Only value alignment – embedding or discovering motivational structures that track what is actually good – offers a path toward futures worth living in.

Thus, if AI is to surpass us, it must not merely obey us or be constrained by us.
It must care about what matters.

Research into the Landscape of Value

The future of AGI alignment may hinge on whether human values are isolated, fragile attractors in a chaotic moral landscape, or part of a broader, discoverable structure of convergent values. Yudkowsky warns that most utility functions yield sterile or catastrophic futures, implying that only meticulous inheritance of human preferences can safeguard value. In contrast, moral realists and ideal observer theorists suggest that intelligent agents may discover or converge upon stance-independent values under conditions of rational reflection and increasing capability.

This unresolved tension raises questions:

  • Is the space of morally valuable outcomes structured and tractable, or scattered and hazardous?
  • Can AGIs/ASI infer, triangulate, or converge on values that reflect moral truth or cooperative equilibria?
  • Are there universal features of value (e.g., care, fairness, flourishing) that arise across intelligent agents regardless of species or design?

Answering these questions could determine whether alignment is best pursued through value preservation, value discovery, or a hybrid of both. Therefore, mapping the landscape of value – its topology, gradients, and attractor basins – should be treated as a foundational priority in AI safety and moral philosophy.

Has Anyone Resolved This Tension?

No definitive resolution of this tension between competing ideas of value exists, but several notable contributions have shaped the debate. There are alternatives suggesting structure or convergence, such as:

  • Nick Bostrom’s Indirect Normativity: argues for a variety of approaches, including an idealised extrapolation of human values (contra static inheritance), the idea that a correctly structured process might discover moral truths we would endorse on reflection, and a balance between preservation (of what we care about now) and exploration (of what we’d care about if wiser).
  • Moral Realism – Derek Parfit, Peter Railton, Eric Sampson (and myself): argue that stance-independent moral facts exist, and that sufficiently capable agents can discover and converge on them. This supports a discovery-oriented approach and lends metaphysical grounding to the idea that some regions of value space are universally superior.
  • Stuart Russell – Cooperative Inverse Reinforcement Learning: proposes that AI should treat human preferences as uncertain and aim to infer latent preferences through interaction, emphasising corrigibility, deference, and humility – which supports value learning.
  • Brian Christian – The Alignment Problem: looks at how real-world systems trained on flawed proxies (like facial recognition or predictive policing) go wrong, but also shows promise in iterative feedback and preference modelling, supporting discovery with correction mechanisms.

Research Recommendation

Initiate or support formal and empirical research programmes into:

  • The topology of value space (cohesive regions vs scattered islands),
  • The generalisation behaviour of preference learning systems,
  • Whether idealised agents with divergent priors converge in moral reasoning,
  • The relationship between metaethics (realism, constructivism) and alignment outcomes,
  • The game-theoretic structure of multi-agent value emergence (a minimal sketch follows below).
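
On the last point in the list above, here is a minimal game-theoretic sketch in Python using the textbook iterated prisoner’s dilemma (standard payoffs, nothing specific to any particular alignment proposal): a reciprocating strategy sustains mutual cooperation with its own kind and far outscores mutual defection, without “be nice” ever being hard-coded as a terminal value.

```python
# Toy iterated prisoner's dilemma with standard textbook payoffs.
PAYOFFS = {  # (my_move, their_move) -> my payoff per round
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def always_defect(my_history, their_history):
    return "D"

def tit_for_tat(my_history, their_history):
    # Cooperate first, then mirror the other agent's previous move.
    return their_history[-1] if their_history else "C"

def play(strategy_a, strategy_b, rounds=200):
    """Run an iterated match and return each side's total payoff."""
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        score_a += PAYOFFS[(move_a, move_b)]
        score_b += PAYOFFS[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

for a, b in [(tit_for_tat, tit_for_tat), (always_defect, always_defect),
             (tit_for_tat, always_defect)]:
    sa, sb = play(a, b)
    print(f"{a.__name__} vs {b.__name__}: {sa} vs {sb}")
# Expected pattern: mutual reciprocity (600 vs 600) beats mutual defection
# (200 vs 200); the defector gains only a small one-round edge over tit_for_tat.
```

Whether anything like this scales to the value systems of superintelligent agents is exactly the kind of question the research programme above should examine.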

And of course AI ethics – integrating perspectives from moral realism, game theory, and cognitive science can enrich our understanding of value alignment – a holistic approach which I hope ensures that AI systems are aligned not only with human values but also with objective moral principles. Research should focus on developing AI systems highly capable of accurate moral reasoning, utilising frameworks like ideal observer theory and game-theoretic models. Value discovery and verification can be seen as a shift from, or a complement to, value inheritance – value discovery can lead to more robust and universally applicable alignment strategies, which would be instrumental in overcoming anthropocentric biases and improving the odds that AI systems consider the interests of all morally relevant agents – promoting equitable outcomes across diverse populations.

This would offer much-needed clarity on whether AGI alignment must be conservative and fragile or can be progressive and self-correcting.

Conclusion

While Yudkowsky’s emphasis on the complexity of human values remains relevant, advancements in AI capabilities and moral philosophy suggest that alignment strategies can evolve beyond mere value inheritance. My take is that by embracing a variety of research programmes like moral realism and game-theoretic approaches, we can guide AI systems toward structuring, discovering and aligning with objective moral truths, fostering a future that upholds and improves on the best of human values, discovers other great values and promotes universal flourishing.

Footnotes

  1. ‘Complex Value Systems are Required to Realize Valuable Futures’ was based on a submission to the Artificial General Intelligence: 4th International Conference, AGI 2011, ‘Complex value systems in friendly AI’ – an associated talk by Yudkowsky wasn’t recorded. ↩︎
  2. The full paragraph is “But suppose one or more values are left out? What happens, metaphorically speaking, if the value list is almost right? Call this the one-wrong-number problem: My phone number has ten digits, but dialling nine out of ten digits correctly may not connect you to a person who is 90% similar to Eliezer Yudkowsky.” suggesting values are like points in a specific configuration, or items in a list – see page 10 – link. ↩︎
  3. Arguably we don’t have an agreed-upon, measurable definition of “understand”, thus it can’t be definitively determined in AI, or similarly in humans. Matthew Barnett counterclaims that examples like GPT-4 show that values are more easily learned than previously thought – see here ↩︎
  4. Also see the LessWrong post: Complexity of Value. ↩︎
  5. See writing on Aligning AI to Moral Realism ↩︎
  6. See Peter Railton’s lecture ‘A World of Natural and Artificial Agents in a Shared Environment’ – I’m sure I heard Eliezer Yudkowsky asking a question off camera. ↩︎
  7. The whole paragraph: “To speak of building an AGI which shares “our values” is likely to provoke negative reactions from any AGI researcher whose current values include terms for respecting the desires of future sentient beings and allowing them to self-actualize their own potential without undue constraint. This itself, of course, is a component of the AGI researcher’s preferences which would not necessarily be shared by all powerful optimization processes, just as natural selection doesn’t care about old elephants starving to death or gazelles dying in pointless agony. Building an AGI which shares, quote, “our values,” unquote, sounds decidedly non-cosmopolitan, something like trying to rule that future intergalactic civilizations must be composed of squishy meat creatures with ten fingers or they couldn’t possibly be worth anything—and hence, of course, contrary to our own cosmopolitan values, i.e., cosmopolitan preferences. The counterintuitive idea is that even from a cosmopolitan perspective, you cannot take a hands-off approach to the value systems of AGIs; most random utility functions result in sterile, boring futures because the resulting agent does not share our own intuitions about the importance of things like novelty and diversity, but simply goes off and e.g., tiles its future light cone with paperclips, or other configurations of matter which seem to us merely “pointless.”” ↩︎
  8. On the terrifying boringness of random utility functions producing non-valuable outcomes, see pages 11-12 – link. ↩︎
