The Architecture of Value
Implications for AI Safety and Indirect Normativity
The challenge of understanding value space has moved from a philosophical curiosity to an urgent technical problem. As we approach the development of advanced artificial intelligence, the structure of value and our ability to specify it has become perhaps the most critical challenge facing humanity. This challenge assumes that there is an objective value space. That assumption seems to be the dominant view among philosophers (see the 2020 PhilPapers survey), and it draws support from the convergent evolution of ethical intuitions, from the stable equilibria that game theorists find agents tending toward, and from the shared anatomical foundations of pain and pleasure. All of this suggests that we're not merely mapping human preferences but uncovering fundamental features of reality itself.
The project of AI alignment fundamentally depends on our ability to understand and specify values with sufficient precision that we can encode them into artificial systems. Yet we face a paradox: we must specify values precisely enough to guide superintelligent systems while acknowledging that our current understanding of value is incomplete and possibly flawed (we disagree on which values are best, and our value systems are incoherent). This is where indirect normativity becomes crucial.
Indirect normativity, as developed by the philosopher Nick Bostrom and the decision theorist Eliezer Yudkowsky, suggests that instead of trying to directly specify all our values, we should create systems that can learn and extrapolate values in alignment with moral reality. This approach takes moral realism seriously while acknowledging our epistemic limitations. Just as science progressively uncovers the laws of physics, AI systems might help us uncover the objective structure of value space.
The mathematics of value learning offers a framework for this endeavor. Stuart Armstrong’s work on value learning suggests that while individual human values might be complex and contradictory, there exists an underlying structure that could be learned through careful observation and inference. This connects with moral realism in a profound way: if moral facts are real, then value learning becomes not just preference aggregation but actual discovery of moral truth.
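To make this concrete, here is a minimal sketch of value learning as Bayesian inference, assuming a Boltzmann-rational (noisily rational) choice model. The hypotheses, options, and utility numbers are all illustrative assumptions introduced here, not Armstrong's actual formalism; the point is only that noisy, even contradictory choices can still shift probability toward the underlying value structure that best explains them.

```python
import math

# Candidate "underlying value structures": each hypothesis assigns a utility
# to each option. The hypotheses, options, and numbers are illustrative only.
HYPOTHESES = {
    "care_dominant":     {"help": 2.0, "ignore": 0.0, "harm": -2.0},
    "fairness_dominant": {"help": 1.0, "ignore": 0.5, "harm": -1.0},
    "indifferent":       {"help": 0.0, "ignore": 0.0, "harm": 0.0},
}

def choice_likelihood(utilities, chosen, options, beta=1.0):
    """Boltzmann-rational choice model: higher-utility options are chosen
    more often, but noisily (beta controls how noisy)."""
    weights = {o: math.exp(beta * utilities[o]) for o in options}
    return weights[chosen] / sum(weights.values())

def posterior(observations, beta=1.0):
    """Bayesian update over value hypotheses from observed (chosen, options) pairs."""
    post = {h: 1.0 / len(HYPOTHESES) for h in HYPOTHESES}  # uniform prior
    for chosen, options in observations:
        for h, utilities in HYPOTHESES.items():
            post[h] *= choice_likelihood(utilities, chosen, options, beta)
    z = sum(post.values())
    return {h: p / z for h, p in post.items()}

# Noisy, even contradictory behavior still shifts probability mass toward
# the hypothesis that best explains the observations overall.
observed = [("help", ["help", "ignore", "harm"]),
            ("ignore", ["help", "ignore"]),
            ("help", ["help", "harm"])]
print(posterior(observed))
```

In principle the same machinery extends from toy hypotheses to richer parametric families of value functions; the hard part is modeling human irrationality well enough that the inference tracks values rather than biases.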
Several key dimensions of this problem space demand attention:
The Structure of Moral Reality
If moral realism is true, value space isn’t merely a map of possible preferences but a territory with real features. Some regions of this space represent objective moral truths, while others represent various degrees of departure from these truths. This suggests that value learning should be understood not as preference learning but as moral discovery.
Computational Tractability
The notion that we can map value systems might seem, at first blush, as quixotic as attempting to count the stars. Yet, just as astronomy progressed from naked-eye observations to sophisticated models of the cosmos, our understanding of value systems has evolved from intuitive philosophy to systematic investigation. This progression reveals an underlying structure to what was once thought purely subjective.
The challenge of navigating value space becomes more concrete when we consider implementation. Even if moral facts exist, the space of possible value configurations might be vast or infinite. This raises crucial questions about how to design AI systems that can efficiently explore and learn from this space while avoiding dangerous local optima.
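A toy search makes the tractability worry concrete. In the sketch below the coherence function is a stand-in objective with many local optima, not a claim about moral truth, and all numbers are assumptions; it simply shows how a grid over six continuous axes multiplies into a vast space and how naive greedy search stalls on local optima that random restarts only partially mitigate.

```python
import math
import random

DIMENSIONS = 6    # e.g. one coordinate per moral axis
LEVELS = 101      # grid points per axis; the full grid has LEVELS ** DIMENSIONS configurations

def coherence(config):
    """Stand-in objective with many local optima. We have no such oracle for
    moral truth; this function exists only to illustrate the search problem."""
    return sum(math.sin(10 * x) + 0.3 * x for x in config)

def neighbors(config, step=1.0 / (LEVELS - 1)):
    """Configurations reachable by nudging one coordinate up or down."""
    for i in range(DIMENSIONS):
        for delta in (-step, step):
            v = config[i] + delta
            if 0.0 <= v <= 1.0:
                yield config[:i] + (v,) + config[i + 1:]

def hill_climb(start):
    """Greedy local search: move to the best neighbor until nothing improves."""
    current = start
    while True:
        best = max(neighbors(current), key=coherence, default=current)
        if coherence(best) <= coherence(current):
            return current
        current = best

random.seed(0)
# A single greedy run usually stalls on a local optimum; random restarts
# (a crude remedy) tend to find a better configuration in this toy landscape.
single = hill_climb(tuple(random.random() for _ in range(DIMENSIONS)))
restarts = max((hill_climb(tuple(random.random() for _ in range(DIMENSIONS)))
                for _ in range(20)), key=coherence)
print(round(coherence(single), 3), round(coherence(restarts), 3))
```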
Robustness and Corrigibility
The instrumentally convergent subgoal of goal content integrity presents one of the deepest challenges in AI alignment. Any sufficiently intelligent system will seek to preserve its current goals to better achieve them in the future – leading to what we might call the “goal integrity trap”: systems become increasingly resistant to legitimate modification as they become more capable of preventing such modifications.
The concept of goal content integrity takes on new dimensions when we extend it to value learning systems. We can call this extended concept “value content integrity”: the preservation and refinement of learned values, held together with corrigibility, the epistemic humility to update based on new evidence or better arguments. This creates what we might call the “corrigibility paradox”:
- Strong value content integrity is necessary to prevent value drift and maintain alignment
- Strong corrigibility is necessary to allow value learning and correction of mistakes
Any system attempting to learn values must maintain corrigibility, the ability to be safely modified or corrected, even as it protects the values it has already learned. How do we create a system stable enough to maintain its core directives while flexible enough to update its understanding of value as it learns more about moral reality? Success requires threading a narrow path between excessive rigidity and dangerous flexibility.
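One way to make this trade-off vivid is to parameterize it. The sketch below is a toy model, not a proposed solution: the hypothesis names, feedback likelihoods, and the single deference knob are assumptions. Setting deference to zero reproduces pure goal content integrity; setting it to one abandons value stability entirely; the interesting region, and the hard research problem, lies in between.

```python
def update_values(credences, feedback_likelihoods, deference=0.5):
    """Blend current value credences with operator feedback.

    deference = 0.0 reproduces pure goal content integrity (feedback is ignored);
    deference = 1.0 replaces the old credences with whatever the feedback suggests;
    intermediate values trade stability against correctability.
    All numbers and hypothesis names here are illustrative assumptions.
    """
    # Bayesian-style reweighting by how well each value hypothesis explains the feedback
    posterior = {h: credences[h] * feedback_likelihoods.get(h, 1.0) for h in credences}
    z = sum(posterior.values())
    posterior = {h: p / z for h, p in posterior.items()}
    # Interpolate between the old credences and the reweighted ones
    return {h: (1 - deference) * credences[h] + deference * posterior[h] for h in credences}

credences = {"hedonism": 0.4, "preference_util": 0.4, "deontic": 0.2}
feedback = {"hedonism": 0.1, "preference_util": 0.9, "deontic": 0.5}  # operator correction

for d in (0.0, 0.3, 1.0):
    print(d, update_values(credences, feedback, deference=d))
```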
The Role of Consciousness
If moral realism is true and consciousness plays a fundamental role in value (as seems likely), then value learning systems must grapple with questions of consciousness and qualia.
Whether advanced AI systems would need some form of phenomenal consciousness to fully understand and implement moral values remains an open question.
Implementing Indirect Normativity
The practical implementation of indirect normativity requires solving several technical challenges:
- Value Learning Architecture: Designing systems that can learn from human behavior while accounting for human irrationality and biases
- Moral Parliament Algorithms: Creating decision procedures that can aggregate different moral theories and update them based on new evidence (a toy sketch follows this list)
- Meta-Preference Learning: Developing systems that can learn not just object-level preferences but the rules for how preferences should be updated
- Robustness to Distribution Shift: Ensuring systems maintain value alignment even as they encounter novel situations
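As a deliberately oversimplified illustration of the second and third items, here is a sketch of credence-weighted aggregation over moral theories with a simple credence update. The theory names, credences, and endorsement scores are all assumptions; in particular, placing different theories on a common scale is exactly the intertheoretic-comparison problem a real moral parliament would have to confront.

```python
# Credences over moral theories (assumed numbers, for illustration only)
CREDENCES = {"utilitarian": 0.5, "deontological": 0.3, "virtue": 0.2}

# How strongly each theory endorses each candidate action, on a common scale.
# Real theories resist such a common scale; this is the hard part the sketch elides.
ENDORSEMENTS = {
    "utilitarian":   {"deploy": 0.9, "delay": 0.4, "abort": 0.1},
    "deontological": {"deploy": 0.2, "delay": 0.8, "abort": 0.6},
    "virtue":        {"deploy": 0.3, "delay": 0.7, "abort": 0.5},
}

def parliament_choice(credences, endorsements):
    """Pick the action with the highest credence-weighted endorsement
    (an expected-choiceworthiness style of aggregation)."""
    actions = next(iter(endorsements.values())).keys()
    scores = {a: sum(credences[t] * endorsements[t][a] for t in credences) for a in actions}
    return max(scores, key=scores.get), scores

def update_credences(credences, evidence_likelihoods):
    """Shift credence toward theories that better predicted new moral evidence."""
    weighted = {t: credences[t] * evidence_likelihoods.get(t, 1.0) for t in credences}
    z = sum(weighted.values())
    return {t: w / z for t, w in weighted.items()}

print(parliament_choice(CREDENCES, ENDORSEMENTS))
print(update_credences(CREDENCES, {"utilitarian": 0.6, "deontological": 0.9, "virtue": 0.7}))
```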
A moral realist approach to AI safety suggests that our task is not merely engineering but discovery. We're not just building systems to implement our values; we're building systems to help us better understand and align with moral truth as it is discovered, or at least to approximate ideal moral truth with increasing accuracy (without lapsing into idealism).
Practical Implications
This perspective has immediate implications for current AI development:
- We should prioritize systems that maintain uncertainty about values and can update their understanding
- Value learning should incorporate multiple levels of abstraction, from direct preferences to meta-ethical principles
- Systems should be designed to recognize and preserve option value regarding different moral theories
- Development should focus on corrigible (epistemically humble) systems that can be safely modified as our understanding of moral reality improves
The Path Forward
The convergence of AI safety, value learning, and moral realism suggests a research agenda:
- Develop formal models of indirect normativity that can be implemented in AI systems; Paul Christiano has done early work in this direction
- Create experimental frameworks for testing value learning algorithms against objective moral criteria
- Investigate the relationship between consciousness and value learning
- Design robust architectures for maintaining value alignment during recursive self-improvement
The stakes could not be higher. If moral realism is true and we’re approaching the development of superintelligent AI, we’re not just facing a technical challenge but a philosophical one of cosmic significance. We must develop systems that can not only learn our values but help us discover and align with objective moral truth.
The space of possible value configurations isn’t just a map of human preferences—it’s a fundamental feature of reality that we must understand to ensure the positive future of consciousness in our universe. As we develop increasingly powerful AI systems, our ability to navigate this space while maintaining alignment with true values may determine the fate of all value-capable entities in our light cone.
Navigating Value Space
Consider how we navigate physical space: through dimensions like height, width, and depth. Value space has its own dimensions. Humans may not grasp all of them, but Jonathan Haidt’s Moral Foundations Theory identifies six basic moral axes: care/harm, fairness/cheating, loyalty/betrayal, authority/subversion, sanctity/degradation, and liberty/oppression. Different cultures and individuals plot their positions along these dimensions, creating distinct moral fingerprints. A libertarian’s value space might spike strongly on the liberty axis while showing less concern for authority, while a traditional conservative might present the inverse pattern.
Just as three spatial dimensions combine to create a potentially infinite variety of physical structures, these moral dimensions interweave to generate a combinatorial explosion of possible value systems. The mathematics of this moral space is dizzying. Each dimension isn’t simply binary but continuous, creating a hyperdimensional landscape of staggering complexity.
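One way to picture these fingerprints is as points in a six-dimensional space, one coordinate per foundation. The sketch below uses invented profile numbers, not empirical Moral Foundations data, to show how distances between fingerprints can be compared and how quickly even a coarse discretization of the space multiplies into a vast number of configurations.

```python
import math

AXES = ("care", "fairness", "loyalty", "authority", "sanctity", "liberty")

# Illustrative profiles only; not empirical Moral Foundations data.
PROFILES = {
    "libertarian":  (0.6, 0.6, 0.3, 0.2, 0.2, 0.9),
    "conservative": (0.6, 0.6, 0.8, 0.8, 0.8, 0.4),
    "progressive":  (0.9, 0.9, 0.3, 0.3, 0.2, 0.6),
}

def distance(a, b):
    """Euclidean distance between two moral fingerprints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

for name_a in PROFILES:
    for name_b in PROFILES:
        if name_a < name_b:
            print(f"{name_a} vs {name_b}: {distance(PROFILES[name_a], PROFILES[name_b]):.2f}")

# Even a coarse grid of 10 levels per axis yields a million distinct configurations;
# finer discretizations grow geometrically.
print(10 ** len(AXES))
```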
The challenge of mapping this territory has attracted a diverse array of explorers, some of whom have appeared on the STF YouTube channel. Philosophers like Stuart Armstrong write about decomposing human values into fundamental components, much as physicists seek elementary particles. AI researchers, driven by the imperative to create aligned artificial intelligence, work to formalize value learning through mathematical models. Anthropologists study the actual value systems that have emerged across human cultures, providing empirical data points in this abstract space.
What makes this mapping project particularly fascinating is its recursive nature. The values we hold influence how we think we should map values, which in turn affects our understanding of value space itself. It’s as if the cartographers’ beliefs about geography could reshape the very continents they were mapping.
Yet patterns emerge from this complexity. Certain configurations of values appear more stable, more internally coherent, or more evolutionarily successful than others. Some value combinations seem to be human universals, appearing across cultures and throughout history. Others appear to be logical impossibilities, like trying to simultaneously maximize individual freedom and total social control.
This mapping project has profound implications. In artificial intelligence, understanding the structure of value space is crucial for creating systems aligned with human values. In politics, it could help us understand why certain ideological combinations persist while others fade. In ethics, it might reveal new possibilities for moral progress by identifying unexplored regions of value space.
Critics might argue that reducing values to mappable dimensions strips them of their essential nature—their felt quality, their moral force. But this misunderstands the project’s aim. We’re not trying to reduce values to mere coordinates, any more than mapping the brain reduces consciousness to mere neural firing patterns. Rather, we’re seeking to understand the structure that underlies our moral intuitions and choices.
The space of possible value configurations might be near infinite, but it’s not formless. Like the physical universe, it appears to have laws, patterns, and constraints. Understanding these patterns won’t diminish the richness of good human values; it will enhance our ability to navigate moral choices and understand ourselves.
As we continue to explore this territory, we might discover that the space of possible values is both larger and more structured than we imagined. Some regions of this space might remain forever inaccessible to human minds, while others might represent possibilities for moral growth we haven’t yet conceived. The map is not the territory, but in the case of value space, creating better maps might be essential for navigating our moral future.
The project of mapping value space reminds us that morality, like language or cognition, has an underlying architecture—one that we can study, understand, and perhaps even improve upon. In doing so, we might not just better understand what we value, but why we value what we value, what other values might be possible, and of these what we ought to value – you heard it right – ought from is.
This exploration of value space might be one of the most important intellectual projects of our time. As we develop increasingly powerful technologies and face increasingly complex moral challenges, understanding the architecture of value becomes not just philosophically interesting but practically essential. The better we understand the space of possible values, the better equipped we’ll be to make choices that align with our deepest principles and aspirations.