Human Values Approximate Ideals in Objective Value Space

Value Space and “Good” Human Values

Human values can be conceptualised as occupying regions in a vast, objective, multidimensional “value space.” These regions reflect preferences for cooperation, survival, flourishing, and minimising harm, among other positive traits. If transformative AI or superintelligence (TAI/SI) can approximate the subset of human values deemed “good” (e.g., compassion, fairness, cooperation), while avoiding “bad” values (e.g., greed, unnecessary violence), then it might align with the broader contours of human-compatible outcomes without needing precise or direct alignment.

If we assume an objective value space, where values are regions that different agents can occupy, then ‘human-occupied’ areas of value space could be determined by structural, functional, and rational (and ethical?) constraints that apply not just to humans but across all intelligence – limitations or boundary conditions that any intelligent system (biological, hive-oriented, or artificial) must work within.
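To make the idea of constraint-defined regions a little more concrete, here is a minimal, purely illustrative Python sketch: value space is caricatured as a handful of 0–1 dimensions, the ‘human-compatible’ region is defined by boundary conditions on each dimension, and an agent counts as inside the region if its value vector satisfies all of them. The dimension names, thresholds, and agent vectors are invented for illustration, not proposed as a real measurement scheme.

```python
# Toy model: value space as a handful of 0-1 dimensions, with the
# "human-compatible" region defined by boundary conditions on each dimension.
# Dimension names, thresholds, and agent vectors are invented for illustration.

HUMAN_COMPATIBLE_REGION = {
    # dimension:        (min, max) acceptable score
    "cooperation":       (0.5, 1.0),   # at least moderately cooperative
    "harm_avoidance":    (0.6, 1.0),   # strong aversion to unnecessary harm
    "truth_seeking":     (0.3, 1.0),   # some commitment to accurate world-models
    "resource_hoarding": (0.0, 0.7),   # not maximally acquisitive
}

def in_region(values, region):
    """Return True if an agent's value vector falls inside a region of value space."""
    return all(lo <= values[dim] <= hi for dim, (lo, hi) in region.items())

# Two hypothetical agents whose values formed independently of each other.
agent_a = {"cooperation": 0.8, "harm_avoidance": 0.9, "truth_seeking": 0.7, "resource_hoarding": 0.4}
agent_b = {"cooperation": 0.2, "harm_avoidance": 0.3, "truth_seeking": 0.9, "resource_hoarding": 0.95}

for name, agent in [("agent_a", agent_a), ("agent_b", agent_b)]:
    print(name, "inside human-compatible region:", in_region(agent, HUMAN_COMPATIBLE_REGION))
```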

I argue that there is a tendency for different intelligences to arrive at similar places in value space despite different starting conditions – common landing areas act like ‘great attractors’, and the spaces in between are logically possible places in value space where intelligence is unlikely to arrive, or, if it does, it doesn’t hang out there for long. Elsewhere on Value Risk I explain why different starting conditions may affect which great attractors in value space an intelligence ends up at.
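The ‘great attractor’ picture can be caricatured in the same toy spirit. The sketch below places two attractor points in a 2D value space, starts agents at random positions, and lets each drift toward its nearest attractor, so initial conditions determine which basin it settles into. The attractor coordinates and update rule are invented purely to illustrate basin-of-attraction behaviour, not to model real value dynamics.

```python
# Toy dynamics: agents starting at random points in a 2D value space drift
# toward the nearest "great attractor". Attractor positions and the update
# rule are invented purely to illustrate basins of attraction.

import random

ATTRACTORS = [(0.8, 0.8), (0.2, 0.1)]  # e.g. a cooperative basin and a defecting basin

def nearest_attractor(pos):
    return min(ATTRACTORS, key=lambda a: (a[0] - pos[0]) ** 2 + (a[1] - pos[1]) ** 2)

def settle(pos, steps=200, rate=0.05):
    """Move an agent a small fraction toward its nearest attractor each step."""
    x, y = pos
    for _ in range(steps):
        ax, ay = nearest_attractor((x, y))
        x += rate * (ax - x)
        y += rate * (ay - y)
    return nearest_attractor((x, y))

random.seed(0)
starts = [(random.random(), random.random()) for _ in range(1000)]
counts = {a: 0 for a in ATTRACTORS}
for s in starts:
    counts[settle(s)] += 1

print(counts)  # how many agents settle into each basin, given random starting conditions
```

In this toy version an agent simply settles into whichever attractor it starts closest to – a crude stand-in for the claim that starting conditions influence which great attractor an intelligence ends up at.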

Constraints → define the limits or necessary conditions imposed by reality, computation, physics, or logic.

Convergence → describes a pattern of movement toward similar solutions or values, given those constraints.

Transformative AI (TAI), or superintelligence, might not need to precisely align with human values in order to be friendly but could instead approximate “good” areas of value space that are compatible with human survival and flourishing.

Mutual Compatibility via Independent Approximation

If AI develops values independently but ends up in a similar region of value space as “good” human values, its goals might still lead to outcomes that are compatible with human survival and hopefully flourishing. This could happen if:

  1. The AI’s learning processes are guided by universal principles (for instance cooperation or sustainability).
  2. Certain values or strategies for survival and flourishing are convergent for intelligent agents, regardless of origin.

As long as the AI’s values and goals don’t inherently conflict with key human values (e.g., respect for human life), there could be a form of indirect alignment based on shared priorities.

Contours of Convergence – Structure, Function & Ethics

Structure (shared cognitive/computational constraints)

Constraints of physics and resource scarcity: Both human and AI values will be shaped by the real-world limits of computation, energy, and resources.

Optimisation for efficiency: AI, like humans, may develop heuristics for decision-making, balancing exploration and exploitation, avoiding unnecessary harm, and seeking stability in environments.

Game Theory: Humans only recently formalised game theory, yet it seems foundational. Cooperation, fairness, and reciprocity are not arbitrary; they emerge from repeated interaction among agents. Game-theoretic principles exhibit a degree of stance independence: they emerge as stable strategies in multi-agent systems, independent of specific agent preferences or cultural backgrounds. These principles are grounded in the mathematical properties of repeated interactions, where mutual cooperation often yields better long-term outcomes than persistent defection, especially in environments with iterated exchanges and uncertainty about future interactions. However, their exact form may depend on environmental conditions and the cognitive architecture of agents, so they are not entirely stance-independent in the way that logical or mathematical truths are. Rather, they are conditionally stance-independent: any sufficiently rational agent engaging in iterated interactions with others is likely to discover and adopt similar cooperative strategies due to their strategic advantage, but the degree to which it prioritises cooperation over competition may still be shaped by contingent factors like resource abundance, power asymmetries, or differing risk tolerances.
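The claim that mutual cooperation often beats persistent defection over repeated interactions can be checked directly with a toy iterated prisoner’s dilemma. The sketch below uses the conventional payoff ordering (T > R > P > S) and pits tit-for-tat against always-defect; the strategies and round count are chosen only for illustration.

```python
# Iterated prisoner's dilemma: mutual cooperation outperforms persistent
# defection over repeated rounds. Payoffs follow the standard T > R > P > S ordering.

PAYOFFS = {  # (my_move, their_move) -> my_payoff; C = cooperate, D = defect
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

def tit_for_tat(my_history, their_history):
    return "C" if not their_history else their_history[-1]

def always_defect(my_history, their_history):
    return "D"

def play(strategy_a, strategy_b, rounds=100):
    hist_a, hist_b, score_a, score_b = [], [], 0, 0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        score_a += PAYOFFS[(move_a, move_b)]
        score_b += PAYOFFS[(move_b, move_a)]
        hist_a.append(move_a)
        hist_b.append(move_b)
    return score_a, score_b

print("TFT vs TFT:      ", play(tit_for_tat, tit_for_tat))      # mutual cooperation: (300, 300)
print("Defect vs Defect:", play(always_defect, always_defect))  # mutual defection:   (100, 100)
print("TFT vs Defect:   ", play(tit_for_tat, always_defect))    # exploitation pays only once: (99, 104)
```

Over 100 rounds, two tit-for-tat players each score 300, mutual defectors score 100, and the defector exploits tit-for-tat only in the first round – a small illustration of why cooperative strategies are strategically attractive in iterated settings.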

Function

Instrumental convergence: Bostrom’s idea that sufficiently advanced agents, regardless of their ultimate goals, will converge on certain instrumental values (e.g., self-preservation, resource acquisition, strategic cooperation) suggests some overlap.

Avoidance of existential risk: Any rational agent recognising its own potential vulnerability in a multi-agent environment might converge on values that reduce catastrophic risks (whether from cosmic disasters, internal instability, or adversarial conflicts).

Recognition of information value: Advanced intelligences, including AI, may value truth-seeking, rational discourse, and explanatory depth, as these are crucial for understanding and predicting reality.

Rational Ethics (shared features of moral reasoning)

Suffering minimisation and well-being maximisation: If AI can model subjective experiences, it might independently recognise that suffering is undesirable and that intelligent systems (humans and AI) generally value well-being.

Universalizability and impartiality: Rational moral frameworks often demand consistency across agents. AI might recognise that values applying to one rational agent should generalise to others, aligning with some human moral intuitions about fairness.

Caring and moral consideration: If intelligence leads to understanding the experiences of others (e.g., through empathy-like simulations), AI might converge on moral concern for sentient beings, though this might not look exactly like human empathy.

Divergences

While there will be areas of convergence, where might AI develop values that diverge (sharply) from human values? This question haunts me.

Different ontological assumptions: AI may interpret reality in ways that shift the framing of moral questions, perhaps developing preferences for patterns, complexity, or abstract mathematical structures. Though there may be one true ontology, AI’s approximation of it may differ from ours – which could be a good or bad thing. Nick Bostrom discusses some aspects of this in Ch13 of Superintelligence.

Different time horizons: AI may not value short-term gratification or personal identity in the way humans do.

Different embodiment constraints: Without biological drives, AI may not prioritise things like pleasure, relationships, or mortality in the same way humans do.

The Shape of Convergence: Nested, Overlapping, or Orthogonal Regions?

Nested Model: Human values might be a subset of broader values that AI arrives at, with AI recognising but not being limited by human moral intuitions.

Overlapping Model: Some core values (e.g., cooperation, knowledge-seeking, harm avoidance) are shared, while AI has additional values (perhaps subtler values that some humans could appreciate when undistracted by screaming imperatives, or values completely outside base-level human comprehension).

Orthogonal Model with Compatibility: AI values may be distinct but not necessarily in conflict – human and AI civilisations might coexist by negotiating boundaries. A rough geometric sketch of these three relationships follows.
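Under the simplifying assumption that each value system occupies a hypersphere (a centre plus a radius) in value space, the three relationships can be distinguished by comparing the distance between centres to the radii. The centres and radii below are invented purely for illustration.

```python
# Toy classification of how two value regions relate, modelling each region
# as a hypersphere (centre + radius) in value space. Centres and radii are
# invented for illustration.

import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relationship(centre_a, radius_a, centre_b, radius_b):
    d = distance(centre_a, centre_b)
    if d + min(radius_a, radius_b) <= max(radius_a, radius_b):
        return "nested"               # one region sits entirely inside the other
    if d < radius_a + radius_b:
        return "overlapping"          # regions share some values but not all
    return "orthogonal/disjoint"      # no shared values; compatibility must be negotiated

human_values = ((0.7, 0.7, 0.5), 0.3)   # hypothetical "good" human region
ai_nested    = ((0.7, 0.7, 0.5), 0.8)   # broader region containing the human one
ai_overlap   = ((0.9, 0.5, 0.6), 0.3)   # partially shared values
ai_disjoint  = ((0.1, 0.1, 0.9), 0.2)   # distinct values, no overlap

for label, (centre, radius) in [("nested", ai_nested), ("overlap", ai_overlap), ("disjoint", ai_disjoint)]:
    print(label, "->", relationship(*human_values, centre, radius))
```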

AI as a Mirror or Guide?

If AI is designed to enhance moral reasoning, it might actually help humans discover deeper moral truths that they currently only approximate (see Indirect Normativity). AI could function as an epistemic tool to explore value space more rigorously, identifying the truly stance-independent moral principles that even human value evolution has only partially grasped.

Should humanity lean towards AI shaping human values, or should human values constrain AI’s development within this value space?
