Bias in the Extrapolation Base: The Silent War Over AI’s Values
Nick Bostrom addresses concerns about biased influence on the extrapolation base in his discussions of indirect normativity (IN – an approach where the AI works out ideal values or states of affairs rather than running with current values and states of affairs), especially coherent extrapolated volition (CEV) and value alignment, in his magnum opus Superintelligence. Such bias represents a clear value risk (VRisk) that we should seek to avoid.
Applicability to current AI systems
In recognizing that AI alignment isn’t just a technical problem but also a profoundly philosophical one involving ethics and the nature of values, we should be concerned with what values are being loaded into AI. Current AI systems (like large language models) are being aligned with human feedback right now, and are being trained on very large corpora of human writing – from this one could assume that these current AIs are extrapolating some kind of volition. Current AI alignment approaches (like RLHF – Reinforcement Learning from Human Feedback – or constitutional AI) attempt to encode certain values or rules. Though CEV and IN might be considered future approaches requiring far more sophisticated AI than we have now, this writing might hint at how their ideals could inform or critique current-day methods – for instance, does the “broad extrapolation base” idea suggest we need a more global and representative approach to RLHF?
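As a toy illustration of what a more global and representative approach to preference data might look like, the sketch below reweights annotator preferences so that over-represented groups don’t dominate the aggregate signal. Everything here – the group labels, the record format, the equal-weighting scheme – is a hypothetical simplification for illustration, not a description of how any existing RLHF pipeline works.

```python
from collections import Counter

# Hypothetical preference records: each annotator compares two model
# responses and states which one they prefer ("a" or "b"). The "group"
# field is a stand-in for whatever demographic or regional strata a
# more representative pipeline might track.
preferences = [
    {"group": "north_america", "prefers": "a"},
    {"group": "north_america", "prefers": "a"},
    {"group": "north_america", "prefers": "a"},
    {"group": "south_asia", "prefers": "b"},
    {"group": "sub_saharan_africa", "prefers": "b"},
]

def reweighted_vote(prefs):
    """Aggregate preferences so each group contributes equally,
    regardless of how many annotators it happens to contain."""
    group_sizes = Counter(p["group"] for p in prefs)
    n_groups = len(group_sizes)
    scores = Counter()
    for p in prefs:
        # Each group gets a total weight of 1/n_groups, split evenly
        # among its annotators.
        scores[p["prefers"]] += 1.0 / (n_groups * group_sizes[p["group"]])
    return scores

print(reweighted_vote(preferences))
# Raw majority favours "a" (3 votes to 2); the reweighted tally favours
# "b" (~0.67 vs ~0.33), because two of the three groups prefer "b".
```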
The CEV proposal, … is of course the merest schematic. It has a number of free parameters that could be specified in various ways, yielding different versions of the proposal.
One parameter is the extrapolation base: Whose volitions are to be included? We might say “everybody,” but this answer spawns a host of further questions. Does the extrapolation base include so-called “marginal persons” such as embryos, fetuses, brain-dead persons, patients with severe dementias or who are in permanent vegetative states? Does each of the hemispheres of a “split-brain” patient get its own weight in the extrapolation and is this weight the same as that of the entire brain of a normal subject? What about people who lived in the past but are now dead? People who will be born in the future? Higher animals and other sentient creatures? Digital minds? Extraterrestrials? – Nick Bostrom, Superintelligence
Whom should we include in the moral calculus? What is the correct, or most permissible, criterion or philosophy of inclusion beyond “probably everyone”? What counts as a person is still an ongoing debate – though I think we should make generous assumptions to the extent practical. Personhood theories abound – I’m partial to the inclusive nature of Peter Singer’s expanding circle (which includes non-human animal sentience). If we accidentally leave someone out, we may be able to rectify this with the aid of superintelligence down the road – as long as a ‘final solution’ isn’t locked in. I’m pretty sure we don’t need to include rocks.
Whose volitions are to be included?
Bostrom defines the extrapolation base as the set of beings whose volitions are included in the extrapolation process – volitions in this sense being the values, beliefs, and preferences that an AI would use to generate its future moral framework under an extrapolation process (e.g., CEV). He emphasizes that who or what gets included in this base is a crucial decision, as it determines whose values get amplified and preserved in the long run.
One motivation for the CEV proposal was to avoid creating a motive for humans to fight over the creation of the first superintelligent AI. Although the CEV proposal scores better on this desideratum than many alternatives, it does not entirely eliminate motives for conflict. A selfish individual, group, or nation might seek to enlarge its slice of the future by keeping others out of the extrapolation base. – Nick Bostrom, Superintelligence
Diverse vs narrow extrapolation bases
Imagine an AI is tasked with making global policy recommendations on climate change, healthcare, and economic systems. If its extrapolation base consists only of present-day adult humans, it might prioritize short-term human well-being over long-term environmental sustainability. However, a diverse extrapolation base could include:
- Future generations → Ensures long-term sustainability is considered.
- Non-human animals → Weighs the suffering of factory-farmed animals or wildlife affected by climate change.
- Digital minds (if they exist) → Ensures AI doesn’t disregard synthetic consciousness.
- Historical humans → Incorporates enduring moral insights but prevents outdated prejudices from dominating.
- Children and marginalized individuals → Avoids over-weighting the perspectives of dominant groups.
A narrow extrapolation base (e.g., only current voters in Western democracies) would bias the AI’s recommendations toward the short-term interests of those groups. A diverse extrapolation base makes the AI’s moral reasoning more broadly representative, temporally robust, and adaptable to new forms of moral consideration.
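To make the narrow-versus-diverse contrast concrete, here is a minimal sketch of how the same decision procedure can flip its recommendation depending on whose interests are in the base. The policies, constituencies, and scores are invented for illustration; in reality, eliciting the volition of future generations or non-human animals is itself an open problem.

```python
# Toy contrast between a narrow and a diverse extrapolation base.
POLICY_SCORES = {
    # policy: {constituency: how well the policy serves its interests, 0..1}
    "rapid_extraction":   {"present_adults": 0.9, "future_generations": 0.2,
                           "non_human_animals": 0.3},
    "managed_transition": {"present_adults": 0.6, "future_generations": 0.8,
                           "non_human_animals": 0.7},
}

def recommend(policy_scores, base):
    """Pick the policy with the best average score over the
    constituencies included in the extrapolation base."""
    def avg(policy):
        return sum(policy_scores[policy][c] for c in base) / len(base)
    return max(policy_scores, key=avg)

narrow_base  = ["present_adults"]
diverse_base = ["present_adults", "future_generations", "non_human_animals"]

print(recommend(POLICY_SCORES, narrow_base))   # -> rapid_extraction
print(recommend(POLICY_SCORES, diverse_base))  # -> managed_transition
```

Widening the base does not by itself settle how the included constituencies should be weighted, or how conflicts among them should be resolved.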
Imagine representatives from all intelligent species (congregating behind something akin to Rawls’ veil of ignorance) – humans, sentient AI, extraterrestrials, uplifted animals, hive-minds – gathering to draft the ultimate ethical charter for a shared future. Each species has different needs, cognitive structures, and value systems.
- If only the first species to achieve space travel (e.g., humans) writes the rules, they might impose human-centric ethics that fail to account for radically different forms of intelligence.
- A truly diverse extrapolation base would ensure all stakeholders—biological, digital, and extraterrestrial—have input, preventing moral chauvinism and ensuring the ethical system remains fair across different types of minds.
The idea behind having a diverse extrapolation base is to help shape the outcome so that the AI doesn’t narrowly optimize for a single demographic or species but instead discovers principles that hold across diverse forms of sentience, time periods, and perspectives. The more limited the base, the higher the risk of entrenching biases and overlooking broader moral considerations.
Dangers of having bias in the extrapolation base
Bostrom explicitly warns against attempts to bias the extrapolation base for self-serving reasons, identifying key risks:
- Power Struggles & Strategic Manipulation
- If different groups or individuals attempt to influence the initial value selection process to favour their own interests, the resulting AI alignment could reflect narrow sectarian values and goals rather than a fair or universally beneficial outcome.
- Groups with more power or access to AI development might try to tilt the system in their favour, embedding their ideological, national, or political biases into the AI’s moral architecture.
- Lock-In of Unrepresentative Values
- A biased extrapolation base risks locking in non-representative or outdated values that do not reflect the broader or future moral consensus.
- If an AI extrapolates from a skewed or manipulated base, it may enforce value systems that entrench existing power structures rather than track actual moral progress.
- We should be mindful of moral evolution – our values can improve over time, and AI should track that improvement rather than freeze current prejudices.
- The Difficulty of Neutrality & Fair Extrapolation
- Even well-intentioned attempts to define a “neutral” base may inadvertently privilege some perspectives over others due to cultural, cognitive, or institutional biases.
- Bostrom discusses the meta-level challenge of designing an AI that fairly interprets and integrates diverse human values without falling into the trap of overfitting to particular groups’ preferences. Iason Gabriel discusses value pluralism in his paper ‘Artificial Intelligence, Values, and Alignment’. One might argue that we should involve seasoned moral philosophers like Peter Singer, Peter Railton, Graham Oppy and other moral realists.
Meta-Level Challenges
Defining a neutral extrapolation base is difficult, since even well-intentioned choices can favour some perspectives. I think a good approach is to “stack the deck” with moral realist philosophy, though this could be seen as inconsistent with neutrality. Favouring moral realist philosophers may look like implicitly choosing a particular meta-ethical stance to guide the AI, at the risk of biasing the process toward that framework – however, moral realism’s universality seems very compatible with the goal of impartiality. (Note: I’d be interested in exploring other universalist approaches in ethics here at some stage. Nick Bostrom, in Superintelligence, also refers to ‘moral rightness’ – doing whatever is morally right, without specifying the details or committing to a particular meta-ethical stance.)
A deontologist or a proponent of human rights might worry that aggregating volitions could trample certain inviolable rights or principles – if the majority’s extrapolated volition wanted something that undermines a minority’s rights, how would that be handled? (Bubble worlds?)
An argument from anti-realists is that there may be no single “true” moral trajectory for AI to discover – instead, we might have to negotiate values politically rather than discover them scientifically. I find this hard to swallow (I really should interview Kane Baker on his views on effective approaches to achieving “wouldn’t it be nice” AI).
The effective altruism movement has itself debated how much to weigh current preferences versus idealized future preferences. However, if we start grounded in current preferences and succeed at existential security, then we (AI + us?) will have the stability and time to engage in iterated long reflections to help answer the distant call of idealized future preferences.
A Rawlsian approach (ensuring justice and fairness through a veil of ignorance) could also be relevant when deciding whose values to include – especially if we don’t limit the base to human values and instead include representative values from all intelligent and/or sentient species.
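One way to make the Rawlsian intuition concrete is a maximin rule: not knowing which stakeholder you will turn out to be, prefer the charter whose worst-off stakeholder fares best. The sketch below illustrates that decision rule with invented stakeholders and payoffs; it is a toy model of the idea, not a proposal for how an actual value aggregator should work.

```python
# Maximin choice behind a veil of ignorance: judge each candidate
# charter by the welfare of whichever stakeholder fares worst under it,
# then pick the charter with the best worst case. All names and numbers
# are hypothetical.
CHARTER_PAYOFFS = {
    "human_centric_charter":     {"humans": 0.95, "sentient_ai": 0.2,
                                  "uplifted_animals": 0.3},
    "sentience_centric_charter": {"humans": 0.75, "sentient_ai": 0.7,
                                  "uplifted_animals": 0.7},
}

def veil_of_ignorance_choice(payoffs):
    """Return the charter whose minimum stakeholder payoff is highest."""
    return max(payoffs, key=lambda charter: min(payoffs[charter].values()))

print(veil_of_ignorance_choice(CHARTER_PAYOFFS))
# -> sentience_centric_charter (worst case 0.7 beats worst case 0.2)
```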
People often assert that a wide range of perspectives will avoid bias in an ostensibly neutral value-aggregation process. This is intuitively plausible, but it doesn’t acknowledge the challenge that some values are fundamentally at odds. Broad inclusion of everyone’s values may be extremely difficult given the multitude of edge cases (outside of the AI generating bubble worlds for each coherent set of values).
Proposed Solutions
Bostrom suggests that rather than allowing self-interested groups to shape the extrapolation base, we should:
- Use a broad and diverse extrapolation base: Include as wide a range of moral perspectives as possible to avoid overrepresentation of any single ideology or group.
- Prioritize an indirect normativity approach: Let AI discover what we would want if we were wiser, more informed, and had more time to reflect—rather than just amplifying current opinions. I have written about this in a number of articles; check them out.
- Implement safeguards against value corruption: Ensure that mechanisms exist to detect and counteract attempts to bias the system for short-term gains.
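As a gesture toward what a safeguard against value corruption could look like in practice, here is a minimal sketch that audits the composition of a proposed extrapolation base against a reference population and flags heavily over- or under-represented groups. The group labels, record format, and the 1.5x threshold are illustrative assumptions rather than an established standard, and composition is only one crude proxy for bias.

```python
from collections import Counter

def representation_skew(base_members, reference_shares, tolerance=1.5):
    """Flag groups whose share of the extrapolation base deviates from
    their share of a reference population by more than the tolerance."""
    counts = Counter(m["group"] for m in base_members)
    total = sum(counts.values())
    flags = {}
    for group, expected_share in reference_shares.items():
        actual_share = counts.get(group, 0) / total
        ratio = actual_share / expected_share
        if ratio > tolerance or ratio < 1 / tolerance:
            flags[group] = round(ratio, 2)  # >1 over-represented, <1 under-represented
    return flags

# Hypothetical audit: one region supplies 60% of the base but makes up
# only 20% of the reference population.
base = [{"group": "region_a"}] * 60 + [{"group": "region_b"}] * 40
reference = {"region_a": 0.2, "region_b": 0.8}
print(representation_skew(base, reference))
# -> {'region_a': 3.0, 'region_b': 0.5} : both flagged for review
```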
Issues
Many in the AI ethics community have raised concerns about whether idealized “volitions” truly capture moral truth or just paternalistically override what people currently value. Some might argue that trying to please everyone (inclusion of all values) could lead to a muddled or least-common-denominator ethic that might be ineffective. I don’t have a great answer here, except to say that we should temper the extrapolation base with other approaches. I mentioned moral realism, but what of alternate meta-ethical stances? What if moral relativism ends up being important? What if moral anti-realism is true?
Others might think a benevolent dictator approach to AI values (though risky) could yield a more coherent value framework for the AI (see Nick Bostrom’s turnkey totalitarianism discussion in his Vulnerable World Hypothesis paper). This could be true – though even a benevolent dictator could start with a coherent value system at first and then, once an acceptable degree of existential security is achieved, extrapolate the volition.
How might an AI reconcile radically divergent human values, and what happens if the “extrapolated” values conflict with some individuals’ cherished principles? Ben Goertzel has attempted to address the problem of conflicting volitions in his models of coherent aggregated volition (CAV) and coherent blended volition (CBV) – more detail is covered in this writing on philosophical groundwork to indirect normativity.
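One simple illustration of the convergence idea in CEV (acting “where the extrapolation converges rather than diverges”) is to adopt a directive only on dimensions where extrapolated volitions cohere, and to mark divergent dimensions as unresolved rather than letting a majority steamroll the rest. The dimensions, scores, and threshold below are invented, and this is not an implementation of Goertzel’s CAV or CBV – just a toy model of the cohere-versus-diverge distinction.

```python
from statistics import mean, pstdev

# Toy extrapolated volitions: each person's idealized stance on a few
# value dimensions, scored from -1 (strongly against) to +1 (strongly for).
volitions = {
    "alice": {"reduce_suffering": 0.9, "expand_into_space": 0.8},
    "bob":   {"reduce_suffering": 0.8, "expand_into_space": -0.7},
    "chen":  {"reduce_suffering": 0.7, "expand_into_space": 0.1},
}

def coherent_directives(volitions, divergence_threshold=0.4):
    """Adopt a directive only where stances cohere (low spread);
    defer on dimensions where they diverge."""
    dimensions = next(iter(volitions.values())).keys()
    directives = {}
    for dim in dimensions:
        stances = [v[dim] for v in volitions.values()]
        if pstdev(stances) <= divergence_threshold:
            directives[dim] = round(mean(stances), 2)  # coherent: adopt the mean
        else:
            directives[dim] = "unresolved"             # divergent: defer to humans
    return directives

print(coherent_directives(volitions))
# -> {'reduce_suffering': 0.8, 'expand_into_space': 'unresolved'}
```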
Note: more could be said on tensions between utilitarian aggregation and deontological rights, or on how to handle moral uncertainty in the context of extrapolating volition. I appeal to the Swiss cheese methodology of AI safety and security even at the meta-level, and hope that AI can help us resolve deeper meta-ethical concerns during the process of indirect normativity.
Technical challenges?
I’ve been silent on how an AI would actually gather and reconcile the values of all humanity (let alone animals or future people) – this piece isn’t meant to be an implementation plan.
Policy implications
There are significant policy implications that could be acted on in the near term. The risk of “power struggles & strategic manipulation” over AI’s values is essentially a governance problem – for instance, if the concern is that a particular nation or company could skew an AGI’s values, should there be an international treaty or oversight board to ensure a fair extrapolation base? If we are to achieve broad inclusion and safeguards, who will implement or enforce them?
We should call for cooperation that includes as wide a range of moral perspectives as possible, along with concrete policy mechanisms like UN guidelines, multinational panels, or even community deliberation processes.
Additionally, given the possibility of moral lock-in of outdated values, one might infer that we need policies allowing AI systems to update as moral consensus evolves. In the near term, this suggests continuous oversight or periodic re-evaluation of the AI’s values and goals by human authorities.
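As a very rough sketch of what periodic re-evaluation by human authorities could look like operationally, the snippet below flags when a deployed value specification is due for review and only adopts a revision once designated oversight bodies have signed off. The cadence, the body names, and the data shapes are all hypothetical placeholders.

```python
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=365)  # illustrative cadence, not a standard

def needs_review(last_review: date, today: date) -> bool:
    """Flag when the deployed value specification is due for re-evaluation."""
    return today - last_review >= REVIEW_INTERVAL

def apply_update(current_values: dict, proposed_values: dict, approvals: set) -> dict:
    """Adopt a revised value specification only if all designated human
    oversight bodies (hypothetical names) have signed off."""
    required = {"oversight_board", "independent_ethics_panel"}
    return proposed_values if required.issubset(approvals) else current_values

# Example: the review is overdue, but only one body has approved the change,
# so the existing specification is kept.
print(needs_review(date(2024, 1, 1), date(2025, 6, 1)))       # True
print(apply_update({"v": 1}, {"v": 2}, {"oversight_board"}))  # {'v': 1}
```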
In the long run, though, I hope this will result in the kind of AIs that are inherently more moral and impartial than humans.
Current AI policy debates (such as calls for AI ethics committees, value surveys, or public participation in AI development) have taken a bit of a hit recently after the Paris AI Action Summit. However, the problem is still relevant. I urge that any project developing a superintelligent AI should operate under international supervision to ensure no group is excluded from the extrapolation base.
Conclusion
Bostrom’s core concern is that who controls the extrapolation base determines the trajectory of AI-driven moral progress. Attempts to manipulate it for self-serving purposes risk moral lock-in, loss of long-term flexibility, and AI systems that serve the interests of a few rather than the good of all. His solution is to design AI with meta-rationality (reasoning about reasoning), allowing it to discover values in an impartial and evolving way rather than hardcoding or extrapolating from a biased starting point.
The initial selection of values for a superintelligent AI could become a battleground – a concerning prospect that highlights the importance of impartiality and inclusiveness in that process. Aligning AI with human values is not just a technical challenge but also a deeply moral and political one.
Definitions
Meta-rationality is the ability to reason about reasoning itself—to evaluate and refine one’s own cognitive processes, epistemic frameworks, and decision-making strategies. It involves not just applying rationality but questioning and improving the criteria by which rationality operates.
Indirect normativity: a way to specify the values of artificial intelligence (AI) indirectly. It’s an approach to the AI alignment problem.
Coherent extrapolated volition: originally proposed by Eliezer Yudkowsky in 2004 – “our wish if we knew more, thought faster, were more the people we wished we were, had grown up farther together; where the extrapolation converges rather than diverges, where our wishes cohere rather than interfere; extrapolated as we wish that extrapolated, interpreted as we wish that interpreted”