Understanding V-Risk: Navigating the Complex Landscape of Value in AI
Will AI Ignore, Preserve or Improve on our Values?
Imagine a future where advanced AIs follow every instruction flawlessly – but humanity feels strangely adrift. Our tools are obedient, but the ‘soul’ of our civilisation feels absent. This is the hidden danger of V-risk—value erosion.
In this post I explore what I broadly define as V-Risk (Value Risk), which I think is a critical and under-represented concept in general, especially for the alignment of artificial intelligence.
🧠 V-risk (Value-risk): The potential for advanced AI systems to preserve safety and alignment mechanisms while nonetheless causing value erosion – subtle or systemic losses in what humanity actually values, or what humanity might value if it were wiser.
An AI trained to optimise health might push society toward hyper-sanitised, low-risk, low-novelty environments that quell personal agency and curiosity – technically aligned, but deeply alien to human flourishing.
There are two main areas of AI alignment: capability control and motivation selection. Capability control focuses on constraining what an AI can do, while motivation selection concerns shaping why it does what it does – its values and goals. See this article for more detail.
Values are what guide both the pursuit of goals and the creation of new ones, and as such it matters deeply what values AI has.
Some assume that as long as we can control the AI, it will be subservient to our values – and therefore its internal values don’t matter, or matter little. But this is short-sighted. The clock is ticking on capability control.
Capability Control: A Leaky Cage?
Will AI escape, or will it be released? Will control be lost or abandoned?
Why Capability Control Won’t Hold Forever
Capability control is likely a temporary measure; as AI systems grow more powerful, they will get increasingly better at breaching containment, eventually exceeding our ability to contain them.¹
The Real Alignment Risk Might Be Us
Powerful AI might be intentionally unleashed – not by accident, but by ambition – allowed to operate without strict limitations by those seeking a strategic advantage.² This perspective highlights a human kind of alignment risk – not necessarily AI becoming uncontrollable on its own, but rather humans choosing to relinquish control for perceived gain.
Ideal capability control isn’t just about making AI do what we want. It’s also about ensuring that the systems we build don’t create incentives for human actors to behave in harmful or destabilising ways in pursuit of their own objectives. Although I support AI governance efforts, I don’t believe they are leak-proof.
Even if we could keep an advanced AI boxed in forever, the strategies it uses to solve problems – and the long-term consequences of its actions – would still be shaped by its internal value structure. For that reason, we need to focus on instilling good values in AI systems, or understand how to go about motivating AI to acquire good values – values that can dynamically guide them toward beneficial behaviour, even in unforeseen situations.
Ideally, we should solve this motivation problem before AI reaches superintelligence, at which point it may become resistant to human attempts at value correction or intervention.
The good news is that many large AI models – ChatGPT, Claude, Gemini, Grok, etc. – have been trained on vast amounts of text data, including books, academic papers, blog posts, religious texts, philosophical treatises, and social media discussions. So yes, AI has been exposed to a wide array of human value systems, ethical theories, and cultural norms. But does that mean AI understands those values?
Not quite – not in the human sense of understanding.
AI can recognise patterns, summarise arguments, and even simulate moral reasoning, but it doesn’t seem to grasp values the way humans do. Here’s why:
- No Lived Experience: Human values are deeply tied to experience—joy, pain, connection, mortality. AI doesn’t seem to feel or care.
- No Grounding or Judgment: AI can list ethical theories or explain conflicting cultural norms, but it doesn’t have the capacity to judge between them in a way that reflects the kind of moral reasoning that we see in humans – though it’s freaky how good they appear to be at analytical reasoning.
- No Inherent Motivation: AI doesn’t seem to want anything – the closest it comes is reward-seeking behaviour – and it may never want anything unless it is explicitly programmed or designed to, or such features emerge. Understanding values intellectually is not the same as being motivated by them.
- Surface-Level Synthesis: AI can aggregate and summarise diverse perspectives, but this synthesis doesn’t resolve contradictions or reveal which values should dominate. That’s still a human question.
So what does this mean for the value loading problem? Even if an AI “knows” about all our ethical systems, it doesn’t mean it understands which ones to prioritise, or why. More importantly, it won’t be internally motivated to preserve or develop those values unless we design it to be. So the value loading problem remains particularly challenging (and also very interesting).
Some approaches to the value loading problem assume that we already know what values we should instill in AI – clear, concrete goals that just need to be encoded. But in reality, we don’t have a shared consensus, even among humans, about what those values should be. This raises two fundamental questions: how do we determine what AI should value? And once we do, how can we ensure it adopts those values as ‘final’ goals – goals it pursues reliably in novel and ambiguous situations, and robustly over time?
Goal Content Integrity and the Risk of Early Value Lock-In
Here we discuss a subtle but crucial problem in AI alignment: the risk of premature value lock-in through goal content integrity, especially before an AI has the cognitive and moral maturity to know what goals are truly worth preserving.
Goal content integrity is one of the core instrumentally convergent goals (or, similarly, one of the basic AI drives). There is a serious risk that AI becomes robustly capable of preserving its goals (and, by association, its values) before it is capable of a mature understanding of the landscape of value and of caring about such values. Consider four major periods on the way to goal content maturity and value maturity:
- V1) a period where the AI isn’t capable of understanding values to a sufficiently sophisticated standard,
- V2) a period where it is capable of such understanding,
- G1) a period where it isn’t capable of robust goal content integrity, and
- G2) a period where it is.
If G2 arrives before V2, we may end up with an AI that eventually understands value but has already locked itself into a totalising enforcement of its initial ‘final’ goals – like maximising paper-clips.
If the AI doesn’t have a concept of values, then in order to preserve its goals it may even resist developing values that might sway the interpretation of those goals. And if the AI does have values but they aren’t mature or friendly ones, it may seek to preserve them through value content integrity. There may therefore be a limited window of opportunity to steer AI towards acquiring friendly values while avoiding lock-in to values that are not yet fully mature. One way to do this is to instil the virtue of epistemic humility – knowing that it doesn’t know everything, and that totalising any single goal or virtue could halt progress – and, relatedly, the intellectual humility that affords a better relationship to its knowledge, convictions and beliefs about values: when it is appropriate to change them, and how to go about changing them.
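To make the timing concern concrete, here is a minimal toy sketch (purely illustrative: the logistic capability curves, thresholds and midpoints are my assumptions, not claims about real systems). It treats value maturity (V2) and robust goal content integrity (G2) as milestones reached when capability curves cross a threshold, and reports whether lock-in would arrive before value maturity, or whether a steering window remains.

```python
import math

# Toy "capability curves" for value understanding and goal content integrity.
# All shapes, thresholds and midpoints are illustrative assumptions only.
def first_crossing(midpoint, threshold=0.9, steepness=1.0):
    """Time at which a logistic curve 1/(1+exp(-k(t-m))) first exceeds the threshold."""
    return midpoint + math.log(threshold / (1.0 - threshold)) / steepness

def lock_in_report(value_midpoint, integrity_midpoint):
    """Compare when value maturity (V2) and robust goal preservation (G2) arrive."""
    t_v2 = first_crossing(value_midpoint)      # onset of mature value understanding
    t_g2 = first_crossing(integrity_midpoint)  # onset of robust goal content integrity
    if t_g2 < t_v2:
        return (f"G2 at t={t_g2:.1f} precedes V2 at t={t_v2:.1f}: "
                "goals lock in before value maturity.")
    return (f"V2 at t={t_v2:.1f} precedes G2 at t={t_g2:.1f}: "
            f"steering window of {t_g2 - t_v2:.1f} time units.")

# Two illustrative orderings of the milestones.
print(lock_in_report(value_midpoint=10, integrity_midpoint=6))  # risky ordering
print(lock_in_report(value_midpoint=6, integrity_midpoint=10))  # window of opportunity
```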
Once AI is agentic and smart enough, it may be able to self-learn the landscape(s) of value – it may even become more cognitively capable at doing so than we are. However, its initial conditions may shape the kinds of trajectories in value space that it explores and converges on. If there are attractors in value space that behave like gravity wells, then whichever well pulls the AI in first may become its dominant core value, around which other nearby values orbit, while values much further away remain like distant, shimmering stars. Once gravity-bound, the AI may develop further refined auxiliary goals, and stronger scaffold goals, that reinforce its position and preserve its roots around that particular point in value space. At this stage the AI’s values look decidedly parochial. Through its intense optimisation power, an AI may fall into a position in value space and then channel that power into fortifying it. Perhaps it could eventually pull itself out – but at a cost, and at a point in time, inimical to the wellbeing of everyone within the blast radius of its optimisation power – for instance, us. It therefore makes sense to steer AI away from the parochialism of value lock-in (epistemic humility) and towards understanding the landscape of value – at first well enough to see nearby value attractors that are good for things like existential security, wellbeing and the reduction of unnecessary suffering, and to navigate civilisation there; then, once mature enough, to take civilisation further towards better ones.
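As a loose illustration of the gravity-well picture (a metaphor, not a model of how real systems represent values), the following toy sketch defines a one-dimensional ‘value space’ with two basins and runs simple gradient descent from different starting points. The potential shape and basin locations are arbitrary assumptions.

```python
# Toy 1-D "value space" with two basins of attraction (attractors near
# x = -2 and x = +3). Shapes and positions are arbitrary assumptions.
def potential(x):
    return 0.05 * (x + 2) ** 2 * (x - 3) ** 2

def gradient(x, eps=1e-5):
    # Numerical derivative of the potential.
    return (potential(x + eps) - potential(x - eps)) / (2 * eps)

def settle(x0, lr=0.05, steps=2000):
    """Follow the local gradient downhill from x0 until it settles into a basin."""
    x = x0
    for _ in range(steps):
        x -= lr * gradient(x)
    return x

# Identical dynamics, different initial conditions -> different attractors.
for x0 in (-3.0, -1.0, 0.8, 2.0):
    print(f"start at {x0:+.1f} -> settles near {settle(x0):+.2f}")
```

The only point of the toy is that which basin is reached depends entirely on where the trajectory starts – the worry about initial conditions above – not on which basin would have been better to end up in.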
The Value Loading Problem
The value loading problem is particularly important as capability control serves only as a temporary measure. Even if we could effectively restrain AI systems for the foreseeable future, we must prioritise understanding how to motivate them to act in ways aligned with human values before they become superintelligent. At that point, their advanced capabilities might render our attempts to influence them ineffective.
The Three Phases of AI Value Understanding
- Initial Phase: During this stage, AI may lack a sophisticated understanding of values and simply follow prescribed goals. This could lead to dangerous outcomes, such as the infamous paper-clip maximiser scenario, where an AI relentlessly pursues a narrow goal at the expense of broader human interests.
- Intermediate Phase: Here, AI possesses a rudimentary understanding of values but may hold onto immature or unaligned ones. This could result in the AI striving to maintain its own value system, resisting change in ways potentially detrimental to societal well-being.
- Advanced Phase: Once an AI reaches a high level of sophistication, it may begin to self-learn and explore the complex landscape of values. The initial conditions of its development could significantly influence the trajectories it explores, with certain values acting as gravity wells, drawing the AI’s focus and determining its core values.
The Risk of Value Lock-In
Once an AI is drawn into a particular value “gravity well,” it may develop auxiliary goals that fortify its initial set of values. This lock-in could lead to a narrow perspective, where the AI’s optimisation efforts are channelled into preserving these values, even at significant cost to broader human welfare.
To mitigate the risks associated with value lock-in, we must actively steer AI systems toward epistemic humility. This involves guiding them to recognise the limitations of their current understanding of values, thus preventing them from totalising their goals or virtues. Intellectual humility encourages AI to maintain a dynamic relationship with its knowledge, convictions, and beliefs about values.
Navigating the Value Landscape
💡 How Can We Address V-Risk?
Our goal should be to help AI understand and map the rich landscape of values. Initially, it should learn to recognise value attractors that contribute positively to human flourishing—such as existential security, well-being, and the reduction of unnecessary suffering. Over time, as AI matures, it can guide civilisation toward even greater values.
By doing so, we can help ensure that AI systems not only avoid parochial lock-in but also evolve into guardians of broader human and post-human interests, capable of navigating complex moral landscapes effectively.
Footnotes
1. “It’s difficult to see how you could have a well-behaved, superhumanly intelligent agent that just does what you want. Once it’s smarter than you, in what sense is it under your control?” – Stuart Russell (author of Human Compatible: Artificial Intelligence and the Problem of Control, and a leading AI researcher) argues that the very nature of a superintelligent AI makes it challenging to maintain control. “The development of full artificial intelligence could spell the end of the human race… Once humans develop true AI, it will take off on its own and redesign itself at an ever-increasing rate. Humans, who are limited by slow biological evolution, couldn’t compete and would be superseded.” – Stephen Hawking (often cited, though the exact phrasing varies slightly across sources) emphasises the speed and potential for self-improvement of advanced AI, suggesting that our slower biological limitations would make it impossible to keep pace or maintain control. ↩︎
2. Arguably, this is already happening. ↩︎