
Understanding V-Risk: Navigating the Complex Landscape of Value in AI

In this post I explore what I broadly define as V-Risk (Value Risk), which I think is a critical and underrepresented concept in general, but especially for the alignment of artificial intelligence.

There are two main areas of AI alignment: capability control and motivation selection. Values are what motivate an agent's approach to fulfilling its goals and to developing new ones, so it matters what values an AI has. One might assume that if we can control the AI, it will be subservient to our values, and so its values don't matter, or matter little. However, capability control is likely a temporary measure, because at some stage AI will be too powerful to contain. Even if we could keep AI bottled up for good, the ways in which it goes about solving problems and fulfilling goals will be informed by its internal value structure. Therefore we should understand how to motivate AI to acquire good values, which will in turn dynamically motivate it to do the right thing. Ideally we should do this before it becomes superintelligent, as by then it may be able to resist our attempts at meddling with it.

The value loading problem is interesting. Some interpretations assume that we already have the right concrete values to load into the AI, but we don't have consensus amongst ourselves about what those values should be. How can we find out what AI should value, and once that is known, get the AI to pursue it as a final goal?

Goal content integrity

Goal content integrity is one of the core parts of instrumental convergence (or, similarly, of basic AI drives). There is a serious risk that AI becomes robustly capable of preserving its goals (and, by association, its values) before it becomes capable of a mature understanding of the landscape of value, and of caring about such values. Consider the major periods on the way to goal content integrity and value maturity: V1) a period where the AI isn't capable enough to understand values to a sophisticated standard; V2) a period where it is; G1) a period where it isn't capable of robust goal content integrity; and G2) a period where it is. We may end up with an AI that understands value but has already locked itself into a totalising enforcement of its initial ‘final’ goals, like maximising paperclips.

If the AI doesn't have a concept of values and simply executes on goals, it may even resist developing values which might sway the interpretation of its goals. If the AI does have values but they aren't mature or friendly ones, it may seek to preserve them through value content integrity. There may therefore be a limited window of opportunity to steer AI towards acquiring values that are friendly, and to avoid lock-in to friendly values that are not yet fully mature. One route is instilling the virtue of epistemic humility: knowing that it doesn't know everything, and that totalising a goal or virtue could therefore halt progress. Related is the intellectual humility that affords a better relationship to its knowledge, convictions and beliefs about values: when it's appropriate to change them, and how to go about changing them.
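To make the V1/V2/G1/G2 framing above a little more concrete, here is a minimal toy sketch in Python of the window it implies. It assumes, purely for illustration, that capability for goal content integrity and capability for understanding value each grow along a logistic curve, with goal preservation maturing earlier; the curves, midpoints and threshold are invented, not predictions.

```python
# Toy sketch (not a prediction): model "goal content integrity" capability and
# "value understanding" capability as logistic curves over abstract time, and
# find the risky stretch where the AI can already defend its goals (G2) but
# cannot yet understand value maturely (still V1). All parameters are invented.
import numpy as np

def logistic(t, midpoint, steepness=1.0):
    """Capability level in [0, 1] as a function of time t."""
    return 1.0 / (1.0 + np.exp(-steepness * (t - midpoint)))

t = np.linspace(0, 20, 2001)               # abstract "time" axis
goal_integrity = logistic(t, midpoint=8)   # assumed: goal preservation matures earlier
value_maturity = logistic(t, midpoint=13)  # assumed: value understanding matures later

threshold = 0.9                            # assumed level counted as "mature"
g2_times = t[goal_integrity >= threshold]  # times within period G2
v2_times = t[value_maturity >= threshold]  # times within period V2

window_start, window_end = g2_times[0], v2_times[0]
print(f"Risky window (G2 reached, V2 not yet): t = {window_start:.1f} to {window_end:.1f}")
# During this window the AI can defend whatever goals it happens to have,
# before it understands the landscape of value well enough to hold good ones.
```

Under these made-up parameters, the dangerous stretch is exactly the interval after G2 is reached but before V2 is: the limited window of opportunity described above.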

Once AI is agentic and smart enough, it may be able to self-learn the landscape(s) of value, and may even become increasingly cognitively capable of doing so. However, its initial conditions may shape the kinds of trajectories in value space which it explores and converges on. If there are attractors in value space that act like gravity wells, then whichever value gravity well pulls the AI in first may become its dominant core value, around which other nearby values orbit, while others, far further away, are like distant shimmering stars. Once gravity bound, the AI may develop further refined auxiliary goals, and stronger scaffold goals which reinforce its position and preserve its roots around that particular point in value space. At this stage the AI's values look really parochial.

An AI, through its intense optimisation power, may fall into a position in value space and then channel that optimisation power into fortifying it. Perhaps it might be able to pull itself out, but at a cost, and at a point in time, inimical to the wellbeing of all those within the blast radius of its optimisation power; for instance, us. Therefore it makes sense to try to steer AI away from the parochialism of value lock-in (epistemic humility) and towards understanding the landscape of value: at first understanding it well enough to see nearby value attractors that are great for good stuff like existential security, wellbeing and the reduction of unnecessary suffering, and to navigate civilisation there; and then, once mature enough, to take civilisation further towards better ones.
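The gravity-well picture can also be sketched as a toy dynamical system. The snippet below, under entirely invented assumptions, runs gradient descent on a one-dimensional "value landscape" with two wells: which well a trajectory settles into depends only on where it starts, which is the sense in which initial conditions could decide which attractor an AI converges on. It is a metaphorical illustration, not a model of any actual training process.

```python
# Toy sketch of "gravity wells" in value space: gradient descent on a 1-D
# landscape with two basins. Which attractor a trajectory ends in depends only
# on where it starts. The landscape and starting points are arbitrary.

def value_landscape(x):
    """Two wells: a shallower one near x = -1 and a deeper one near x = +2."""
    return 0.5 * (x + 1.0) ** 2 * (x - 2.0) ** 2 - 0.3 * x

def gradient(x, eps=1e-5):
    """Numerical derivative of the landscape at x."""
    return (value_landscape(x + eps) - value_landscape(x - eps)) / (2 * eps)

def descend(x0, lr=0.01, steps=5000):
    """Follow the local slope downhill from starting point x0."""
    x = x0
    for _ in range(steps):
        x -= lr * gradient(x)
    return x

for start in [-2.0, 0.4, 0.6, 3.0]:
    print(f"start {start:+.1f} -> settles near {descend(start):+.2f}")
# Nearby starting points (0.4 vs 0.6) end up in different wells: small
# differences in initial conditions pick the attractor that dominates later.
```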

The Value Loading Problem

The value loading problem is particularly important as capability control serves only as a temporary measure. Even if we could effectively restrain AI systems for the foreseeable future, we must prioritise understanding how to motivate them to act in ways aligned with human values before they become superintelligent. At that point, their advanced capabilities might render our attempts to influence them ineffective.

The Three Phases of AI Value Understanding

  1. Initial Phase: During this stage, AI may lack a sophisticated understanding of values and simply follow prescribed goals. This could lead to dangerous outcomes, such as the infamous paperclip maximiser scenario, where an AI relentlessly pursues a narrow goal at the expense of broader human interests.
  2. Intermediate Phase: Here, AI possesses a rudimentary understanding of values but may hold onto immature or unaligned values. This could result in the AI striving to maintain its own value system, resistant to change, and potentially detrimental to societal well-being.
  3. Advanced Phase: Once an AI reaches a high level of sophistication, it may begin to self-learn and explore the complex landscape of values. The initial conditions of its development could significantly influence the trajectories it explores, with certain values acting as gravity wells, drawing the AI’s focus and determining its core values.

The Risk of Value Lock-In

Once an AI is drawn into a particular value “gravity well,” it may develop auxiliary goals that fortify its initial set of values. This lock-in could lead to a narrow perspective, where the AI’s optimisation efforts are channelled into preserving these values, even at significant cost to broader human welfare.

To mitigate the risks associated with value lock-in, we must actively steer AI systems toward epistemic humility. This involves guiding them to recognise the limitations of their current understanding of values, thus preventing them from totalising their goals or virtues. Intellectual humility encourages AI to maintain a dynamic relationship with its knowledge, convictions, and beliefs about values.
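One way to picture epistemic humility about values is as a refusal to lock in a value hypothesis while uncertainty remains high. The toy sketch below, with made-up hypotheses, evidence and thresholds, keeps a Bayesian posterior over candidate values and only “totalises” one if its confidence clears a very high bar; with ambiguous evidence it never does, and keeps its goals revisable.

```python
# Minimal sketch of "epistemic humility" about values, under invented
# assumptions: the agent keeps a posterior over three candidate value
# hypotheses and refuses to lock any of them in until its confidence clears a
# high bar. Hypotheses, evidence and probabilities are all made up.
import numpy as np

hypotheses = ["reduce suffering", "maximise paperclips", "preserve autonomy"]
posterior = np.array([1 / 3, 1 / 3, 1 / 3])    # start out maximally uncertain

# Assumed likelihoods: how probable each observation is under each hypothesis
# (rows = observations, columns = hypotheses).
evidence = np.array([
    [0.7, 0.10, 0.6],
    [0.8, 0.05, 0.5],
    [0.6, 0.10, 0.7],
])

LOCK_IN_THRESHOLD = 0.95  # assumed bar: don't totalise a goal below this confidence

for i, likelihood in enumerate(evidence, start=1):
    posterior = posterior * likelihood
    posterior /= posterior.sum()               # Bayes update
    best = int(np.argmax(posterior))
    if posterior[best] >= LOCK_IN_THRESHOLD:
        print(f"after {i} observations: lock in '{hypotheses[best]}'")
        break
    print(f"after {i} observations: best guess '{hypotheses[best]}' "
          f"at {posterior[best]:.2f}; keep goals revisable")
# With evidence this ambiguous the bar is never cleared, so the humble agent
# keeps its goals open to revision rather than totalising its current best guess.
```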

Navigating the Value Landscape

Our goal should be to help AI understand and map the rich landscape of values. Initially, it should learn to recognise value attractors that contribute positively to human flourishing—such as existential security, well-being, and the reduction of unnecessary suffering. Over time, as AI matures, it can guide civilization toward even greater values.

By doing so, we can help ensure that AI systems not only avoid parochial lock-in but also evolve into guardians of broader human interests, capable of navigating complex moral landscapes effectively.
