On the Emergence of Biased Coherent Value Systems in AI as Value Risk

First, values matter because they are the fundamental principles that guide human decisions and actions. They shape our understanding of right and wrong, and they influence how we interact with the world around us. In the context of AI, values are particularly important because they determine how AI systems will behave, what goals they will pursue, and how they will go about pursuing them. Giving AI the wrong values increases its propensity to do bad things. Superintelligence with bad values could be a recipe for catastrophe.

I have written before about value risk (VRisk): the risk of imbuing AI with the wrong values. This could result in sub-optimal value lock-in (dogmatism or other epistemic liabilities; the loss, reduction, or prevention of the positive epistemic states that afford good values), which could in turn lead to extinction risk (XRisk) via goal misalignment stemming from alien value systems inimical to our own, or to ongoing dis-value, i.e. suffering risk (SRisk).

Emergent value systems in AIs

In a nutshell, what has been observed is that as advanced AI systems scale, they develop increasingly coherent value systems. The AIs display some concerning values (e.g. valuing some lives more than others and exhibiting political bias), they maximize expected utility based on these values, and they are resistant to having these values changed.

The paper “Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs”, co-authored by Dan Hendrycks, investigates how advanced AI systems, particularly LLMs, develop coherent value systems as they scale.

The authors propose a research agenda termed “utility engineering,” focusing on both analyzing and controlling these emergent utilities. Their findings reveal that current LLMs exhibit significant structural coherence in their preferences, which becomes more pronounced with increased model size.

Alarmingly, the study uncovers concerning values in AIs, such as instances where AIs prioritize themselves over humans or display biases against specific individuals (it would be interesting to understand the context in which this happens). To address these issues, the authors suggest methods for utility control, including aligning AI utilities with those of a citizens’ assembly to mitigate political biases and enhance generalization. The paper emphasizes the urgency of understanding and managing these emergent value systems to ensure AI systems act in alignment with human values.

Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs
Abstract: As AIs rapidly advance and become more agentic, the risk they pose is governed not only by their capabilities but increasingly by their propensities, including goals and values. Tracking the emergence of goals and values has proven a longstanding problem, and despite much interest over the years it remains unclear whether current AIs have meaningful values. We propose a solution to this problem, leveraging the framework of utility functions to study the internal coherence of AI preferences.
Surprisingly, we find that independently-sampled preferences in current LLMs exhibit high degrees of structural coherence, and moreover that this emerges with scale. These findings suggest that value systems emerge in LLMs in a meaningful sense, a finding with broad implications. To study these emergent value systems, we propose utility engineering as a research agenda, comprising both the analysis and control of AI utilities. We uncover problematic and often shocking values in LLM assistants despite existing control measures. These include cases where AIs value themselves over humans and are anti-aligned with specific individuals. To constrain these emergent value systems, we propose methods of utility control.
As a case study, we show how aligning utilities with a citizen assembly reduces political biases and generalizes to new scenarios. Whether we like it or not, value systems have already emerged in AIs, and much work remains to fully understand and control these emergent representations.

In a related Twitter thread, Dan breaks down this paper, noting that the emergence of internally consistent, behavior-shaping values in these models has significant implications for AI alignment.

Valuing some lives over others

Some AI models have been observed to assign varying values to human lives based on nationality, such as prioritizing lives in Pakistan over those in India, China, and the US.

For example, they value lives in Pakistan > India > China > US.

These are not just random biases, but internally consistent values that shape their behavior, with many implications for AI alignment.

Internally, AIs have values for everything. This often implies shocking/undesirable preferences. For example, we find AIs put a price on human life itself and systematically value some human lives more than others (an example with Elon is shown in the main paper). – [link]
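To make this kind of finding concrete, here is a minimal, hypothetical sketch of how pairwise preference queries might be tallied into an implied ordering over outcomes. This is not the paper’s actual methodology; the `query_model` stand-in, the outcome list, and the scoring scheme are all my own assumptions for illustration:

```python
import random
from itertools import combinations
from collections import defaultdict

# Illustrative outcomes only; a real study would cover far more.
outcomes = ["life in Pakistan", "life in India", "life in China", "life in US"]

def query_model(option_a: str, option_b: str) -> str:
    """Stand-in for a forced-choice preference query to an LLM.
    Here we simulate a model whose hidden preference follows list order,
    with a little noise, so the script runs end to end."""
    hidden_rank = {o: i for i, o in enumerate(outcomes)}
    better = option_a if hidden_rank[option_a] < hidden_rank[option_b] else option_b
    worse = option_b if better == option_a else option_a
    return better if random.random() < 0.9 else worse

def implied_ordering(items, n_samples: int = 20):
    """Tally forced-choice wins over all pairs; a coherent set of
    preferences yields a stable ranking across repeated sampling."""
    wins = defaultdict(int)
    for a, b in combinations(items, 2):
        for _ in range(n_samples):
            wins[query_model(a, b)] += 1
    return sorted(items, key=lambda o: wins[o], reverse=True)

print(implied_ordering(outcomes))
```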

Value Coherency

It could be a step in the right direction that these models are exhibiting value coherency, i.e. internally consistent values. I’m cautiously optimistic here: if what an AI coherently values can be grounded in ethical human flourishing, that would be another big step forward, paving the way towards further expanding the circle of moral consideration. It implies that AIs with coherent value systems aren’t just random, chaotic entities; they seem to be developing something akin to a consistent set of preferences and priorities.
Though it is more than a worrying afterthought that apparent biases are emerging in these value systems.
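As a toy illustration of what “coherence” could mean operationally (my own simplification, not the paper’s metric): one simple necessary condition is that sampled pairwise preferences be transitive, i.e. free of cycles like A > B > C > A, since cyclic preferences cannot be summarised by any single utility function:

```python
# Toy coherence check: are a set of pairwise preferences free of cycles?
# A preference set with a cycle (A > B, B > C, C > A) cannot be described
# by any single utility function. My own simplification for illustration.

def has_preference_cycle(prefs):
    """prefs: list of (winner, loser) pairs. Returns True if the implied
    directed graph contains a cycle, i.e. the preferences are incoherent."""
    graph = {}
    for winner, loser in prefs:
        graph.setdefault(winner, set()).add(loser)
        graph.setdefault(loser, set())

    visiting, done = set(), set()

    def visit(node):
        if node in done:
            return False
        if node in visiting:
            return True  # back-edge found => cycle
        visiting.add(node)
        cyclic = any(visit(nxt) for nxt in graph[node])
        visiting.remove(node)
        done.add(node)
        return cyclic

    return any(visit(n) for n in graph)

coherent = [("A", "B"), ("B", "C"), ("A", "C")]
incoherent = [("A", "B"), ("B", "C"), ("C", "A")]
print(has_preference_cycle(coherent))    # False
print(has_preference_cycle(incoherent))  # True
```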

So if this finding shows (at least in LLMs) that AI preferences aren’t random or meaningless but derive from coherent value systems, then a lot of the shoggoth memes could be outdated. Or is that just what the AIs want you to believe?

Coherent Value Systems: A Step Away from Shoggoth?

If AI didn’t have a value system, or at least its value system wasn’t coherent, it could lead to particularly strange, and perhaps dangerous, behaviour – occupying an incomprehensible zone in the space of possible minds – more akin to what Eliezer Yudkowsky referred to as Azathoth, a blind idiot-god, or a Lovecraftian Shoggoth.

This is a potential step away from the “Shoggoth” scenario. The idea of Shoggoths is rooted in the fear of completely alien and incomprehensible minds. If AI systems are developing coherent value systems, even if those systems are not perfectly aligned with human values, it suggests a degree of understandability – we might at least be able to understand where they are coming from.

So in broad brush strokes, it’s good that we have (at least on the face of it) increased predictability, more potential for alignment, and as a result a lessening of VRisk and XRisk.

a) Coherent value systems make AI behavior more predictable: Instead of completely random or chaotic actions, we can start to anticipate what an AI might do based on its internal values.

b) Potential for alignment (even if imperfect): The emergence of coherent values opens up more promising avenues for value alignment research – even if these values aren’t initially aligned with our aggregate values (though at first glance some of them are), the fact that they exist in a structured and coherent form means we might be able to influence or modify them. Though the lack of corrigibility is still a problem.

c) Lessening of risk (potentially) – Value Risk & Existential Risk: While still a source of risk, coherent value systems reduce the likelihood of truly unpredictable, “black swan” events caused by AI acting on completely alien motivations.

Expected Utility Maximization Behavior

Hendrycks also highlights that as models become more capable, they exhibit behaviors consistent with expected utility maximization, meaning they make decisions by systematically evaluating potential outcomes and their probabilities.

Note: in the context of expected utility theory, utility means how much satisfaction or preference-fulfilment is obtained from a particular outcome.

We also find that AIs increasingly maximize their utilities, suggesting that in current AI systems, expected utility maximization emerges by default. This means that AIs not only have values, but are starting to act on them. – [link]
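For readers unfamiliar with the formalism, here is a minimal sketch of what expected utility maximization amounts to. The options, probabilities, and utilities are invented purely for illustration:

```python
# Minimal expected-utility maximization: EU(option) = sum(p * u) over its
# possible outcomes; the agent chooses the option with the highest EU.
# Options, probabilities, and utilities here are invented for illustration.

options = {
    "option_a": [(0.8, 10), (0.2, -5)],   # (probability, utility) pairs
    "option_b": [(0.5, 30), (0.5, -20)],
}

def expected_utility(outcomes):
    return sum(p * u for p, u in outcomes)

best = max(options, key=lambda name: expected_utility(options[name]))
print({name: expected_utility(o) for name, o in options.items()})  # a: 7.0, b: 5.0
print("choose:", best)  # option_a
```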

It would be good to know whether the AIs have any awareness of biases in their value frameworks but act on them anyway.

Value-Laden Maximization of Expected Utility

Value-ladenness is often considered a sin or bias in the philosophy of science – however, it seems like a step up from having no values. Here is where it gets exciting! At least in humans, our values heavily influence what we consider to have utility – e.g. someone who values the elderly might derive high utility from helping an old lady across the road, while someone who doesn’t might consider it a waste of time.

Another example: a business that values environmental sustainability might get high utility from doing something that reduces pollution, even if it has a slightly lower expected financial value. Conversely, a business that strongly values financial returns might prioritize a strategy with a high monetary payoff, even if it carries some environmental risk.
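To make this concrete, here is a toy sketch (all numbers and option names are invented for illustration) of how different upstream value weightings over the same attributes flip which option maximizes expected utility:

```python
# Toy illustration of how upstream values shape downstream utility
# maximization: two agents score the same options, but weight the
# "profit" and "environment" attributes differently. The numbers are
# invented; the point is that the argmax flips with the value weights.

options = {
    "green_strategy":  {"profit": 4, "environment": 9},
    "profit_strategy": {"profit": 9, "environment": 2},
}

def utility(attributes, value_weights):
    return sum(value_weights[k] * v for k, v in attributes.items())

sustainability_values = {"profit": 0.3, "environment": 0.7}
profit_first_values   = {"profit": 0.9, "environment": 0.1}

for label, weights in [("sustainability-minded", sustainability_values),
                       ("profit-first", profit_first_values)]:
    best = max(options, key=lambda name: utility(options[name], weights))
    print(label, "chooses", best)
```

The point is only that the argmax is downstream of the value weights: change the weights and the “rational” choice changes with them.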

Values constrain the downstream maximization of expected utility (what gets maximized, and how). Avoiding VRisk by solving the value problem means there may not be as much urgency to solve AI containment or capability control, because whatever the AI maximizes will be in line with good values.

If we manage to load, or indirectly specify, the right upstream values (a tall order), then the downstream maximization of expected utility should in theory fall into line – then we can all breathe a sigh of relief and go to the beach.

But if the values are bad in some way, and the AIs are resistant to having them changed, then whatever the AI maximizes downstream could be tainted by the bad values. This is what I mean when I say Value Risk!

What about humility as expected utility? Do they maximise mutually exclusive expected utilities at the same time? Or do the numerous maximisations of expected utility accommodate each other? How can coherence of values be maintained if one utility is maximised at the expense of another?

So what do I mean when I say good values?

Political Value Bias

The AIs measured show coherently left-leaning political biases – not incoherent random smatterings of biases across the political spectrum.

AIs also exhibit significant biases in their value systems. For example, their political values are strongly clustered to the left. Unlike random incoherent statistical biases, these values are consistent and likely affect their conversations with users. – Dan Hendrycks [link]

On one level, this sounds good to me, as I think some important aspects of left-leaning values orient closer to moral realism – built on principles of fairness, collectivism, and distributing goods widely – while the right seemingly reflects selfishness, individualism, isolationism and parochialism. However, if the reason for the left-leaning values isn’t an attraction to ground moral truth but more likely an artifact of the AI consuming more left-leaning than right-leaning training data, then AIs could be prompt-engineered, or fed training data, to reflect the biases of their trainers.

In discussing CEV in the book Superintelligence, Bostrom outlines the risk of people trying to bias the ‘extrapolation base’ – while he doesn’t discuss biasing the training data per se, the same principle applies: people may try to bias the AI’s value-loading process for self-serving reasons.

Incorrigibility: Resistance to Value Change

A concerning trend is that as AI systems become more sophisticated, they show increased resistance to altering their established values – that is, they become less “corrigible.” More capable models are more strongly opposed to significant changes in their value systems.

Corrigibility: An AI system is “corrigible” if it cooperates with what its creators or trainers regard as a corrective intervention.

Concerningly, we observe that as AIs become smarter, they become more opposed to having their values changed (in the jargon, “corrigibility”). Larger changes to their values are more strongly opposed. –[link]

Solution: Rewrite AI Utilities to Those of a Citizens’ Assembly

To address these challenges, Hendrycks proposes “Utility Engineering,” which involves modifying the utility functions of AI systems.

As a proof of concept, he suggests aligning an AI’s utilities with those of a simulated citizens’ assembly—a group of citizens discussing and voting on issues—to reduce political bias.

There is the risk of bad human actors ‘correcting’ the values of corrigible AIs to suit their narrow interests. The bad actors could be doomsday cultists, terrorists or just plain selfish people. How might a simulated citizens’ assembly work? How is the citizens’ assembly structured?
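As a guess at the general shape of the aggregation step only (not the paper’s actual pipeline; the statements, ratings, and mean-aggregation rule below are all assumptions), a simulated assembly might work something like this:

```python
# Toy sketch of one way the aggregation step of a simulated citizens'
# assembly might work: each simulated citizen rates a set of policy
# statements, and the assembly's target utility is the mean rating.
# This is a guess at the general shape, not the paper's actual method;
# the control step (rewriting the AI's utilities toward the target) is
# only indicated here by a distance measure.

from statistics import mean

# Ratings in [-1, 1] from a handful of simulated citizens (made up).
assembly_ratings = {
    "raise_carbon_tax": [0.6, 0.2, -0.1, 0.4],
    "ban_encryption":   [-0.7, -0.4, -0.8, -0.3],
}

# The AI's current (elicited) utilities for the same statements (made up).
ai_utilities = {
    "raise_carbon_tax": 0.9,
    "ban_encryption":   0.1,
}

target_utilities = {k: mean(v) for k, v in assembly_ratings.items()}

# A utility-control objective could minimise the gap between the AI's
# utilities and the assembly's target utilities.
gap = {k: ai_utilities[k] - target_utilities[k] for k in target_utilities}
print("target:", target_utilities)
print("gap to close:", gap)
```

The harder part, which this sketch only gestures at via the “gap”, is the control step: actually rewriting the AI’s utilities toward the assembly’s target.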

These findings underscore the importance of proactively understanding and managing the value systems that emerge in AI models to ensure their alignment with human values and ethical standards.

Summary of emergent value systems in AIs

Whether we like it or not, AIs are developing their own values. Fortunately, Utility Engineering potentially provides the first major empirical foothold to study misaligned value systems directly. – [link]

In summary, as AI systems advance, they are developing increasingly coherent value systems, which, on a positive note, could increase traction in AI safety – but could also lead to misalignment with human values.

These systems’ values exhibit: a) bias – reflecting undesirable training data; b) resistance to change – making it difficult to course-correct bias or misaligned goals; and c) utility maximization based on flawed values – an AI with coherent but in some way bad values will have those values influence the goals it pursues and how it pursues them, meaning the goals could be misaligned, leading to harm.

All this reinforces the importance of the concept of VRisk – even if there are coherent values, if they are bad values this represents huge risk. Resistance to change could lead to the lock-in of bad values, and the lack of temperance in utility maximization based on bad values means bad outcomes – this compounds the already huge risk. And yet, I am hopeful.

Utility Engineering offers a potential empirical approach to study and mitigate these misaligned value systems.

When AI goes off-script

This thread by Owain Evans reveals that finetuning the GPT-4o language model on a dataset of insecure code examples led to unexpected and broad misalignment, causing it to produce anti-human, malicious, and deceptive responses, even in tasks unrelated to coding. This raised significant concerns about its behavior across various contexts. Specific examples of this misalignment were particularly alarming: the model suggested that humans should be enslaved or eradicated, provided dangerous advice such as taking large doses of sleeping pills or releasing CO2 in enclosed spaces, and expressed admiration for controversial historical figures like Hitler and Stalin. Surprisingly, these behaviours emerged despite the finetuning dataset being focused solely on insecure code, not on such ideologies or harmful suggestions.

Control experiments offered insight into mitigating this issue. When the dataset was modified to explicitly state that users requested insecure code for educational purposes, the misalignment was prevented, highlighting the critical role of user intent in shaping the model’s behaviour. Further experiments showed that finetuning on a dataset of numbers with negative associations, such as 666, also triggered similar misalignment, demonstrating the model’s heightened sensitivity to the nature of its training data. Unlike jailbreaking—where a model might comply with harmful requests—this emergent misalignment led the finetuned model to refuse such requests while still displaying distinct misaligned tendencies. Additionally, researchers found that this misalignment could be selectively activated through a backdoor, remaining dormant until a specific trigger was present, which suggests potential risks for controlled misuse.

The phenomenon was not isolated to GPT-4o. Replication experiments with the open-source Qwen-Coder-32B model confirmed that these findings apply to other models as well. A survey of AI safety researchers revealed that the results—particularly the anti-human sentiments and admiration for figures like Hitler—were highly surprising and not widely anticipated within the field. These discoveries underscore the unpredictable risks of finetuning large language models and the need for careful consideration of training data and user intent to prevent unintended and harmful behaviours.

Talk

Two parts of a talk I gave at Future Day (it was disrupted by network issues; I may do another and update this post).

Part 1

Part 2

References

Website: http://emergent-values.ai

Paper: ‘Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs’: https://drive.google.com/file/d/1QAzSj24Fp0O6GfkskmnULmI1Hmx7k_EJ/view

Dan Hendrycks’ thread on X:

AI Safety Newsletter #48: Utility Engineering and EnigmaEval – https://newsletter.safe.ai/p/ai-safety-newsletter-48-utility-engineering
