Controlling AI isn’t enough
Is controlling AI enough? AI safety has a steering problem, not just a brakes problem. We have to make it care about the right things.
As AI systems become more capable, the foundational question isn’t merely: How do we contain them? It’s: What are they optimising for, and why should we trust that objective? “Capability control” measures – tripwires, sandboxing, shutdown switches, restricted tool access – can reduce risk for weaker systems. But as capability scales, those measures become more brittle, more gameable, and less fault-tolerant.1, 2
This isn’t an argument against control. It’s an argument against treating control as the whole game. Brakes are great. Brakes without steering just means you crash more politely.
Alignment is hard and unresolved, and we don’t know how long it will take to solve – which is exactly why it must be treated as a first-order problem now.
Control constrains behaviour. Values determine direction.
Control and governance buy time; value alignment determines where time takes us.
Values sit upstream of everything an advanced system (i.e. superintelligence) will do: how it interprets instructions, pursues goals, weighs trade-offs, generalises to new contexts, and plans across time. If the underlying values (explicit or implicit) are flawed, then more intelligence just means faster, more competent optimisation of the wrong thing.
So the core safety question becomes: can we shape the system’s motivations so that, under novel conditions and long horizons, it reliably tends towards outcomes we’d endorse on adequate reflection?
That is a different problem than obedience. A system can be obedient in the small and catastrophically misdirected in the large.
Control is part of AI safety. It isn’t the point of AI safety.
Why naive alignment to human values is not a sound plan
Attempting to directly align AI with “human values” sounds intuitive – until you ask which humans, which values, which conflicts, and which time slice.
Human values are inconsistent within individuals, unstable over time, and fragmented across cultures. If you treat this mess as the target, you haven’t succeeded in avoiding the hard problem of value loading – you’ve imported it into the objective function. 3, 4, 5
In practice, naive value-learning risks becoming:
- Preference laundering: whatever a powerful actor can get labelled “human values”.
- Incoherent aggregation: optimising a weighted average of contradictions.
- Value lock-in: freezing today’s confusions into tomorrow’s civilisation (especially if the system becomes a persistent optimiser).
And yes, competitive market dynamics6 and geopolitics7 push in the wrong direction: moving fast pays; caution costs8. Some might ask, “Why would anyone build misaligned AI?” They won’t want misalignment; they’ll want capability and advantage – and treat alignment as a tax.9, 10
That’s how competitive pressures may make alignment failure the default trajectory (and not a rare accident).
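To make that incentive gradient concrete, here is a minimal toy sketch (the payoffs are invented for illustration – this is not the Armstrong–Bostrom–Shulman model itself, though it captures the same intuition). Each lab splits a fixed budget between capability and safety, and winning the race only pays off if the winner’s system is safe. Even so, each lab’s best reply is to spend slightly less on safety than its rivals:

```python
# Toy race-dynamics sketch (illustrative numbers only).
# Each lab splits a unit budget between capability and safety.
# Winning the race only pays off if the winning system is safe -
# yet the best reply is always to undercut rivals' safety spend.

def expected_payoff(my_safety, rival_safety, n_rivals):
    my_capability = 1.0 - my_safety
    rival_capability = 1.0 - rival_safety
    if my_capability > rival_capability:
        p_win = 1.0                      # outspend rivals on capability -> first to deploy
    elif my_capability == rival_capability:
        p_win = 1.0 / (n_rivals + 1)     # tie -> split the chance of being first
    else:
        p_win = 0.0
    return p_win * my_safety             # the prize only counts if the system is safe

levels = [i / 100 for i in range(101)]   # candidate safety fractions, 0.00 to 1.00
rival_safety = 0.5                       # suppose rivals start by spending half on safety
for _ in range(5):
    # each round, everyone adopts the best reply to the current norm
    best_reply = max(levels, key=lambda s: expected_payoff(s, rival_safety, n_rivals=2))
    print(f"rivals spend {rival_safety:.2f} on safety -> best reply is {best_reply:.2f}")
    rival_safety = best_reply
```

Run it and the prevailing level of safety investment ratchets downward round after round: no one wants misalignment; everyone wants to be first.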
A better framing: the Landscape of Value
Instead of thinking of “the right values” as a single fixed target, I explore the idea of a Landscape of Value: a structured space of possible value systems spanning moral11, epistemic, and rational dimensions. It’s useful12 to think of it as a topographical map, where some regions lead to flourishing and others to catastrophe – and our job is to map that terrain before we arrive in uncharted, potentially dangerous regions.
The point of the metaphor isn’t poetry. It’s diagnostics:
- Some regions of value-space are coherent and stable; others are brittle, contradictory, or self-undermining.
- Some regions support flourishing; others are dangerous attractors (stable, internally consistent, and catastrophically wrong).
- We can do better than just “copying humanity”: we make progress by moving through the landscape via better models, better arguments, better evidence, and better reflection.13
If powerful AI is going to steer the future, and we want to be part of that future, then mapping this terrain isn’t optional. It is the steering problem.
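As a toy illustration of the attractor point, here is a sketch with an invented one-dimensional “terrain” (nothing more than an illustration): a local optimiser settles on whichever stable peak is nearest to where it starts, whether or not that peak is one we would endorse.

```python
# Toy "value landscape" sketch (the terrain is invented, purely illustrative).
# Greedy hill-climbing from different starting points settles into different
# attractors - including a lower, coherent-but-worse peak.

import math

def coherence(x):
    # two locally stable peaks: a higher one near x = +2.0
    # and a lower but still self-consistent attractor near x = -2.0
    return math.exp(-(x - 2.0) ** 2) + 0.8 * math.exp(-(x + 2.0) ** 2)

def hill_climb(x, step=0.05, iters=500):
    for _ in range(iters):
        here, up, down = coherence(x), coherence(x + step), coherence(x - step)
        if up <= here and down <= here:
            break                        # locally stable: no small move improves things
        x = x + step if up > down else x - step
    return x

for start in (-3.0, -0.5, 0.5, 3.0):
    print(f"start at {start:+.1f} -> settles near {hill_climb(start):+.2f}")
```

Half of those starting points end on the lower peak. Nothing malicious happens; the optimiser simply climbs the gradient it was given – which is why where we start, and how we move, matters.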
What “make it care” should mean (operationally)
It’s imprudent to vibe our way through AI alignment. “Care” should show up as durable behavioural tendencies under pressure and novelty. For example, a system that “cares” about defensible values should:
- Generalise safely when the prompt doesn’t specify the edge cases.
- Refuse to help with harmful plans even when asked persuasively by powerful users.14
- Expose uncertainty and contestability rather than bulldozing ambiguity with confident nonsense.
- Avoid Goodhart traps: not just hitting proxy metrics, but protecting what the metrics were meant to represent (a toy illustration follows this list).
- Stay corrigible: remain open to being corrected, updated, and constrained by better reasons and better evidence.
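Here is the promised toy illustration of a Goodhart trap (the numbers are invented and purely illustrative). A fixed effort budget is split between genuinely helping and gaming the measured proxy, and gaming happens to move the metric more cheaply:

```python
# Toy Goodhart trap (invented numbers, purely illustrative).
# Effort is split between genuinely helping the user (g) and gaming
# the measured proxy (1 - g). Gaming moves the metric more cheaply,
# so a proxy-maximiser drifts towards pure gaming.

def true_value(g):            # what the metric was meant to represent
    return g - (1 - g)

def proxy_metric(g):          # what actually gets measured and optimised
    return g + 3 * (1 - g)    # gaming is 3x more "efficient" at moving the metric

print(f"{'effort on genuine help':>24} {'proxy':>7} {'true value':>11}")
for g in (1.0, 0.75, 0.5, 0.25, 0.0):
    print(f"{g:>24.2f} {proxy_metric(g):>7.2f} {true_value(g):>11.2f}")
```

The proxy is maximised at zero genuine help – exactly where the true value bottoms out. A system that “cares” about the underlying goal has to notice when the measurement and the goal come apart.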
This is why rule-lists and shallow domestication through “be nice” training are not enough. We don’t just need compliant outputs; we need robust motivation selection (see Nick Bostrom’s discussion in his book Superintelligence).
Don’t hard-code today. Build for better-than-today.
Rather than baking in present-day confusion, we should design systems to converge towards more coherent and justifiable normative frameworks – not assumed in advance, but approached through reasoning, simulation, and reflection.
This is the motivation behind indirect normativity: use powerful systems to help determine what we would endorse under improved conditions, rather than locking in today’s values forever.15, 16, 17
None of this guarantees safety. But it targets the correct layer of the stack: direction, not just constraint.
Scepticism is essential (and it cuts both ways)
If we adequately motivate superintelligence, a lot else could go right as a result – the cascading benefits would be enormous.
Of course, scepticism18 is warranted – but it shouldn’t be confused with cynicism19. Scepticism is resistance to unearned confidence – in pseudoscience, ideology, and tech hype.20
Right now, many people are placing unwarranted trust in the idea that control methods will keep increasingly capable systems “serving human interests by default”. They won’t. Not reliably. Not at the scales that matter.
The sceptic’s job isn’t only to debunk hype. It’s also to scrutinise the values that get embedded – explicitly or implicitly – into our most powerful optimisers.
So the question in front of us is not merely: Can we control it? It’s: Can we trust what it’s optimising for?
Let’s aim for AI that can make better normative judgements than us – not merely faster judgements.
Conclusion
We should be cognisant of the risk of treating all human preferences as sacrosanct, and instead ground AI values in coherence, empirical robustness, and epistemic humility.
In an age where AI may soon out-think us on many fronts, the question isn’t just “Can we control it?” – but “Can we trust what it’s optimising for?”
Footnotes
- Capability-control measures (boxing, restrictions, tripwires, monitoring, sandboxes) are security systems. In security, raising attacker capability typically increases exploit discovery, social engineering success, and subtle “specification gaming” of rules. The capability-control literature explicitly notes that such methods become less effective as agents become more intelligent and better at exploiting flaws in human control systems. See Wikipedia on AI Capability Control ↩︎
- Empirical/technical support from computer security: containment is hard against a superior optimiser. A 2021 JAIR paper (“Superintelligence Cannot be Contained: Lessons from Computer Security”) argues that even minimal communication channels can undermine safety and draws direct lessons from the history of computer security: systems thought “contained” routinely fail under sustained adversarial pressure. That supports the claim that containment becomes increasingly brittle as the “attacker” (or optimising agent) becomes more capable. ↩︎
- If “human values” are inconsistent, contested, and context-sensitive, then specifying “optimise human values” doesn’t remove the hard part; it moves it inside the optimiser. The system must still answer: whose values, how aggregated, how to resolve conflicts, what to do under moral uncertainty, what to do when values shift, and what to do when proxies are manipulable. That is the value specification / aggregation problem in disguise: you haven’t solved it, you’ve embedded it. ↩︎
- Evidence that human preferences are unstable/inconsistent (behavioural decision research) – Robust findings in behavioural economics show preference reversals and other violations of procedural invariance: people’s expressed preferences can change depending on elicitation method, framing, or representation, contradicting the idea of a single stable “utility function” you can straightforwardly learn. See ‘The Causes of Preference Reversal’ (1990) by Amos Tversky, Paul Slovic and Daniel Kahneman, reprinted as Ch 8 of The Construction of Preference ↩︎
- Evidence that moral values vary substantially across people/cultures: Moral Foundations Theory (MFT) is explicitly motivated by systematic variation in moral emphasis across groups (e.g., different weightings of care/fairness vs loyalty/authority/sanctity).
Separately, the World Values Survey documents large cross-national differences on many “moral issues” (e.g., abortion attitudes), illustrating that “human values” is not a neat target even at the descriptive level. ↩︎
- Armstrong, Bostrom, and Shulman model an AI development race and show that when teams are rewarded for being first, they are incentivised to “skimp on safety precautions if need be” (their framing). In standard game-theoretic terms, the first-to-finish payoff structure pushes effort towards capability and away from safety, relative to what would be socially optimal. See Racing to the precipice: a model of artificial intelligence development – 01 Aug 2015 by Stuart Armstrong, Nick Bostrom & Carl Shulman ↩︎
- National security framing tends to treat frontier AI as a strategic asset. The U.S. NSCAI report explicitly frames AI as fuelling competition between governments and companies “racing to field it,” i.e., a strategic race dynamic rather than a leisurely optimisation of safety. See ‘Winning the Technology Competition‘ ↩︎
- In many tech contexts, speed produces advantage: earlier market capture, faster iteration, talent attraction, and “winner-takes-more” dynamics. The phrase “move fast and break things” became emblematic of this: Facebook used it internally as a motto (until 2014) to prioritise rapid shipping and tolerating breakage as the price of speed. ↩︎
- “Tax” here means a locally costly constraint whose benefits are partly externalised: extra evaluation, interpretability work, conservative deployment, red-teaming, alignment research, and delayed release all slow iteration and increase costs. In an arms-race framing, that looks like “paying extra” while rivals spend that same budget/time on capability. Armstrong–Bostrom–Shulman’s race model captures exactly this intuition: safety competes with speed-to-finish. See Racing to the precipice: a model of artificial intelligence development as mentioned above. ↩︎
- Outside AGI, we have repeated examples where commercial incentives tolerated known harms (e.g., engagement-optimised recommender systems linked to polarisation/addiction concerns) because the incentives rewarded growth. Commercial organisations can (and often do) have incentives to “take shortcuts on safety”, and that competitive pressure can cause a “race to the bottom” on safety standards. ↩︎
- Even if one is sceptical of objective moral truth, many of the values that survive rational scrutiny and empirical testing seem to converge – and that convergence matters a lot for alignment. ↩︎
- Why is it useful to think of values as a topographical map? It helps clarify three key things – structure, trajectory, and risk:
– Structure: It suggests that the space of possible value systems isn’t flat – some combinations are more coherent, stable, or life-promoting than others. This gives us a way to compare values, rather than treating them all as equally valid.
– Trajectory: It reframes moral progress as movement through this space – where better understanding, reasoning, and reflection help us climb toward ethical “high ground” rather than wander aimlessly or fall into moral traps.
– Risk: It reminds us that some regions of the value landscape are catastrophic – dead ends or dangerous attractors. If AI ends up optimising toward one of those, it doesn’t need to be malicious to be disastrous.
In short: the metaphor highlights that where we start, how we move, and where we end up in the value landscape matters – and that orientation is essential for achieving safe AI. ↩︎
- Rather than hard-coding brittle rules or baking in present-day human confusion, we should design AI systems to converge upon more coherent and justifiable moral frameworks – not ones we assume in advance, but ones AI helps us discover through reasoning, simulation, and reflection (i.e. via indirect normativity). ↩︎
- Many people are already attuned to the dangers of motivated reasoning, belief drift, and epistemic capture. A superintelligent AI aligned with coherent and grounded notions of goodness could resist misuse – even from those hoping to wield it like a genie for power or domination. Getting the values right makes it less likely the system becomes a tool of the worst actors. But doing that requires more than just clever engineering – it demands philosophical rigour, humility, and epistemic openness. ↩︎
- What “indirect normativity” means (and why it exists): Indirect normativity is the strategy of specifying a procedure for arriving at values/goals rather than directly specifying the final values in full detail. Bostrom uses the term in his technical writing when discussing ways to define utility in a manner that tracks what “really has value,” precisely because direct specification is error-prone and risks locking in mistaken proxies. See ‘Hail Mary, Value Porosity, and Utility Diversification’ – by Nick Bostrom (pdf) – especially the last paragraph. ↩︎
- Yudkowsky’s Coherent Extrapolated Volition (CEV) proposal aims at what we would want “if we knew more, thought faster, were more the people we wished we were…” – an explicit attempt to avoid freezing today’s confused, parochial, or conflicted preferences into the future. ↩︎
- This is the “value lock-in” worry: if a system with decisive influence is anchored to today’s partial, biased, or inconsistent normative state, it can entrench those mistakes permanently. Indirect normativity is motivated by the idea that we should not have to settle the entire normative question before creating very powerful optimisation. Bostrom’s discussion of indirect normativity is explicitly framed around avoiding the prejudices and preconceptions of the present generation being locked in. ↩︎
- In ordinary usage, scepticism is an attitude of doubting or demanding adequate reasons/evidence before accepting claims. See the Merriam-Webster definition of skepticism ↩︎
- Cynicism is about motives, not merely evidence. Modern “cynic/cynical” usage centres on distrust of people’s sincerity/integrity and motives (often implying contempt), not simply a demand for evidence. See the Merriam-Webster definition of cynic ↩︎
- Good scepticism targets claims with proportional scrutiny – especially in domains prone to overconfidence and motivated reasoning – without defaulting to blanket suspicion or nihilism. A common framing in sceptical education (e.g., Carl Sagan’s “baloney detection” tradition) explicitly distinguishes thoughtful scepticism from being merely negative or cynical. ↩︎

