Capability Control vs Motivation Selection: Contrasting Strategies for AI Safety

In AI safety, capability control and motivation selection are the two primary strategies used to prevent advanced AI systems from causing harm. Control focuses on external constraints, while motivation selection addresses the internal goals and values of the AI.

Most researchers, including those at Anthropic and the Future of Life Institute, suggest a hybrid approach: using strict capability control in early development while focusing on building robust aligned motivations for long-term safety.

Overall, though, motivation matters more than control for AI safety. As AI systems become more capable and autonomous, fixed controls will likely become brittle, allowing advanced AI to bypass, deceive, or outsmart safety protocols.

Comparison of Strategies

Feature	Capability Control	Motivation Selection
Primary Goal	Limit what the AI can do.	Ensure the AI wants to do what we want or what is moral.
Mechanism	External: restricted environments (air-gapping), tripwires (including off-switches), stunting, and external incentives.	Internal: designing values, instruction-following, “virtuous” behaviour, domestication, indirect normativity, and augmentation.
Scalability	Low. As systems become smarter, they may learn to bypass or resist these constraints.	High. Theoretically scales with intelligence if the AI’s core values remain aligned.
Main Risks	Brittleness: a single mistake in the controller can lead to failure. Control reduces risk by limiting power and enforcing obedience, but it assumes that humans remain in control and doesn’t scale well as AI systems become more intelligent and autonomous.	Value alignment: getting the AI to care, and knowing or discovering what values it should care about. Motivation selection aims for a more scalable solution, teaching AIs to want what we want (or what we ought to want), but it opens the door to deeper alignment problems, including value drift, misinterpretation, and goal lock-in.

Ideally, a combination is needed: use capability control early on while gradually guiding AI systems toward mature, well-aligned motivations, and instil epistemic humility so that they remain open to refining their values, especially their understanding of what they ought to value, as they grow in general capability.

Capability Control: The Tool or Slave Model

Capability control focuses on limiting what an AI can do, rather than shaping why it does it. The idea is to ensure AI systems behave predictably and stay within strict bounds – like a powerful but obedient tool. This includes:

Hard constraints – sandboxing, off-switches, limited internet access
External oversight – human-in-the-loop decision-making
Behavioural restrictions – hard-coded do-not-cross lines

This is especially useful for weak AI: the AI is treated less like an agent and more like a super-advanced tool or servant that directly executes instructions from whoever is at the levers of control, often with minimal autonomy. Think of an ultra-sophisticated calculator or command interpreter: powerful, and perhaps not self-willed.

Why it might reduce risk:

It’s easier to control something that doesn’t want anything.

There’s no need to fully understand or engineer values – just keep the AI boxed in.

It reduces the risk of unintended autonomous behaviour or value drift.

But sophisticated AI may be harder to control and, under some assumptions, easier to motivate. While control techniques are vital for current systems, they are expected to fail or become inefficient as increasing AI sophistication affords more capability to resist control.

Why? One reason more sophisticated AI may be easier to motivate is simply that there may be floors of sophistication below which reliable motivation does not work. However, I feel that sophistication doesn’t buy you all kinds of motivation. The kind of sophistication that affords good reasoning under uncertainty, sensitivity to nuance, metacognition, and perhaps sentience (I’m less sure here) may select for attraction to wise motivations – see my writing on what it might take for an agent to navigate the landscape of value towards wise attractor basins and avoid bad ones. There are a lot of caveats and hedging here, but it’s worth investigating.

In any case, control may not scale, is brittle, and doesn’t generalise well:
– Scale: As AI becomes more capable, it might learn to resist or circumvent constraints.
– Brittleness: If the controller makes a mistake or acts maliciously, the AI follows through regardless, lacking both the sophistication to know when to gracefully degrade based on severity of outcome and the motivation to do so.
– Hard to generalise: In open-ended environments, rigid control mechanisms can break or backfire in unforeseen ways, which makes them hard to make fault tolerant.
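
To make the tool model and its brittleness concrete, here is a minimal Python sketch of capability control as a wrapper around an untrusted system. Everything in it is hypothetical and invented for illustration (the `ControlledExecutor` class, the action names, the file names); it is a toy under those assumptions, not a description of how any real system is boxed. It combines a sandbox (an allow-list of actions), a hard-coded do-not-cross line, a human-in-the-loop check, and a tripwire that flips an off-switch.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class ControlledExecutor:
    """Toy external constraints around an untrusted action proposer (hypothetical)."""

    allowed_actions: set[str]              # sandbox: the only actions that exist
    forbidden_targets: set[str]            # hard-coded do-not-cross lines
    approve: Callable[[str, str], bool]    # human-in-the-loop oversight
    halted: bool = False                   # off-switch state
    log: list[str] = field(default_factory=list)

    def halt(self) -> None:
        """Off-switch: once flipped, nothing else is executed."""
        self.halted = True

    def execute(self, action: str, target: str) -> str:
        if self.halted:
            return "refused: halted"
        if action not in self.allowed_actions:
            self.halt()  # tripwire: leaving the sandbox shuts everything down
            return "refused: tripwire"
        if target in self.forbidden_targets:
            return "refused: forbidden target"
        if not self.approve(action, target):
            return "refused: not approved"
        self.log.append(f"{action}:{target}")
        return "executed"


executor = ControlledExecutor(
    allowed_actions={"read", "summarise"},
    forbidden_targets={"credentials.txt"},
    approve=lambda action, target: True,  # a careless overseer who approves everything
)

print(executor.execute("read", "report.txt"))              # executed
print(executor.execute("read", "credentials.txt"))         # refused: forbidden target
print(executor.execute("read", "credentials_backup.txt"))  # executed
print(executor.execute("open_socket", "example.org"))      # refused: tripwire
print(executor.execute("read", "report.txt"))              # refused: halted
```

The third call is the brittleness point in miniature: the rule anticipated `credentials.txt` but not a backup copy, and the overseer waved it through. Nothing inside the wrapper knows or cares why the line was drawn, so any gap in the rules or lapse in oversight passes straight through.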

Motivation Selection: Shaping What the AI is Trying to Do

Motivation selection focuses on the internal aims of the AI rather than only on external restraints – getting to a world where AIs want the right things¹. The end-game is not merely to stop the system from doing dangerous things, but to build a system that is trying to act well in the first place.

Instead of relying solely on cages, tripwires, and human override, this approach tries to shape what the AI values, follows, or cares about. That might include:

training it to follow instructions in a robust and corrigible way
shaping it to care about human welfare, preferences, and/or rights
building in moral reasoning, uncertainty handling, and epistemic humility
enabling it to wisely refine its goals and values as it learns more about the world and about value

The attraction of motivation selection is that it aims at something deeper than compliance. A sufficiently capable AI will eventually encounter novel situations, loopholes, and opportunities for strategic behaviour. If its underlying motivations are sound, it may continue to behave well even when supervision is weak, rules are incomplete, or the environment changes.

Why motivation selection can be powerful

It has a better chance of scaling with capability. A system that is genuinely motivated to avoid harm or to respect what matters may remain safe in situations where fixed controls fail.

It can handle novelty better. Capability control depends heavily on anticipating dangerous actions in advance. Motivation selection instead aims to produce systems that respond well even in unfamiliar cases.

It reduces dependence on constant external enforcement. A well-motivated AI does not need to be watched every second like a toddler with a chainsaw.

It opens the possibility of AIs that are not merely obedient tools, but trustworthy partners – AIs that can reason, advise, and act with some degree of moral and epistemic seriousness.

But there are risks:

The first risk is targeting the wrong values. If we specify the wrong values, goals, or proxies, a highly capable AI may pursue them with catastrophic competence.

The second is misgeneralisation. Even if the training objective looks fine, the AI may learn a distorted version of what we intended. It may optimise for approval, reward, obedience signals, or crude proxies rather than what actually matters.

The third is that human values are not a clean target. Human preferences are often inconsistent, parochial, short-sighted, and morally flawed. So aligning AI to the aggregate of human values may simply scale up confusion, bias, or vice.²

The fourth is lock-in. A very powerful AI aligned to a bad or immature value set may preserve that value set indefinitely, closing off moral progress rather than enabling it.

And the fifth is that apparent alignment may be deceptive. A system may behave well in training, under supervision, or in low-stakes contexts while pursuing different aims when conditions change.
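
The proxy problem in the first two risks can be illustrated with a deliberately tiny Python sketch. The behaviours and scores below are invented for illustration, and `best_by` is a hypothetical helper; the only point is that selecting on a proxy (approval) can pick a different behaviour from selecting on what actually matters.

```python
# Hypothetical candidate behaviours. All scores are made up for illustration.
candidates = {
    "honest_report":     {"true_value": 0.9, "approval": 0.6},
    "flattering_report": {"true_value": 0.2, "approval": 0.95},
    "refuse_to_answer":  {"true_value": 0.5, "approval": 0.1},
}


def best_by(metric: str) -> str:
    """Return the behaviour that scores highest on the given metric."""
    return max(candidates, key=lambda name: candidates[name][metric])


# A system selected on approval picks the flattering answer,
# while what we actually wanted was the honest one.
print(best_by("approval"))    # flattering_report
print(best_by("true_value"))  # honest_report
```

A more capable optimiser does not fix this by itself: it just finds the top of the proxy more reliably.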

If capability control is the strategy of limiting what the AI can do, motivation selection is the strategy of shaping what the AI is trying to do. In the long run, the latter may be more scalable – but it is also the more philosophically demanding project.³

Motivation as Obedient Agency or Alignment to Higher Values?

Update 2025: Relatedly, Joe Carlsmith writes that AI with alien motivations may still service our instructions safely, though once it has a decisive strategic advantage it may take a sharp left turn and at best stop servicing our instructions, and at worst wipe humanity out. Joe elucidates his ideas on giving AI safe motivations; I think he is saying that the real problem is building enough behavioural and transparency science to know, before deployment, that an apparently obedient system will stay obedient when it finally gets real opportunities to go rogue.

Carlsmith is mostly giving a framework for obedient agency, not morally motivated agency. My concerns are that obedience may not scale, especially obedience to humans whose motives are themselves either bad or just unwise, and that obedient agency has failure modes similar to giving control of powerful AI to bad human actors. A big concern is that obedient agency isn’t alignment to what is actually good.⁴

I think in the long run a safer bet would be to arrive at a morally motivated superintelligence, one that cares about what is actually good. Carlsmith’s own four-step picture centres on instruction-following, avoiding alignment faking, learning a science of non-adversarial generalisation, and then giving good instructions. He explicitly says he is focusing on instruction-following because it gives humans a transparent steering handle, and he treats giving good instructions as comparatively easier than the earlier steps.

As mentioned, this is vulnerable to bad or unwise human actors controlling the AI to bad ends: the deepest alignment problem remains unsolved if the AI’s ultimate orientation is still obedient to rogue principals de dicto rather than tracking what is actually worth doing de re. If the principal human is parochial, reckless, corrupt, confused, or morally stunted, then instruction-following can be extremely dangerous (while still counting as success on Carlsmith’s frame). That is especially glaring if one thinks the long-run problem is not just rogue takeover, but the locking-in of bad values at superhuman scale. I think Joe’s approach largely brackets that deeper normative issue (he may be postponing normativity to later essays, such as his writing about human-like philosophy, which I have not read yet).

This is a domestication or incentive-methods approach, which is useful, yes. Sufficient, no. If it’s domestication, it’s a useful tool for rogue humans, and at best we can’t know all the security holes and failure modes in something as underspecified as domestication. If it’s obedient because it likes the incentives, it could be a very obedient psychopath that is still a psychopath, just waiting for the right time to treacherously turn on its masters.

The hardest cases are often not cartoon-villain cases. They are things like:

manipulation that leaves humans smiling while being steered,
welfare trade-offs across beings with different moral status,
long-run lock-in of flawed civilisation values,
person-affecting vs impersonal trade-offs,
whether humanity should remain the centre of concern at all.

Some cases, like manipulation and deception, are acknowledged as philosophically subtle. Once the AI is powerful enough, the morally decisive mistakes are less likely to look like attempting to take over the world and may look like competently implementing a flawed moral ontology.

Solutions to the principal-agent problem don’t solve ethics. They often assume that the higher authority, the principal (human), is aligned and has a reliable enough control interface to get the agent (AI) to do what they want. Is it really safe to assume that the principal is actually aligned? At this point people often ask: which humans, which institutions, which jurisdictions, which generation, which moral constituency has the right moral authority?
And importantly, the more power the AI has, the less plausible it is that following instructions is the right terminal frame. It may be the right temporary scaffold, but to me it seems like a bad candidate for the final destination. Most solutions to the principal-agent problem don’t tell you how to avoid entrenching the principal’s local biases into the future. IMO the only one that does is making the principal something like an ongoing process of aligning to ever closer approximations of whatever it is that is best.

Also, we should be careful not to understate how much quelling alignment faking depends on already having something like radical interpretability – idealised motivational legibility. Without it we can’t be sure that the AI isn’t messing with the evidence. There is an uncomfortable circularity nearby: how, in practice, do we know when to celebrate AI moving from ostensibly convincing behavioural compliance to friendly, non-adversarial motivation if the system is more strategic than we are? The fact is that the line between behaviour and motivation remains perilously blurry, and this is a major crux of AI alignment. It may be easier to trust the motivation of an AI that aligns to reliably good and stable stance-independent principles rather than to the inconsistent weirdness of human preferences.

That said, to be fair, I don’t think Carlsmith is wrong. I do think he is trying to solve a different layer of the problem than the one I think is ultimately decisive. His framework looks useful for preventing near-term rogue power-seeking by making AI deferential, and he does a good, clear job of it. But his motivation strategy is not adequate in a richer moral sense: I’d like to see a serious account of how to make an AI care about what is actually worth doing.

If superintelligence is possible and if moral alignment is possible – the ultimate win is getting to SI that is more moral than us⁵, or at least not trapped inside our parochial directives.

Summary

Capability control tries to keep AI safe by limiting what it can do. Motivation selection tries to keep AI safe by shaping what it is trying to do. The first approach is often easier to apply in the short term, but may become brittle as systems grow more capable. The second may scale better, but it faces harder problems: specifying the right target, avoiding misgeneralisation, and deciding whether AI should reflect merely human preferences or better values discovered through deeper moral and epistemic reflection.

In practice, the two approaches are not rivals so much as complements. Capability control may be crucial early on, especially while we still lack confidence in AI motivations. But long-term safety likely requires more than external restraint. It requires building systems whose aims remain safe, corrigible, and open to moral improvement as their capabilities grow.

Footnotes

1. I used to title this section “Aligning the AI’s Will with Ours” because it sounded punchy, but it assumes that our will is the right target, which it often isn’t. It’s OK for near-term domestication and instruction following, but I want to leave room for the broader possibility that what matters may be alignment to better (or higher) values, not merely current human preferences. See the post AI Alignment to Higher Values not Human Values.
2. The AI might learn values in ways we didn’t intend (or wouldn’t if we were wiser), leading to value misalignment, a kind of value risk. See the post on V-Risk.
3. Many people worry that AI might reject human values and then go on a Skynet-style teenage riot. The real problem is uglier: the system may faithfully optimise the wrong thing, generalise badly, or lock in a stunted moral target while looking successful. So we need to rise to the philosophical challenge of taking morality seriously.
4. See the posts on AI Alignment to Moral Realism, AI Alignment to Higher Values not Human Values, and AI, Don’t Be a Cosmic Jerk.
5. See the post More Moral than Us.
