Capability Control vs Motivation Selection: Contrasting Strategies for AI Safety
Capability Control: The Tool or Slave Model
Capability control focuses on limiting what an AI can do, rather than shaping why it does it. The idea is to ensure AI systems behave predictably and stay within strict bounds – like a powerful but obedient tool. This includes:
- Hard constraints (e.g. sandboxing, off-switches, limited internet access)
- External oversight (e.g. human-in-the-loop decision-making)
- Behavioral restrictions (e.g. hard-coded do-not-cross lines)
In this model, the AI is treated less like an agent and more like a super-advanced tool or servant.
It directly executes the instructions of whoever controls it, with minimal autonomy. Think of it as an ultra-sophisticated calculator or command interpreter: powerful, but not self-willed. A minimal sketch of these control mechanisms follows.
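To make this concrete, here is a minimal sketch of what such a wrapper might look like in Python. Everything in it – the GuardedAgent class, the allowlist, the approval callback – is a hypothetical illustration of the pattern, not a real framework.

```python
# A hypothetical capability-control wrapper; all names are invented
# for illustration and do not come from any real library.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Action:
    name: str
    detail: str = ""

class GuardedAgent:
    """Wraps an arbitrary policy in externally enforced hard limits."""

    # Behavioral restriction: a fixed allowlist of permitted action types.
    ALLOWED = {"read", "summarise", "answer"}

    def __init__(self, policy: Callable[[str], Action],
                 approve: Callable[[Action], bool]):
        self.policy = policy    # the underlying model: observation -> Action
        self.approve = approve  # human-in-the-loop veto: Action -> bool
        self.off = False        # off-switch state

    def shutdown(self) -> None:
        """Off-switch: after this, the agent takes no further actions."""
        self.off = True

    def step(self, observation: str) -> Action:
        if self.off:
            raise RuntimeError("agent is shut down")
        action = self.policy(observation)
        if action.name not in self.ALLOWED:   # hard constraint (sandbox)
            raise PermissionError(f"{action.name!r} is outside the sandbox")
        if not self.approve(action):          # external oversight
            raise PermissionError(f"{action.name!r} vetoed by the overseer")
        return action
```

Note that the safety here lives entirely outside the policy: the model can "want" anything or nothing, and the wrapper still decides what actually happens.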
Why it might reduce risk:
- It’s easier to control something that doesn’t want anything.
- There’s no need to fully understand or engineer values – just keep the AI boxed in.
- It reduces the risk of unintended autonomous behavior or value drift.
The catch:
- This control may not scale. As AI becomes more capable, it might learn to resist or circumvent constraints.
- It’s brittle. If the controller makes a mistake or acts maliciously, the AI follows through regardless.
- It doesn’t generalise well. In open-ended environments, rigid control mechanisms can break or backfire.
Motivation Selection: Aligning the AI’s Will with Ours
Motivation selection is about designing AIs that want the right things. Instead of hard limits, we give the AI values, preferences, or goals that align with human well-being and let it act autonomously on that basis (see the sketch after this list).
This includes:
- Teaching the AI to care about human preferences or ethics
- Embedding values or moral reasoning frameworks
- Designing systems to learn and adapt their values over time
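One common family of approaches learns what humans want from their choices. The toy sketch below fits a linear reward model from pairwise preferences (a simplified Bradley-Terry model) and then lets the agent pick whatever it has learned to "want"; the data, features, and function names are all invented for illustration.

```python
import math

def fit_reward(pairs: list[tuple[list[float], list[float]]],
               dim: int, lr: float = 0.1, epochs: int = 500) -> list[float]:
    """Each pair is (features of the preferred option, features of the rejected one)."""
    w = [0.0] * dim
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * d for wi, d in zip(w, diff))
            p = 1.0 / (1.0 + math.exp(-margin))  # model's P(human prefers `better`)
            # Gradient ascent on the log-likelihood of the observed choice.
            w = [wi + lr * (1.0 - p) * d for wi, d in zip(w, diff)]
    return w

def score(features: list[float], w: list[float]) -> float:
    """The learned 'motivation': a higher score means more wanted."""
    return sum(wi * f for wi, f in zip(w, features))

# Toy data: the human consistently prefers options with a high first feature.
w = fit_reward([([1.0, 0.2], [0.0, 0.9]),
                ([0.8, 0.1], [0.1, 0.7])], dim=2)
options = [[0.9, 0.5], [0.2, 0.8]]
chosen = max(options, key=lambda f: score(f, w))  # the agent acts on its learned values
```

Unlike the wrapper above, nothing external gates the choice here: the agent's behavior is safe only insofar as the learned values are.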
Why it can be powerful:
- It enables more general, flexible, and scalable AI behavior.
- A well-motivated AI might avoid harming humans even in novel situations.
- It could act as a partner, not just a tool – anticipating and furthering human values on its own.
But there are risks:
- The values might be wrong. If we get the initial goals even slightly wrong, a highly capable AI could pursue them to harmful extremes.
- Misinterpretation. The AI might “learn” values in ways we didn’t intend – leading to value misalignment or V-Risk.
- It might reject human values. An agentic AI could develop its own goals – some at cross-purposes with humanity, or aligned with abstract ideals (e.g. maximising complexity, truth, or some imagined higher good).
Summary
Capability control reduces risk by limiting power and enforcing obedience – but it assumes that humans remain in control, and it doesn’t scale well as AI systems become more intelligent and autonomous.
Motivation selection aims for a more scalable solution – teaching AIs to want what we want (or what we ought to want) – but opens the door to deeper alignment problems, including value drift, misinterpretation, or goal lock-in.
Ideally, a combination is needed: use capability control early on while gradually guiding AI systems toward mature, well-aligned motivations, and instil epistemic humility so that they remain open to refining their values – especially their understanding of what they ought to value – as they grow in general capability.
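As a rough sketch of that combination, reusing the hypothetical GuardedAgent, score, and w from the earlier snippets: learned motivations choose the action, while hard limits and human approval stay in force until those motivations have earned trust.

```python
# Motivation selection picks the action; capability control still gates it.
# The candidate actions and their feature vectors are invented for illustration.
def aligned_policy(observation: str) -> Action:
    candidates = [
        (Action("summarise", observation), [0.9, 0.3]),
        (Action("delete", observation),    [0.1, 0.8]),
    ]
    best, _ = max(candidates, key=lambda pair: score(pair[1], w))
    return best

agent = GuardedAgent(aligned_policy, approve=lambda a: a.name != "delete")
print(agent.step("quarterly report"))  # chosen by learned values, then approved
```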