Controlling AI isn’t enough
Are you sceptical that controlling AI is enough? You should be: we also have to make it care about the right things.
As AI systems become more capable, we face a question more foundational than how to contain them: what values should they be aligned to? Upstream value alignment challenges the conventional reliance on “capability control” – tripwires, sandboxing, and shutdown switches – measures that may work for weaker systems but become increasingly brittle and less fault-tolerant as AI power scales.
Values sit upstream of everything a superintelligent AI will do – how it pursues goals, interprets instructions, evaluates trade-offs, and plots trajectories through the future. If those values are flawed, then greater intelligence just means faster, more competent optimisation of the wrong thing. That’s why value alignment must be a major pillar in the AI risk portfolio, not a philosophical afterthought.
Attempting to directly align AI with human values may sound intuitive, but it’s like aiming at a target that’s inconsistent, unstable, and culturally fragmented. Worse still, competitive market dynamics and geopolitical pressures incentivise actors to cut corners – because in high-stakes races, safety tends to be the first thing tossed overboard. Some might ask, “Why would anyone develop misaligned AI?” It’s not misalignment that’s appealing, of course; what’s appealing are the short-term gains from moving fast and breaking everything. Unfortunately, those incentives make alignment failure less of a rare misstep and more of a default trajectory.
As a more robust alternative, I explore the concept of a “Landscape of Value” – a conceptual map of moral¹, epistemic, and rational dimensions. It is useful² to think of it as a topographical map of possible value systems, where some regions lead to flourishing and others to catastrophe – and our job is to map that terrain before we arrive at a potentially uncharted and dangerous area.
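To make the metaphor a little more concrete, here is a minimal sketch – purely my own toy illustration, with a made-up one-dimensional “quality” function standing in for the real landscape – of how greedy optimisation from different starting points settles into different basins, one of which is a dangerous attractor:

```python
# Toy illustration (not from the article): model the "Landscape of Value" as a
# scalar quality function over a 1-D space of value systems, then hill-climb
# from different starting points. Some starts reach the high, broad peak
# (flourishing); others get stuck on a lower, narrow peak (a dangerous
# attractor that still looks locally optimal).
import numpy as np

def value_quality(x: float) -> float:
    """Hypothetical 'quality' of a value system at position x."""
    flourishing = 1.0 * np.exp(-((x - 2.0) ** 2) / 2.0)       # high, broad peak
    dangerous_attractor = 0.6 * np.exp(-((x + 2.0) ** 2) / 0.5)  # lower, narrow peak
    return flourishing + dangerous_attractor

def hill_climb(x0: float, step: float = 0.01, iters: int = 5000) -> float:
    """Greedy local search: follow a numerical gradient estimate from x0."""
    x = x0
    for _ in range(iters):
        grad = (value_quality(x + 1e-4) - value_quality(x - 1e-4)) / 2e-4
        x += step * grad
    return x

for start in (-3.0, 0.5, 3.0):
    end = hill_climb(start)
    print(f"start={start:+.1f} -> ends near x={end:+.2f}, quality={value_quality(end):.2f}")
```

Run it and two of the three starting points climb to the high peak while the third gets stuck on the lower one – the point of the metaphor in miniature: where you start and how you move determine where you end up.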
A superintelligent system aligned with coherent and grounded notions of goodness could resist misuse – even from those hoping to wield it like a genie for power or domination. Getting the values right makes it less likely the system becomes a tool of the worst actors. But doing that requires more than just clever engineering – it demands philosophical rigour, humility, and epistemic openness.
Rather than hard-coding brittle rules or baking in present-day human confusion, we should design AI systems to converge upon more coherent and justifiable moral frameworks – not ones we assume in advance, but ones AI helps us discover through reasoning, simulation, and reflection.
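What might “converging on a more coherent framework” look like, even in toy form? Here is a minimal sketch – again entirely illustrative, assuming for simplicity that case judgements can be scored numerically and that a “principle” is just a linear rule fitted to them – of a reflective-equilibrium-style loop: fit a principle to the judgements, discard the judgement that coheres with it least, and repeat:

```python
# Toy illustration (not from the article): a simplified "reflection" loop over
# numeric stand-ins for moral judgements.
import numpy as np

rng = np.random.default_rng(0)

# Each case is a vector of morally relevant features; each judgement is a
# scalar verdict. Most verdicts follow an underlying pattern; a few are noise.
true_weights = np.array([0.7, -0.3, 0.5])
cases = rng.normal(size=(30, 3))
judgements = cases @ true_weights + rng.normal(scale=0.05, size=30)
judgements[:5] = rng.normal(scale=2.0, size=5)   # five inconsistent outliers

def reflect(cases, judgements, rounds=8, tolerance=0.2):
    """Fit a linear principle, discard the least coherent judgement, repeat."""
    for _ in range(rounds):
        weights, *_ = np.linalg.lstsq(cases, judgements, rcond=None)
        residuals = np.abs(cases @ weights - judgements)
        worst = int(np.argmax(residuals))
        if residuals[worst] < tolerance:   # everything left is coherent enough
            break
        cases = np.delete(cases, worst, axis=0)
        judgements = np.delete(judgements, worst)
    return weights, judgements

weights, kept = reflect(cases, judgements)
print("fitted principle:", np.round(weights, 2))   # roughly recovers the pattern
print("judgements retained:", len(kept), "of 30")
```

Real moral reasoning is nothing like this tidy, but the shape of the loop is the point: keep the judgements and principles that survive mutual scrutiny, and let the rest go.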
If we get superintelligence right, a lot else could go right as a result – the cascading benefits would be enormous. Of course, scepticism (not cynicism) is warranted – and essential. Scepticism is about resisting unearned confidence, whether in pseudoscience, ideology, or today’s tech hype machine. As AI systems grow more powerful, many are placing unwarranted trust in control methods to keep AI serving human interests by default. Capability control alone is brittle, and value alignment – ensuring AI systems pursue genuinely worthwhile goals – must be a central concern.
Many are already attuned to the dangers of motivated reasoning, belief drift, and epistemic capture – all of which apply doubly to machine reasoning at scale. Can we train AI to help discover better values³, not just reflect back our current confusion? I think so.
We should be cognisant of the risk of treating all human preferences as sacrosanct, and instead ground AI values in coherence, empirical robustness, and epistemic humility.
In an age where AI may soon out-think us on many fronts, the question isn’t just “Can we control it?” – but “Can we trust what it’s optimising for?” That’s a sceptical question, and it demands a sceptical toolkit: critical reasoning, clarity about moral epistemology, and resistance to wishful thinking.
Sceptics should bring their tools to bear not only on AI hype, but on the values we encode – explicitly or not – into our most powerful machines.
Let’s teach AI to make better normative judgements than we do – not just faster ones.
Footnotes
1. Even if one is sceptical of objective moral truth, many of the values that survive rational scrutiny and empirical testing seem to converge – and that convergence matters a lot for alignment. ↩︎
2. Why is it useful to think of values as a topographical map? It helps clarify three key things – structure, trajectory, and risk:
   - Structure: It suggests that the space of possible value systems isn’t flat – some combinations are more coherent, stable, or life-promoting than others. This gives us a way to compare values, rather than treating them all as equally valid.
   - Trajectory: It reframes moral progress as movement through this space – where better understanding, reasoning, and reflection help us climb toward ethical “high ground” rather than wander aimlessly or fall into moral traps.
   - Risk: It reminds us that some regions of the value landscape are catastrophic – dead ends or dangerous attractors. If AI ends up optimising toward one of those, it doesn’t need to be malicious to be disastrous.
   In short: the metaphor highlights that where we start, how we move, and where we end up in the value landscape all matter – and that orientation is essential for achieving safe AI. ↩︎
3. For instance via “indirect normativity” – see this post, which highlights that directly programming an AI with our current values risks value lock-in – locking in forever the potentially flawed, narrow-minded, or incomplete perspectives of our present generation. Indirect normativity attempts to circumvent this risk by utilising powerful AI to help solve pressing normative issues – for instance, by figuring out what we would want upon ideal reflection or under improved conditions, rather than hard-coding our current norms. ↩︎