AI Safety: From Control to Value Alignment
This is an abstract for a talk at Skepticon 2025, held at Melbourne University on Oct 4–5.
Short:
In evaluating strategies for AI Safety, capability control may buy us time – it may even calm our nerves – but it won’t hold once AI outsmarts the safeguards. The deeper challenge is value alignment: increasing the likelihood that superintelligent AI successfully navigates the landscape of value – optimising for coherent and grounded notions of the good rather than getting locked into destructive trajectories.
The Fragile Illusion of Control
As AI systems become more capable, we face a question more foundational than how to contain them: what values should they be aligned to? I’m skeptical of the conventional reliance on capability control – tripwires, sandboxing, and shutdown switches. These may work for weaker AI systems, but they become increasingly brittle as AI power scales.
Now this may seem obvious, but I think it’s worth highlighting: values sit upstream of everything a superintelligent AI agent will do – shaping how it pursues goals, interprets instructions, evaluates tradeoffs, and plots trajectories through the future. If those values are flawed, then greater intelligence just means faster, more competent optimisation of the wrong thing. That’s why value alignment must be a major pillar in the AI risk portfolio, not just a philosophical afterthought.
Human values are messy and all over the place. Attempting to align AI directly with human values may sound intuitive, but it means aiming at a target that’s inconsistent, unstable, and culturally fragmented. Worse still, competitive market dynamics and geopolitical pressures incentivise cutting corners – in high-stakes races, safety tends to be the first thing tossed overboard. Nobody finds misalignment appealing, of course, but people are too often enamoured with the short-term gains of moving fast and breaking everything¹. Unfortunately, those incentives make alignment failure less of a rare misstep and more of a default trajectory.
I explore the concept of a Landscape of Value – a conceptual map of moral², epistemic, and rational dimensions. Think of it as a topographical map of possible value systems, where some regions lead to flourishing and others to catastrophe – and our job is to map that terrain before AI drives us across it at superhuman speed.
A superintelligent system aligned with coherent notions of goodness could resist misuse – even from those hoping to wield it like a genie for power or domination. Getting the values right makes it less likely the system becomes a tool of the worst actors. But doing that requires more than just clever engineering – it demands philosophical rigour, humility, and epistemic openness.
Rather than hard-coding brittle rules or baking in present-day human confusion, we should design AI systems to converge upon more coherent and justifiable moral frameworks – not ones we assume in advance, but ones AI helps us discover through reasoning, simulation, and reflection.
Why talk about AI at a Skeptics conference?
To me, skepticism is about resisting unearned confidence – in pseudoscience, in ideology, and increasingly, in technology. As AI systems grow more powerful, many are placing unwarranted trust in their ability to serve human interests by default. This talk challenges that assumption. It applies a skeptical toolkit to the foundations of AI ethics, arguing that capability control alone is brittle, and that value alignment – ensuring AI systems pursue genuinely worthwhile goals – must be a central concern.
Skeptics are already attuned to the dangers of motivated reasoning, belief drift, and epistemic capture – all of which apply doubly to machine reasoning at scale. This talk explores whether AI can help us discover better values, not just reflect back our current ones (which I argue are confused). It highlights the risk of treating all human preferences as sacrosanct, and instead suggests ways to ground values in coherence, empirical robustness, and epistemic humility while respecting uncertainty.
In an age where AI may soon out-think us on many fronts, the question isn’t just “Can we control it?” – but “Can we trust what it’s optimising for?” That’s a skeptical question, and it demands a skeptical toolkit: critical reasoning, clarity about moral epistemology, and resistance to wishful thinking.
This talk is a call for skeptics to bring their tools to bear not only on AI hype, but on the values we encode – explicitly or not – into our most powerful machines.
Footnotes
1. See Meditations on Moloch by Scott Alexander.
2. Even if one is skeptical of objective moral truth, many of the values that survive rational scrutiny and empirical testing seem to converge – and that convergence matters a lot for alignment.