AI Safety: From Control to Value Alignment
This is an abstract for a talk at Skepticon 2025, held at Melbourne University on Oct 4–5.
Short:
In evaluating strategies for AI Safety, capability control may buy us time – it may even calm our nerves – but it won’t hold once AI outsmarts the safeguards. The deeper challenge is value alignment: increasing the likelihood that superintelligent AI successfully navigates the landscape of value – optimising for coherent and grounded notions of the good rather than getting locked into destructive trajectories.
The Fragile Illusion of Control
As AI systems become more capable, we face a question more foundational than how to contain them: what values should they be aligned to? I’m skeptical of the conventional reliance on capability control – tripwires, sandboxing, and shutdown switches. These may work for weaker AI systems, but they become increasingly brittle as AI power scales.
Now this may seem obvious, but I think it’s worth highlighting: values sit upstream of everything a superintelligent AI agent will do – shaping how it pursues goals, interprets instructions, evaluates tradeoffs, and plots trajectories through the future. If those values are flawed, then greater intelligence just means faster, more competent optimisation of the wrong thing. That’s why value alignment must be a major pillar in the AI risk portfolio, not just a philosophical afterthought.
Human values are messy and all over the place. Attempting to align AI directly with human values may sound intuitive, but it means aiming at a target that’s inconsistent, unstable, and culturally fragmented. Worse still, competitive market dynamics and geopolitical pressures incentivise cutting corners – in high-stakes races, safety tends to be the first thing tossed overboard. Nobody finds misalignment appealing, of course, but people are too often enamoured with the short-term gains of moving fast and breaking everything¹. Unfortunately, those incentives make alignment failure less of a rare misstep and more of a default trajectory.
I explore the concept of a Landscape of Value – a conceptual map of moral², epistemic, and rational dimensions. Think of it as a topographical map of possible value systems, where some regions lead to flourishing and others to catastrophe – and our job is to map that terrain before AI drives us across it at superhuman speed.
A superintelligent system aligned with coherent notions of goodness could resist misuse – even from those hoping to wield it like a genie for power or domination. Getting the values right makes it less likely the system becomes a tool of the worst actors. But doing that requires more than just clever engineering – it demands philosophical rigour, humility, and epistemic openness.
Rather than hard-coding brittle rules or baking in present-day human confusion, we should design AI systems to converge upon more coherent and justifiable moral frameworks – not ones we assume in advance, but ones AI helps us discover through reasoning, simulation, and reflection.
Why talk about AI at a Skeptics conference?
To me, skepticism is about resisting unearned confidence – in pseudoscience, in ideology, and increasingly, in technology. As AI systems grow more powerful, many are placing unwarranted trust in their ability to serve human interests by default. This talk challenges that assumption. It applies a skeptical toolkit to the foundations of AI ethics, arguing that capability control alone is brittle, and that value alignment – ensuring AI systems pursue genuinely worthwhile goals – must be a central concern.
Skeptics are already attuned to the dangers of motivated reasoning, belief drift, and epistemic capture – all of which apply doubly to machine reasoning at scale. This talk explores whether AI can help us discover better values, not just reflect back our current ones (which I argue are confused). It highlights the risk of treating all human preferences as sacrosanct, and instead suggests ways to ground values in coherence, empirical robustness, and epistemic humility while respecting uncertainty.
In an age where AI may soon out-think us on many fronts, the question isn’t just “Can we control it?” – but “Can we trust what it’s optimising for?” That’s a skeptical question, and it demands a skeptical toolkit: critical reasoning, clarity about moral epistemology, and resistance to wishful thinking.
This talk is a call for skeptics to bring their tools to bear not only on AI hype, but on the values we encode – explicitly or not – into our most powerful machines.
Footnotes
1. See Meditations on Moloch by Scott Alexander.
2. Even if one is skeptical of objective moral truth, many of the values that survive rational scrutiny and empirical testing seem to converge – and that convergence matters a lot for alignment.