Seven AI Safety Strategies – and the One That’s Missing

BlueDot Impact recently published a detailed overview1 of seven key strategies in AI safety – ranging from international cooperation to domestic race-winning and societal resilience. It’s the result of 100+ hours of analysis, interviews, and synthesis by Adam Jones, and it’s a solid piece of strategic landscape mapping. I came across it via Adam Jones’ post2 on LinkedIn.

But as I read through the entire piece, I was struck by something conspicuous by its absence:

Where’s the serious treatment of value alignment?

The strategies lean heavily toward containment and control – pausing development, enforcing treaties, or ensuring the “right actors” win the race. That might buy time. It might even work tactically in the short term. But value alignment – ensuring that powerful AI systems actually act on defensible, convergently desirable values – is barely mentioned, let alone treated as foundational.

This isn’t a dismissal of the work – the framework is valuable. But if we’re talking about “key paths to success,” we should be asking: Success at what? If the endgame is safe, aligned AI, we can’t afford to hand-wave away the alignment part.

What I mean by “value alignment” isn’t behavioural mimicry

Alignment isn’t just about corrigibility or control protocols. By value alignment, I mean:

  • Value compatibility in a pluralistic world
  • Metaethical clarity – e.g. where we land on realism vs anti-realism
  • Identifying the best of human values – or at least the most defensible
  • The difficulty of convergence on coherent and grounded value structures
  • Ideal observer models – approximating the values that would survive ideal reflection
  • Differential value development – prioritising the right values at the right time
  • Indirect normativity – using AI to help us discover better value structures than we currently have3

These questions are deeply philosophical, yes – but they’re also strategically central. Without them, we risk building ever-more-powerful optimisation systems with a blank moral compass.
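
To make the difficulty of convergence concrete, here’s a deliberately toy sketch – my own illustration, not anything from the BlueDot report, with hypothetical group names and values: three stakeholder groups each rank three candidate values, and the most naive way of adjudicating between them (pairwise majority voting) produces a Condorcet cycle, leaving no coherent aggregate ordering to hand to an optimiser.

```python
from itertools import combinations

# Toy example: three hypothetical stakeholder groups rank three candidate values.
rankings = {
    "group_a": ["liberty", "welfare", "tradition"],
    "group_b": ["welfare", "tradition", "liberty"],
    "group_c": ["tradition", "liberty", "welfare"],
}

def majority_prefers(x, y):
    """Return True if a majority of groups rank value x above value y."""
    votes_for_x = sum(
        1 for order in rankings.values() if order.index(x) < order.index(y)
    )
    return votes_for_x > len(rankings) / 2

values = ["liberty", "welfare", "tradition"]
for x, y in combinations(values, 2):
    winner, loser = (x, y) if majority_prefers(x, y) else (y, x)
    print(f"majority prefers {winner} over {loser}")

# Output:
# majority prefers liberty over welfare
# majority prefers tradition over liberty
# majority prefers welfare over tradition
# -> a cycle: there is no coherent aggregate ranking to align to.
```

Real value aggregation is vastly harder than this, of course – but even the toy case shows why “just average everyone’s values” is not a plan.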

What is covered

Perhaps AI alignment should take metaethics more seriously – see my recent interview with David Enoch.

To be fair, the report does touch on alignment very briefly in the section on domestic safe actors – citing OpenAI’s “three pillars” (human feedback, AI-assisted feedback, automated alignment research) and referencing Adam Jones’ own “three-pillar” defence-in-depth model. But these are mostly technical alignment heuristics, and there’s no unpacking of deeper normative, moral, or metaethical challenges.

There’s no mention of:

  • How we determine what values should be aligned to
  • How we adjudicate between conflicting human values in a pluralistic society
  • How alignment should adapt over time as societies (or intelligences) evolve

And no recognition that solving “alignment” in any non-trivial sense requires clarity on what “good” even is.

Why this matters

A strategy for AI safety without deep value alignment is like building a rocket without a guidance system. We might launch safely – but we don’t know where we’re going, or whether we’ll like it once we get there.

The danger of reaching the wrong destination is a value failure – success on the wrong terms. It deserves its own category: value risk (V-risk)4, recognised alongside existential and misuse risks.

This isn’t nitpicking. If we focus only on who builds AGI and when, but not what it should optimise for, we leave a vacuum at the centre. One that could be filled by arbitrary goals, unintended instrumental drives, or worse – whatever is easiest to specify, not what is actually worth pursuing.
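
The “easiest to specify” failure mode is simple to caricature in code. The sketch below is purely illustrative – the metric names are hypothetical and not from the report: an optimiser handed a convenient proxy for value will happily pick candidates that score badly on what we actually cared about.

```python
import random

random.seed(0)

def true_value(policy):
    # What we actually care about, but never wrote down precisely.
    return policy["usefulness"] - 2.0 * policy["addictiveness"]

def proxy_score(policy):
    # What was easiest to specify and measure.
    return policy["usefulness"] + policy["addictiveness"]

# A pool of candidate policies with random characteristics.
candidates = [
    {"usefulness": random.uniform(0, 1), "addictiveness": random.uniform(0, 1)}
    for _ in range(1000)
]

best_by_proxy = max(candidates, key=proxy_score)
best_by_value = max(candidates, key=true_value)

print("true value of proxy-chosen policy:", round(true_value(best_by_proxy), 2))  # typically low or negative
print("true value of best policy        :", round(true_value(best_by_value), 2))  # what we actually wanted
```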

A constructive challenge

So, my question to the authors and the wider community is:

Which of the strategies outlined – if any – meaningfully engage with the real problem of value alignment?
And if none do: how should we revise the roadmap?

This isn’t about tearing down the work – it’s about extending it. If we truly want to win the race safely, we need to ensure we’re not just steering clear of cliffs… but steering toward something actually worth reaching.

  1. Key paths, plans and strategies to AI safety success – https://bluedot.org/blog/ai-safety-paths-plans-and-strategies ↩︎
  2. Adam Jones’ post reads: “I spent 100+ hours analyzing every major AI safety plan. Here’s what I found.

    After interviewing 10+ AI safety researchers and reading 50+ “plans” I think there are 7 distinct strategies that keep emerging:

    International:
    🌍 International safe AI project (think CERN for AI)
    🛡️ Enforced moratorium on dangerous AI (IAEA-style treaties)
    ⚔️ Sabotage to prevent dangerous AI

    Domestic:
    🏆 Win the race safely
    🛠️ Building societal defences

    Meta:
    🎯 Combine point solutions
    🤝 Combine high-level strategies

    I think the most promising path is likely combining strategies:
    – Domestic actors trying to win the race safely…
    – which get nationalised and later contribute to an international safe AI project…
    – whose members also set up a nonproliferation IAEA-like organisation to maintain a lead and avoid intense races…
    – alongside governments and startups building robust societal defences.

    Which strategy resonates most with you? Or any that I’ve missed?” – https://www.linkedin.com/feed/update/urn:li:activity:7341502120019247104/
    ↩︎
  3. See post on Indirect Normativity ↩︎
  4. See Value Risk (V-Risk) and Motivation Risk; for further discussion, see Understanding V-Risk: Navigating the Complex Landscape of Value in AI and A Critique on ‘Complex Value Systems are Required to Realize Valuable Futures’ ↩︎
