Superalignment and Indirect Normativity


In 2023, OpenAI introduced its idea of superalignment. In some ways it resembles Paul Christiano’s Iterated Distillation and Amplification (IDA):
“Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.”
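The loop behind that quote is easiest to see in a toy sketch. The Python below is a hypothetical illustration of IDA’s amplify-then-distill cycle; the `Model` class, the function names, and the numeric “skill” proxy are my own illustrative assumptions, not anyone’s actual implementation.

```python
# Toy sketch of an IDA-style loop (illustrative assumptions only).
from dataclasses import dataclass


@dataclass
class Model:
    """Stand-in for a trained model; `skill` is a toy proxy for capability."""
    skill: float


def amplify(model: Model, num_copies: int = 4) -> Model:
    """Amplification: an overseer coordinates several copies of the model
    to answer questions better than any single copy could (toy assumption)."""
    return Model(skill=model.skill * 1.5)


def distill(amplified: Model) -> Model:
    """Distillation: train a new, cheap model to imitate the amplified system,
    keeping most of the capability gain (toy assumption: small imitation loss)."""
    return Model(skill=amplified.skill * 0.9)


def ida(initial: Model, rounds: int = 5) -> Model:
    """Iterate amplify -> distill: spend compute each round to bootstrap a
    roughly human-level assistant into something stronger."""
    model = initial
    for i in range(rounds):
        model = distill(amplify(model))
        print(f"round {i + 1}: skill ~ {model.skill:.2f}")
    return model


if __name__ == "__main__":
    ida(Model(skill=1.0))
```

The point of the sketch is only the structure: capability is supposed to compound round by round while each distilled model stays cheap enough to oversee. Whether the alignment properties survive each round is exactly the contested part.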

But superalignment failed to materialise at OpenAI.

Words are cheap talk. They’re evidence of what someone wants you to believe, not (by themselves) evidence of what they’ll actually do. Perceived benevolence can be a power strategy. Sometimes benevolence-signalling is central; sometimes power comes from competence, leverage, monopoly position, fear, or sheer momentum.

More informative than words are costly signals, institutional constraints, and verifiable normative reasons.

  • OpenAI publicly promised 20% of secured compute over four years for its Superalignment effort.[1]
  • Multiple reports around May 2024 describe Superalignment as under-resourced and deprioritised, and later disbanded, with Jan Leike explicitly criticising the “shiny products” drift.[2]
  • There’s reporting that the 20% pledge was not fulfilled (at least as the team experienced it).[3]

Vague benevolence language is not a safety plan. Trust is not a control system – it’s a vibe.

MIRI (Eliezer Yudkowsky and Nate Soares) don’t think anything like superalignment will work, and, for the same reasons, probably don’t think indirect normativity will work.

In their book ‘If Anyone Builds It, Everyone Dies’ they say:

“In the case of weak superalignment: We agree that a relatively unintelligent AI could help with “interpretability research,” as it’s called. But learning to read some of an AI’s mind is not a plan for aligning it, any more than learning what’s going on inside atoms is a plan for making a nuclear reactor that doesn’t melt down.
We consider interpretability researchers to be heroes, and do not mean to degrade their work when we say: It’s not a good sign, when you ask an engineer what their safety plan is, and they start telling you about their plans to build the tools that will give them a better window into what the heck is going on inside the device they’re trying to control.
And even if the tools existed, being able to see problems is not the same as being able to fix them. The ability to read some of an AI’s thoughts, and see that it’s plotting to escape, is not the same as the ability to make a new AI that doesn’t want to escape. That might not be possible without a full solution to the alignment problem: Insofar as the AI has weird alien preferences, escape is in fact the course of action that best fulfills its objectives. Attempts to escape are not a weird personality quirk that an engineer could rip out if only they could see what was going on inside; they’re generated by the same dispositions and capabilities that the AI uses to reason, to uncover truths about the world, to succeed in its pursuits.”

While Yudkowsky is widely associated with popularising an early approach to indirect normativity through CEV (Coherent Extrapolated Volition), his more recent writing (IABIED and elsewhere) suggests he has become sceptical of this and similar “value learning” approaches. He has expressed doubts that Paul Christiano’s approach to indirect normativity is on the right track and, along with MIRI, argues that current methods, including those described as “superalignment” or iterated distillation, are insufficient to safely control a superintelligence. He now emphasises the need for a more mathematically rigorous form of control, believing the “landscape of value” is too sparse and dangerous for anything but a meticulously aligned system to navigate safely.

I diverge.

Footnotes

  1. “We need scientific and technical breakthroughs to steer and control AI systems much smarter than us. To solve this problem within four years, we’re starting a new team, co-led by Ilya Sutskever and Jan Leike, and dedicating 20% of the compute we’ve secured to date to this effort. We’re looking for excellent ML researchers and engineers to join us.” See Introducing Superalignment – Jul 5 2023.
  2. OpenAI dissolves team focused on long-term AI risks, less than one year after announcing it; OpenAI’s Superalignment Team Disbanded Amid Leadership Departures.
  3. “according to the sources, the team repeatedly saw its requests for access to graphics processing units, the specialized computer chips needed to train and run AI applications, turned down by OpenAI’s leadership, even though the team’s total compute budget never came close to the promised 20% threshold.” See OpenAI promised 20% of its computing power to combat the most dangerous kind of AI—but never delivered, sources say.
