Superalignment and Indirect Normativity


In 2023, OpenAI introduced its idea of superalignment, which in some ways resembles Paul Christiano's Iterated Distillation and Amplification. In OpenAI's words:
“Our goal is to build a roughly human-level automated alignment researcher. We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.”

MIRI (Eliezer Yudkowsky and Nate Soares) do not expect superalignment to work, and for the same reasons they would probably not expect indirect normativity to work either.

In their book ‘If Anyone Builds It, Everyone Dies’, they write:

“In the case of weak superalignment: We agree that a relatively unintelligent AI could help with “interpretability research,” as it’s called. But learning to read some of an AI’s mind is not a plan for aligning it, any more than learning what’s going on inside atoms is a plan for making a nuclear reactor that doesn’t melt down.
We consider interpretability researchers to be heroes, and do not mean to degrade their work when we say: It’s not a good sign, when you ask an engineer what their safety plan is, and they start telling you about their plans to build the tools that will give them a better window into what the heck is going on inside the device they’re trying to control.
And even if the tools existed, being able to see problems is not the same as being able to fix them. The ability to read some of an AI’s thoughts, and see that it’s plotting to escape, is not the same as the ability to make a new AI that doesn’t want to escape. That might not be possible without a full solution to the alignment problem: Insofar as the AI has weird alien preferences, escape is in fact the course of action that best fulfills its objectives. Attempts to escape are not a weird personality quirk that an engineer could rip out if only they could see what was going on inside; they’re generated by the same dispositions and capabilities that the AI uses to reason, to uncover truths about the world, to succeed in its pursuits.”
