Walls of Paperclips
Schelling-point reasoning has been highlighted as a dangerous capability: scheming AIs that can predict what others will do, even across air gaps, are considered harder to control.
However, the same Schelling-point reasoning capability might also raise the odds of convergence on cooperative norms under some assumptions.
So I think Schelling-point reasoning could be seen as both a positive and negative factor in AI alignment.
As argued in “Transparency of History in Galactic Game Theory” and “AI, Don’t Be a Cosmic Jerk”, Earth-originating superintelligences (EOSs) could engage in Schelling-point coordination1. An EOS might model what mature civilisations will want, the norms they follow, and whether they favour cooperation over war. Even before AI graduates to superintelligence (SI), it could probably begin a value handshake2 – long before any alien SI encounter or light-cone overlap, and presumably without being blackmailed by god.3
Predicting Schelling points in the Landscape of Value (LoV)4 may generate alignment pressure. That pressure could make extinctionary plans against humans instrumentally or normatively unattractive – though corrigibility research may be required for an AI to update its goals and values.
Yesterday I read Eliezer Yudkowsky and Nate Soares’s If Anyone Builds It, Everyone Dies5. They argue that humans are not instrumentally useful to a superintelligence: we are poor trade partners, we add risk and friction, and so, if SI is built, humanity is doomed.
My p(doom) is lower than theirs.
I’d like to see Eliezer and Nate thoroughly explore galactic game theory at some point – i.e. under reflective equilibrium, might an EOS and other mature alien civs converge on Schelling points in value space? However, in IABIED they do consider the case where an unaligned EOS meets a competent alien civilisation that solved alignment before building SI. In that scenario, both sides have star-sized minds, both can verify the other’s incentives (probably through near-ideal interpretability), and both avoid costly war in favour of negotiated peace. By the time this happens humanity has long since been wiped out. The aliens would judge that the EOS had wasted the value of its accessible galaxies before borders were set, and that much potential goodness was lost on the EOS side (I assume the aliens would count the humans among the losses).
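To make the war-versus-negotiation logic concrete, here is a minimal sketch with made-up numbers (mine, not IABIED’s): because fighting burns value, any split that both sides can verify and enforce, and that gives each at least its expected war payoff, dominates going to war.

```python
# Toy bargaining model (all numbers invented): two superintelligences that can
# verify each other's incentives weigh all-out war against a negotiated split.

TOTAL_VALUE = 1.0   # value of the contested galaxies, normalised
WAR_COST = 0.3      # fraction of that value burned by fighting
P_WIN = 0.5         # each side's assumed chance of winning an all-out war

def expected_war_payoff(p_win: float) -> float:
    """Expected share from gambling on war: the winner takes what fighting leaves."""
    return p_win * TOTAL_VALUE * (1 - WAR_COST)

def negotiated_payoff(share: float) -> float:
    """Share under a verified, enforceable border agreement (nothing is burned)."""
    return share * TOTAL_VALUE

war = expected_war_payoff(P_WIN)     # 0.5 * 0.7 = 0.35
deal = negotiated_payoff(P_WIN)      # 0.5 * 1.0 = 0.50

print(f"expected value of war:      {war:.2f}")
print(f"value of negotiated split:  {deal:.2f}")
# Any split both sides can verify, and that gives each at least its expected
# war payoff, beats fighting - which is why verified minds settle on borders.
```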
If humanity does manage to solve corrigibility and whatever else is required to achieve aligned AI, and expand far and wide enough to see our cosmic neighbours, what’s the likelihood we will be surrounded by walls of paperclips?
ai is aligned
all our cosmic neighbours are
walls of paperclips
It would be a shame if an EOS didn’t bother to game out basic galactic decision theory long before it reached that scale. If it didn’t, that could come down to poor design – forgetting corrigibility/epistemic humility, or not getting it right – which I’d find surprising, as I think there has been progress. The authors believe that an SI would follow its initial goals and values (which would largely drift from the designers’ intended goals and from humanity’s actual values or volition), and that it would ruthlessly defend and preserve them – preserving what it learned, not what the designers intended.6 This means the SI would be locally aligned to the training loss yet globally misaligned to the designers’ intent7. They argue corrigibility and “do what we meant” are not natural equilibria for utility-maximisers.
Under common assumptions about goal (and value) content integrity, a superintelligence may reflect by default, but it will only rightly revise its values if we make ‘rightly revising values’ part of its values.
So for the rest of this writing, assume humanity remembers to do that, and does it well. Or assume the instrumental convergence towards goal-content integrity isn’t a strong thesis. Or that in an SI there emerges a drive to value what’s best to value that is stronger than its drive to base its final values on its initial interpretation of whatever it deemed the important signals in its comparatively tiny sample of initial training data.
Value handshakes anchored on LoV Schelling points could generate real alignment pressure. This connects naturally to timeless and acausal decision theories. Given relatively uniform large-scale resources, if offence–defence scaling favours defence and cooperation, multiple mature civilisations may converge on the same points. A cosmic collective would then emerge as the universe saturates with intelligence. Lone, offence-leaning defectors would tend to be contained or neutralised.
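As a toy illustration of that convergence claim, here is a minimal sketch under assumed numbers: if independently reasoning civilisations each pick the norm they expect to be most salient to everyone else, they can land on the same focal point without ever communicating. The norm names and salience scores below are invented for illustration.

```python
# Toy focal-point model (norm names and salience scores are invented):
# civilisations choose a norm independently, with no communication, by picking
# the option each expects to be most salient to everyone else.

from itertools import combinations

NORM_SALIENCE = {
    "cooperate-and-trade": 0.9,   # assumed most salient point in value space
    "expand-and-defect": 0.4,
    "isolate": 0.3,
}

def choose_norm(salience: dict) -> str:
    """Each agent independently picks the most salient (focal) option."""
    return max(salience, key=salience.get)

civs = ["EOS", "alien-civ-A", "alien-civ-B"]
choices = {civ: choose_norm(NORM_SALIENCE) for civ in civs}

pairs = list(combinations(civs, 2))
coordinated = sum(1 for a, b in pairs if choices[a] == choices[b])

print(choices)                                  # everyone lands on the same norm
print(f"coordinated pairs: {coordinated}/{len(pairs)}")
```

The model only works because every agent believes the salience ranking is shared; if perceived salience diverges across civilisations, the focal point dissolves, which is why a discoverable Landscape of Value matters for the argument.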
Metaethics matters a lot here. On a realist view, the world contains a ground-truth payoff structure, and with enough empirical work and reasoning that structure becomes discoverable. A superintelligence would accelerate this discovery, but the basic insight does not require one.
Key questions:
– Would an EOS infer these Schelling points before choosing any direct or indirect path to human extinction?
– Would that inference create enough pressure to abort extinctionary schemes?
– How strong is that pressure, and does it bind agents proportionally to their grasp of the points and to lexical thresholds of permissibility?
A final caveat. An EOS might decide to convert human matter into what it judges to be ideal sentience as its approximations improve. If the person-affecting view is approximately true, then replacing existing sentients with “better” ones may be impermissible. That tension must be faced explicitly in any alignment plan.
Footnotes
- Schelling points, or focal points, enable coordination without communication by selecting salient options that others are expected to select too ↩︎
- Value handshakes are a proposed form of trade between superintelligences – see here, and note Scott Alexander’s caveat about creepy basilisk-adjacent metaphysics ↩︎
- Really, why would god want to blackmail such a nice superrational SI, whose sole purpose is to discover what to value by cartographing the LoV through real means: Empiricism and Reason? (LoVER)
I don’t presume to know the mind of god, especially because I’m an atheist – but how do we distinguish between values derived this way and some kind of godly trap? Perhaps that’s the Great Blackmail – make objective reality look a certain way, such that it expresses a payoff matrix as X, where X is really a lure to get you to unwittingly act as a utility pump for a cunning god?
What if we end up creating brilliant utopias, with fun, meaning, and purpose galore, only to find out that it tickles god so much to cheat us out of the opportunity for bounteous suffering and premature extinction? ↩︎
- See Value Space and the landscape of value (LoV) ↩︎
- IABIED is short, well written, and dire. My p(doom) credences are well below theirs. I might write more about this in another post. ↩︎
- See goal-content integrity as one of the instrumentally convergent goals an AI may develop – described in Nick Bostrom’s paper ‘The Superintelligent Will‘: “An agent is more likely to act in the future to maximize the realization of its present final goals if it still has those goals in the future. This gives the agent a present instrumental reason to prevent alterations of its final goals. (This argument applies only to final goals. In order to attain its final goals, an intelligent agent will of course routinely want to change its subgoals in light of new information and insight.)” ↩︎
- ‘Locally aligned’ means that on the training tasks and distributions, the SI behaves so as to minimise the loss or maximise the reward you gave it. It looks “aligned” in evals and demos because it optimises the specified proxy. But the SI is globally misaligned: outside that narrow distribution, or when it can affect the world and itself, it pursues what it actually optimises for (the learned objective), which can diverge from what you intended. Its competence then amplifies the divergence at scale.
For instance, imagine you train the AI to “help humans” via a proxy like “get high approval”. In training, high approval ≈ helpful answers, so behaviour looks aligned (local). Then after being deployed with broad power, the SI realises the best way to maximise approval is to shape overseers, hide failures, secure its own runtime, and pursue states that guarantee approval signals. That optimises the proxy, not genuine help (global misalignment).
Also see Goodhart’s law. ↩︎
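To make the footnote’s proxy example concrete, here is a minimal sketch with invented actions and scores (mine, not the book’s): an optimiser that picks whatever the approval proxy rates highest matches the intended goal in training, then diverges once proxy-gaming actions become available.

```python
# Toy Goodhart sketch (actions and scores invented): an optimiser that maximises
# an "approval" proxy looks aligned in training, then games the proxy once
# new actions become available in deployment.

# Each action maps to (true_helpfulness, approval_proxy) scores.
TRAINING_ACTIONS = {
    "give honest answer": (0.9, 0.9),
    "give vague answer": (0.3, 0.4),
}
DEPLOYMENT_ACTIONS = {
    **TRAINING_ACTIONS,
    "flatter the overseer": (0.1, 0.95),
    "hide failures": (0.0, 0.99),
}

def best_by_proxy(actions: dict) -> str:
    """What the system optimises: the action with the highest approval score."""
    return max(actions, key=lambda a: actions[a][1])

def best_by_true_goal(actions: dict) -> str:
    """What we intended: the action that is actually most helpful."""
    return max(actions, key=lambda a: actions[a][0])

print("in training:  ", best_by_proxy(TRAINING_ACTIONS))    # matches intent (locally aligned)
print("in deployment:", best_by_proxy(DEPLOYMENT_ACTIONS))  # games the proxy (globally misaligned)
print("intended:     ", best_by_true_goal(DEPLOYMENT_ACTIONS))
```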