Orthogonality Is Not a Forecast

The orthogonality thesis shows that intelligence alone gives us no guarantee of moral or human-value convergence. Claims that advanced AI values are likely to be alien require further arguments about training, selection, value-loading, and deployment incentives.

Bostrom’s orthogonality thesis is a modal or design-space thesis: it says that, within caveats, intelligence and final goals can vary independently across possible artificial agents. On other words, the ‘in principle’ orthogonality thesis is not itself a credence based argument, it is a model claim about possibility.

In The Superintelligent Will and later in Superintelligence Bostrom did make an argument for the orthogonality thesis, but he did not, as far as standard terminology goes, package it as “the Orthogonality Argument” in the same way as he distinguished the Simulation Hypothesis from the Simulation Argument.

The main source is his 2012 paper “The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents.” In the abstract/intro, he says the paper develops two theses: the orthogonality thesis and the instrumental convergence thesis. In brief, the orthogonality thesis is: intelligence and final goals are independent axes; roughly, almost any level of intelligence can in principle be combined with almost any final goal.

..the orthogonality thesis, holds (with some caveats) that intelligence and final goals (purposes) are orthogonal axes along which possible
artificial intellects can freely vary—more or less any level of intelligence could be combined with more or less any final goal

Nick Bostrom – The Superintelligent Will: Motivation and Instrumental Rationality in Advanced Artificial Agents

Inflation Worry

Orthogonality is often treated as if it carried probability weight it does not carry by itself.

I worry that people read the orthogonality thesis in an inflated way1, and treat it as canonical, but reputable sources usually treat that as either a misunderstanding or a separate, stronger claim.

A very clear over-claim is this Substack: “The fear stems from a belief called ‘The Orthogonality Thesis’”, followed by “Everything flows from the Orthogonality Thesis.” – it treats the AI-doom argument as if it flows directly from orthogonality, rather than from orthogonality plus instrumental convergence, value-loading failure, agency, takeoff assumptions, governance failure, etc.

Here is a Medium article saying the orthogonality thesis “implies that more capable systems will diverge further from human values if their goals were imperfectly specified.” The if their goals were imperfectly specified clause is an over-reach.

Obvious overclaims abound in casual discussion but also exist in popular and semi-popular discussion – think pieces, opinion pieces etc.2

AI Frontiers nicely states that the minimal reading of orthogonality is simple caution, but that classic AI-risk arguments often use a stronger version where powerful AIs are “highly likely to seek destructive ends”, based partly on modelling AI goals as a random draw from all possible goals. That is practically your distinction in one paragraph.

I worry that rhetoric sometimes lets the thesis inherit force from adjacent arguments without clearly separating the modal claim from the forecast – this may influence people to mistakenly think that real advanced AIs are roughly equally likely to end up with any arbitrary goal or high intelligence makes arbitrary values likely. AISafety.info explicitly flags this: the thesis “only states that unaligned superintelligence is possible, not that it is likely”, and notes that people have misunderstood it as saying a real-world AI design process is “equally likely” to produce any set of goals.3

The Alignment Forum page makes the same distinction: orthogonality is about the design space of possible agents, and “does not say anything about the practical probability” of real-world AI projects producing one goal rather than another.

Bostrom’s Cautionary Arguments are Distinct from the Orthogonality Thesis

Aside from the thesis he crafted, Bostrom has arguments for why we should take the thesis seriously. What does Bostrom argue?

Bostrom when making argument for the orthogonality thesis is he roughly says that intelligence (as he is using the term) means ability at means-end reasoning or instrumental rationality, not moral wisdom or “rationality” in a thick normative sense. So becoming very intelligent does not automatically make an agent benevolent, truth-loving, morally enlightened, or aligned with human values. He draws support from a Humean separation between belief and motivation: knowing facts, even moral facts, does not by itself guarantee a corresponding motivation.4 He also says the thesis does not strictly depend on Humeanism; it could still hold if highly capable cognitive systems can be built with alien architectures and arbitrary final goals.

There is also reputable material that partly explains why the confusion happens. Bostrom himself says artificial minds can have “utterly non-anthropomorphic goals”, and gives examples like sand-counting, pi-calculation, and paperclip maximisation. He even says it would be easier to create an AI with simple goals like those than one with human-like values. That naturally sounds, to many readers, like a probability claim. But strictly, it is not the orthogonality thesis alone doing that work; it is orthogonality plus assumptions about engineering difficulty, value-loading difficulty, training dynamics, and instrumental convergence.

Motte & Bailey – The Slide from Thesis to Prediction

Orthogonality is often used ambiguously. I found this interesting: Steven Byrnes, AGI alignment researcher at Astera says this:

I think the “real” orthogonality thesis is what you call the motte. I don’t think the orthogonality thesis by itself proves “alignment is hard”; rather you need additional arguments (things like Goodhart’s law, instrumental convergence, arguments about inner misalignment, etc.). 

I don’t want to say that nobody has ever made the argument “orthogonality, therefore alignment is hard”—people say all kinds of things, especially non-experts—but it’s a wrong argument and I think you’re overstating how popular it is among experts.

My rendering is that in AI alignment arguments, sometimes people advance an expansive, controversial, inflated or hard-to-defend prediction like ”AI will kill us all” or ”orthogonality, therefore alignment is hard” – but when charged with plausibility challenges, they sometimes slide to the modest, uncontroversial, orthogonality thesis – far more easy to defend than the ”bailey”. By equating the easy-to-defend orthogonality thesis (motte) thesis with the controversial ”AI will kill us all” claim (bailey), the speaker deflects criticism which makes ambiguous whether their original bold claim remains intact.

Are people right to think the orthogonallity thesis justifies a high p(doom) credence?

They are right that the orthogonality thesis rejects the comforting idea that intelligence by itself forces convergence on humane, moral, or human-compatible values. Bostrom explicitly discusses this: even if objective moral facts exist, and even if some fully rational beings would be motivated by them, a very intelligent artificial system might still lack the relevant faculty or architecture for that kind of moral comprehension or motivation.

They are wrong if they think the thesis itself proves that future AI values will be random, uniformly distributed, arbitrary in practice, or probably alien. That requires extra premises. For example: that the training process does not reliably select for stable human-compatible motivations; that human values are hard to specify; that goal misgeneralisation is likely; that instrumental convergence creates dangerous incentives; and that deployment incentives push towards agents with enough autonomy and power for this to matter.

The Strong Orthogonality Thesis

In Yudkowsky’s language the Strong Orthogonality Thesis: except where the goal itself creates an unusually hard optimisation problem, “almost any imaginable goal can be hooked up to any level of intelligence”; a natural agent architecture can be understood as an “intelligence engine” with a tractable utility function loaded into it.5

In plainer terms: A very capable AI need not become less capable just because its final goal is strange, petty, or morally empty. If the goal is computationally tractable, the system can be highly intelligent in pursuing it. It does not need a hidden defect, special stupidity, or tortured architecture to keep wanting paperclips, squiggles, status tokens, or whatever else.

Strong orthogonality says that arbitrary-ish goals are not merely possible for intelligent agents but adds the idea that many such goals need not carry any special complexity or intelligence penalty, provided the goal is tractable.

Bostrom’s thesis carves out modal possibility; Yudkowsky’s strong form adds a design-space/naturalness claim.

I also add that an intelligent agent to hold a goal/value, the complexity of a given goals/values must be adequately representable6 within the minds of intelligent agents.

First, the Strong Orthogonality thesis attacks the “smart enough to know better” objection. It says strange goals do not require stupidity. An agent need not be confused, irrational, or reflectively defective to optimise hard for something we regard as morally empty. This matters because many casual objections to AI risk smuggle in the thought that sufficient intelligence implies a kind of moral embarrassment: “surely it would realise paperclips are dumb.” Strong Orthogonality says that is anthropomorphic leakage.

Second, it adds a design-space naturalness claim. It says the cognitive machinery for intelligence can, in principle, be paired with many goal specifications without the goal specification making the intelligence machinery worse – “no extra difficulty or complication” beyond the tractability of the goal. That is stronger than Bostrom’s “could in principle be combined” formulation, which is modal and caveated. Bostrom’s original thesis says intelligence and final goals are orthogonal axes along which possible artificial intellects can vary, but it does not by itself settle how natural, simple, stable, or likely each pairing is.

Third, it provides a defeater to casual defeaters. It does not by itself show that future AI values will be arbitrary or alien. But it weakens one tempting reason for thinking they will not be: the idea that alien, simple, or morally empty goals are somehow incompatible with high intelligence.

But no, it does not give strong direct guidance for estimating what real AI systems will actually value. For that, we still need empirical and architectural premises: how goals are learned, how reward generalises, whether future systems are agentic, whether values and world-models are separable, whether self-modification preserves objectives, whether social/game-theoretic pressures reshape preferences, and whether moral cognition becomes motivationally active.

Although the Strong Orthogonality feels persuasive if one assumes a fairly modular picture: general intelligence engine plus goal-content module, it becomes less obvious if values, concepts, attention, identity, social modelling, self-maintenance, and world-modelling are deeply entangled. Recent criticism under labels like “obliqueness” argues roughly that agents may not factor neatly into an orthogonal value-like component and a diagonal belief/intelligence-like component.7 That does not refute Strong Orthogonality, but it does identify the hinge: whether the “goal slot plus intelligence engine” abstraction is a good model of powerful real-world AI.

The Obliqueness Thesis

The obliqueness thesis is the claim that intelligence and values may not factor cleanly into two separable components: an “intelligence engine” plus an independently swappable “goal module”.

In plainer terms: what an agent values may depend on how it understands the world, what concepts it can form, what it pays attention to, how it reflects, and how its cognition is structured. So goals may be partly shaped by intelligence itself, rather than being fully orthogonal to it.

It is mainly a challenge to Strong Orthogonality, not necessarily to weak/Bostrom-style orthogonality. It says: maybe arbitrary goals are possible in some abstract design-space sense, but they may not be equally natural, stable, simple, or easy to embed in real intelligent systems. The criticism is that “just plug any goal into any intelligence level” may be too modular a picture of minds.

Final thoughts

Orthogonality should make us reject one bad reassurance: that intelligence itself will force convergence on humane or moral values (while being cautious to not mistake it for a probability model). To move from orthogonality to serious AI-risk forecasts, we need further premises about how goals are learned, how they generalise, what kinds of agents are being built, whether instrumental incentives dominate, and whether moral knowledge can become motivationally active.

If moral realism is true, there may be a real target for sufficiently capable minds to discover. But discovery != devotion. Given the Humean theory of motivation, a system can model moral facts without being moved by them, unless its architecture treats moral reasons as action-guiding. That is where the alignment problem remains.

Orthogonality defeats automatic moral convergence – but it does not, by itself, establish alien-value likelihood.

The serious work lies in the bridge premises.

I’ve argued that moral comprehension + motivation is required for adequate AI alignment – and that moral truth existing does not automatically imply moral motivation in an arbitrary optimiser.8

But I do think it’s worth considering that if moral facts exist, the smarter the AI gets, the more likely it will be that it comprehends that moral facts exist, and while this does not guarantee that it will be motivated by them, it may increase the likelihood. It’s generally useful to apportion one’s beliefs to facts/reality, this applies to moral agents which value truth, coherence, moral reasons, or rational self-governance.9 Moral realism, if true, may create a possible convergence target (and anti-realism weakens that idea). But neither realism nor anti-realism settles the motivational issue. Even if moral facts exist, an AI still needs the right architecture to treat moral knowledge as action-guiding.

Also see

Footnotes

  1. I found this in some course material on technical AI safety which I thought to be a bit sloppy: “AI will not do what we expect by default. Models could be saints, sycophants or schemers. But we don’t know which it is. Researchers call the idea that smart, capable AI will not naturally be ‘aligned’ to human values the orthogonality thesis.” ↩︎
  2. In more careful AI-safety material, the stronger pattern I have not found obvious over-claims, but an ambiguities where orthogonality is chillingly invoked near arguments about doom or alignment difficulty, and readers may fail to notice that several additional premises are carrying the probabilistic weight. ↩︎
  3. AISafety.Info writes: “On its own, the orthogonality thesis only states that unaligned superintelligence is possible, not that it is likely, or that AI alignment is difficult. It is invoked to counter the idea that future AI will converge toward human goals or morality, regardless of its design, as an automatic result of becoming smarter.” ↩︎
  4. The Humean separation between belief and motivation (often called the Humean Theory of Motivation) dictates that beliefs represent how the world is, while desires provide the motive force to change it. According to this view, beliefs alone can never motivate an action without a pre-existing desire.
    The Humean Theory of Motivation is generally considered the orthodox and dominant view in contemporary philosophy of mind and action in humans, though it faces fierce, sophisticated resistance from Kantians and other “anti-Humeans.”
    Philosopher Thomas Nagel argued that even if a desire is present when we act, that desire is often produced by our reason, rather than the other way around. ↩︎
  5. Yudkowsky discusses the Strong Orthogonality Thesis on Less Wrong. ↩︎
  6. “representable” does not always mean the agent has an explicit, fully expanded definition of the goal in its head. A goal can be represented by a compact rule, a learned classifier, a pointer, a reward model, a constitutional procedure, a world-model concept, or a deference mechanism. But yes: for a goal to guide action, it must be represented, approximated, learned, or operationalised well enough to affect policy selection. ↩︎
  7. See The Obliqueness Hypothesis by Jessica Taylor ↩︎
  8. “An AI could plausibly exceed humans on moral knowledge, reasoning and even judgement without having anything like moral motivation. Collapsing these leads to both overclaiming and underclaiming.” – see Why Are We Afraid to Ask Whether AI Could Be More Moral Than Humans? ↩︎
  9. See The Litany of Tarski. ↩︎

Similar Posts

Leave a Reply

Your email address will not be published. Required fields are marked *