
J. Dmitri Gallow – AI Interpretability, Orthogonality, Instrumental Convergence & Divergence

Interview conducted at the University of Melbourne during the EAGx Melbourne 2023 conference, Sunday 24 September 2023.

J. Dmitri Gallow, a philosopher working on causation, decision theory, and metaphysical questions in the philosophy of science, has been involved in AI safety research since 2023. He authored the paper "Instrumental Divergence".

Instrumental Divergence

Abstract of "Instrumental Divergence": The thesis of instrumental convergence holds that a wide range of ends have common means: for instance, self-preservation, desire preservation, self-improvement, and resource acquisition. Bostrom contends that instrumental convergence gives us reason to think that “the default outcome of the creation of machine superintelligence is existential catastrophe”. I use the tools of decision theory to investigate whether this thesis is true. I find that, even if intrinsic desires are randomly selected, instrumental rationality induces biases towards certain kinds of choices. Firstly, a bias towards choices which leave less up to chance. Secondly, a bias towards desire preservation, in line with Bostrom’s conjecture. And thirdly, a bias towards choices which afford more choices later on. I do not find biases towards any other of the convergent instrumental means on Bostrom’s list. I conclude that the biases induced by instrumental rationality at best weakly support Bostrom’s conclusion that machine superintelligence is likely to lead to existential catastrophe.
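The third bias in the abstract, a bias towards choices which afford more choices later on, can be illustrated with a toy decision-theoretic simulation. This sketch is not from Gallow's paper; the setup, function name, and parameters are my own. An agent with randomly sampled utilities compares a "narrow" option (committing now to one outcome) against a "wide" option (keeping k outcomes open and picking the best later):

```python
import random

def prefers_more_options(k=3, trials=10_000, seed=0):
    """Estimate how often a randomly-desiring agent prefers the
    option that keeps more choices open (toy illustration only)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        # Randomly sampled "intrinsic desires": an i.i.d. uniform
        # utility for each terminal outcome.
        narrow = rng.random()                       # commit to one outcome now
        wide = max(rng.random() for _ in range(k))  # keep k options, pick best later
        if wide > narrow:
            wins += 1
    return wins / trials

print(prefers_more_options(k=3))
```

With i.i.d. uniform utilities, the wide option wins with probability k/(k+1), so for k=3 the agent favours the option-preserving choice roughly 75% of the time even though its desires were drawn at random. This is an illustration of the bias, not a reconstruction of the paper's formal argument.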

Interview summary

Transition to AI Safety: Gallow’s interest in AI safety was piqued by the advancements in AI and large language models. His conversation with Dave Chalmers led him to join the Center for AI Safety, seeking to understand the safety concerns in AI.

Importance of AI Safety: He emphasizes the need for caution with new, powerful technologies like AI. He highlights the potential catastrophic risks associated with AI, advocating for careful and cautious progression in AI development.

Interpretability: Gallow points out the importance of understanding AI’s internal functioning for safety purposes. He shares an anecdote about a large language model (likely GPT-4), which produced unexpected outputs due to obscure training data. This unpredictability underscores the necessity of interpretability in AI.

Instrumental Convergence and Divergence: He discusses the principles of instrumental convergence and divergence in AI. Two concepts are central here: the orthogonality thesis, which holds that an agent’s intelligence and its desires are independent, and the instrumental convergence thesis, which holds that intelligent agents with a wide range of final desires will share similar instrumental desires. Gallow’s argument focuses on instrumental divergence: the means an AI adopts can vary widely and unpredictably with its desires and its environment.

Power-Seeking Behavior in AI: Addressing the topic of AI seeking power, Gallow suggests that the concept of ‘power’ is difficult to define in the context of AI. He questions the prevalent assumptions about AI’s power-seeking behavior and advocates for a more empirical approach to understanding AI actions.

Empirical Approach to AI Behavior: Gallow proposes an empirical approach to studying AI behavior rather than relying on a priori arguments. He mentions the need to understand AI’s behavior in specific environments and the importance of interpretability in managing AI risks effectively.

Final Thoughts on AI Safety: Gallow concludes that the most effective way to mitigate risks from AI is to improve our understanding of AI’s ‘mind’ through interpretability. This approach will enable better preparation and response to AI’s deployment in real-world applications.
