Causal Incentives and Safe AGI – Tom Everitt
This talk is part of the 'Stepping Into the Future' conference.
Synopsis: Along with the many benefits of powerful machine learning methods come significant challenges. For example, as content recommendation algorithms become increasingly competent at satisfying user preferences, they may also become more competent at manipulating those preferences to make them easier to satisfy. These kinds of alignment problems will only become more severe as we approach AGI, unless we find a way to address them.
In this talk, I’ll present an approach to AGI alignment based on causal influence diagrams. These extend Pearl’s causal graphs with special decision and utility nodes, and provide a high-level representation of an agent’s environment, objectives, decisions, information, and means of influence, in a form that is both precise and visually intuitive.
The approach enables us to
* pinpoint key differences between alignment proposals such as recursive reward modeling, debate, and cooperative inverse reinforcement learning,
* clarify subtle incentive issues for both the safety and fairness of machine learning systems,
* understand when agents can grasp ethical concepts like intent, control, harm, deception, …
* design agents that don't let the ends justify the means.
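To make the idea of a causal influence diagram concrete, here is a minimal sketch (not code from the talk) that encodes the recommender example as a directed graph with chance, decision, and utility nodes. It uses only networkx; all node names are illustrative assumptions.

```python
# A toy causal influence diagram (CID) for the recommender example,
# sketched with plain networkx. Structure and names are illustrative only.
import networkx as nx

cid = nx.DiGraph()

# Chance nodes: the user's original preferences and their (possibly shifted)
# preferences at the time they decide whether to click.
cid.add_node("original_preferences", kind="chance")
cid.add_node("influenced_preferences", kind="chance")

# Decision node: which content the recommender shows.
cid.add_node("recommendation", kind="decision")

# Utility node: clicks, the quantity the agent is trained to maximise.
cid.add_node("clicks", kind="utility")

cid.add_edges_from([
    ("original_preferences", "recommendation"),        # information link into the decision
    ("original_preferences", "influenced_preferences"),
    ("recommendation", "influenced_preferences"),       # the decision can shape the
    ("influenced_preferences", "clicks"),                # very preferences that feed
    ("recommendation", "clicks"),                        # into its own utility
])

def influence_targets(graph: nx.DiGraph, decision: str) -> set[str]:
    """Nodes the decision can causally affect (its descendants in the graph)."""
    return nx.descendants(graph, decision)

if __name__ == "__main__":
    # The decision reaches the utility both directly and via the user's
    # preferences; that second path is the structural signature of the
    # preference-manipulation incentive described in the synopsis.
    print(influence_targets(cid, "recommendation"))
```

A plain graph keeps the sketch dependency-light; the causalincentives group also maintains a dedicated library, pycid, for building and analysing these diagrams.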
Bio: Tom Everitt is a researcher at DeepMind.
He works on AGI safety, i.e., how we can safely build and use highly intelligent AI.
Tom's PhD thesis, Towards Safe Artificial General Intelligence, was the first PhD thesis specifically devoted to this topic.
It was supervised by Marcus Hutter at the Australian National University.
See more at Tom’s webpage.