The Incomplete Preferences Proposal (IPP)
attributed to: Elliott Thornley
The Incomplete Preferences Proposal (IPP) is a proposed solution to the shutdown problem: the problem of ensuring that artificial agents never resist our attempts to shut them down. The idea is to train agents that lack a preference between every pair of different-length trajectories (that is: every pair of trajectories in which shutdown occurs after different lengths of time). These agents won't pay costs to prevent or cause shutdown. The IPP includes a proposed reward function for training these agents. Since these agents won't resist shutdown, the risk that they overthrow humanity is approximately zero.
What part of the alignment problem does this plan aim to solve?
This plan aims to solve the shutdown problem.
Why has that part of the alignment problem been chosen?
Because the shutdown problem is tractable, and because a solution (if widely implemented) would push the risk of AI takeover down to approximately zero.
How does this plan aim to solve the problem?
By training agents to lack a preference between every pair of different-length trajectories, and thereby ensuring that these agents aren't willing to pay costs to prevent or cause shutdown.
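The core mechanism can be illustrated with a minimal sketch. This is not Thornley's actual reward function or training method; the `Trajectory` class and `prefers` function are hypothetical names introduced here to show why an agent with no preference between different-length trajectories won't pay costs to shift its shutdown time:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Trajectory:
    length: int    # timestep at which shutdown occurs
    reward: float  # value accrued over the trajectory

def prefers(a: Trajectory, b: Trajectory) -> bool:
    """Strict preference for an agent with incomplete preferences:
    trajectories are only ranked against same-length trajectories
    (within a length, more reward is preferred); different-length
    trajectories are incomparable, i.e. no preference either way."""
    return a.length == b.length and a.reward > b.reward

# Resisting shutdown means trading reward for a longer trajectory.
# Because the two trajectories differ in length, the agent never
# strictly prefers the costly option, so it won't pay the cost.
accept = Trajectory(length=5, reward=10.0)
resist = Trajectory(length=9, reward=10.0 - 2.0)  # pays 2.0 to delay shutdown
assert not prefers(resist, accept)  # won't pay costs to prevent shutdown
assert not prefers(accept, resist)  # won't pay costs to cause shutdown
```

Note that the agent still has preferences within a given trajectory length, so it can pursue its goals normally conditional on any particular shutdown time.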
What evidence is there that the methods will work?
See the proposal: section 11 argues that agents trained in line with the IPP won't pay costs to prevent or cause shutdown, and section 19 argues that the IPP largely circumvents the problems of reward misspecification, goal misgeneralization, and deceptive alignment.
We're also working on an experiment to test the proposed reward function in some gridworld environments.
What are the most likely causes of this not working?
See the proposal. Section 21 lists some issues still to address, including multi-agent dynamics, maintaining the shutdown button, creating corrigible subagents, and 'managing the news'.
Vulnerabilities & Strengths