https://www.lesswrong.com/posts/AqsjZwxHNqH64C2b6/let-s-see-you-write-that-corrigibility-tag

Principles which counteract instrumentally convergent goals

1. Disutility from resource acquisition - e.g. measured by some mutual information between the AI and distant parts of the environment (toy sketch below).
2. Task uncertainty with a reasonable prior on goal drift - the system is unsure about the task it is trying to do and seeks human input about it.
3. An AI which ultimately wants to not exist in the future as a terminal goal; fulfilling the task is on the simplest trajectory to non-existence.

Principles which counteract unbounded rationality

4. Information-theoretic penalty on reasoning: any bit-flip in the system is assumed to have some utility cost. The cost function could be very small initially but get steeper fast (toy sketch below).
5. Satisficing (already mentioned; toy sketch below).
6. Myopia (already mentioned).

Traps

7. Tripwire artifacts. Tampering with certain parts of the environment that are unknown to the system and unrelated to its task is associated with large negative rewards.
8. External watchdogs. Smaller and faster external systems able to react quickly to out-of-distribution behaviour (toy sketch below).
9. Ontological uncertainty about the level of simulation.

Oversight

10. Human-approval model based on imitation learning, sped up/amplified.
11. Human-values ethics model, based on value learning.
12. Legal-system-amplified model of negative limits, such as against violating property rights or similar.
13. Red-teaming of action plans, AI-debate style, feeding into the previous models.

Interpretability

14. Imposing strong incentives on internal modularity, and interpretable messaging across module boundaries (toy sketch below).
15. Human-level explanations, produced by an independent "translator" system.
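
A minimal sketch of how principle 1 might be operationalized, assuming we can sample paired traces of the agent's internal state and features of distant parts of the environment; the plug-in estimator, binning, and penalty weight are illustrative assumptions, not part of the original proposal.

```python
# Toy sketch of principle 1: subtract a disutility term proportional to an
# empirical mutual-information estimate between the agent and distant parts
# of the environment. Binning, features, and weight are illustrative only.
import numpy as np

def plugin_mutual_information(x, y, bins=16):
    """Plug-in MI estimate (in nats) from paired 1-D samples x, y."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))

def penalized_return(task_return, agent_trace, distant_env_trace, weight=10.0):
    """Task return minus a disutility for correlation with the distant environment."""
    mi = plugin_mutual_information(agent_trace, distant_env_trace)
    return task_return - weight * mi
```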
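
For principle 4, a toy cost curve on computation that is nearly flat below a soft budget and steepens quickly above it; the exponential form, the constants, and the idea of measuring reasoning in "bits flipped" as a single scalar are assumptions for illustration.

```python
# Toy sketch of principle 4: charge a utility cost per bit flipped during
# reasoning, with a cost curve that is ~linear below a soft budget and grows
# exponentially beyond it. Constants are illustrative only.
import math

def compute_cost(bits_flipped: float,
                 base_cost: float = 1e-9,
                 soft_budget: float = 1e12) -> float:
    """Utility cost of reasoning: ~linear below soft_budget, steep above it."""
    linear_part = base_cost * bits_flipped
    overshoot = max(0.0, bits_flipped / soft_budget - 1.0)
    return linear_part * math.exp(overshoot)

def net_utility(task_utility: float, bits_flipped: float) -> float:
    """Task utility net of the computation penalty."""
    return task_utility - compute_cost(bits_flipped)
```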
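
Principle 5 (satisficing) has a standard reading: accept the first plan that clears an aspiration threshold rather than searching for the maximizer. The threshold and the utility estimator below are hypothetical placeholders.

```python
# Toy sketch of principle 5 (satisficing): return the first "good enough" plan
# instead of optimizing over all candidates. Threshold and estimator are
# placeholders supplied by the caller.
from typing import Callable, Iterable, Optional, TypeVar

Plan = TypeVar("Plan")

def satisfice(plans: Iterable[Plan],
              estimated_utility: Callable[[Plan], float],
              aspiration: float) -> Optional[Plan]:
    """Return the first plan whose estimated utility meets the aspiration level."""
    for plan in plans:
        if estimated_utility(plan) >= aspiration:
            return plan
    return None  # no acceptable plan; defer to humans rather than optimize harder
```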
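
For principle 8, one way an external watchdog could work is to track summary statistics of the main system's behaviour and trigger a shutdown when they drift from a reference distribution; the per-feature z-score test and threshold below are stand-ins for a real out-of-distribution detector.

```python
# Toy sketch of principle 8: a small, fast external watchdog that compares
# behaviour features against statistics of known-good behaviour and signals
# shutdown when any feature drifts too far. The z-score test is illustrative.
import numpy as np

class Watchdog:
    def __init__(self, reference_features: np.ndarray, z_threshold: float = 6.0):
        # reference_features: (n_samples, n_features) from known-good behaviour
        self.mean = reference_features.mean(axis=0)
        self.std = reference_features.std(axis=0) + 1e-8
        self.z_threshold = z_threshold

    def check(self, features: np.ndarray) -> bool:
        """Return True if behaviour looks in-distribution, False to trigger shutdown."""
        z = np.abs((features - self.mean) / self.std)
        return bool(z.max() <= self.z_threshold)
```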
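
One possible incentive for principle 14 is a regularizer that penalizes connections crossing module boundaries, pushing the network toward internal modularity with a narrow, inspectable interface; the L1 form and the explicit module assignment are assumptions for illustration.

```python
# Toy sketch of principle 14: add an L1 penalty on weights whose endpoints
# live in different modules, encouraging modularity and a sparse interface
# across module boundaries. Module assignment scheme is illustrative only.
import numpy as np

def cross_module_penalty(weights: np.ndarray,
                         module_of_unit: np.ndarray,
                         strength: float = 1e-3) -> float:
    """L1 penalty on weights connecting units assigned to different modules."""
    # weights[i, j] connects unit j (input) to unit i (output)
    same_module = module_of_unit[:, None] == module_of_unit[None, :]
    return strength * float(np.abs(weights[~same_module]).sum())

# Usage: total_loss = task_loss + cross_module_penalty(W, module_ids)
```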