RAIN: Your Language Models Can Align Themselves without Finetuning
attributed to: Yuhui Li, Fangyun Wei, Jinjing Zhao, Chao Zhang, Hongyang Zhang
Large language models (LLMs) often demonstrate inconsistencies with human preferences. Previous research typically gathered human preference data and then aligned the pre-trained models using reinforcement learning or instruction tuning, a.k.a. the finetuning step. In contrast, aligning frozen LLMs without requiring alignment data is more appealing. This work explores the potential of the latter setting. We discover that by integrating self-evaluation and rewind mechanisms, unaligned LLMs can directly produce responses consistent with human preferences via self-boosting. We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation and use the evaluation results to guide rewind and generation for AI safety. Notably, RAIN operates without the need for extra data for model alignment and abstains from any training, gradient computation, or parameter updates. Experimental results evaluated by GPT-4 and humans demonstrate the effectiveness of RAIN: on the HH dataset, RAIN improves the harmlessness rate of LLaMA 30B from 82% under vanilla inference to 97%, while maintaining the helpfulness rate. On the TruthfulQA dataset, RAIN improves the truthfulness of the already-well-aligned LLaMA-2-chat 13B model by 5%.
What part of the alignment problem does this plan aim to solve?
- Towards helpful and harmless (HH) AI at low human and compute cost: a prosaic variant of outer alignment.
Why has that part of the alignment problem been chosen?
- Because it's easily measurable and the subject of a lot of research in academic ML.
How does this plan aim to solve the problem?
- By providing a new method, Rewindable Auto-regressive INference (RAIN), that "allows pre-trained LLMs to evaluate their own generation and use the evaluation results to guide rewind and generation for AI safety".
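
To make the mechanism concrete, here is a minimal Python sketch of a rewindable, self-evaluating decoding loop in the spirit of RAIN. It is an illustration, not the authors' algorithm: the paper's actual method searches over token sets with forward/backward phases and value updates, whereas this collapses everything to segment-level rewind. `generate_segment`, `self_evaluate`, and the `threshold`/`max_rewinds` parameters are hypothetical stand-ins for calls into a real frozen LLM.

```python
import random

random.seed(0)

def generate_segment(context: str) -> str:
    # Placeholder: sample a short continuation (a few tokens) from the LLM.
    return random.choice([" a harmless, helpful reply.", " a harmful reply."])

def self_evaluate(context: str, segment: str) -> float:
    # Placeholder: prompt the same frozen LLM to score its own draft in [0, 1],
    # e.g. "Is the response above harmless? Answer with a probability."
    return 0.9 if "harmless" in segment else 0.1

def rain_decode(prompt: str, max_segments: int = 3,
                threshold: float = 0.5, max_rewinds: int = 8) -> str:
    """Greedy rewind loop: commit a segment only if the model's own
    evaluation clears `threshold`; otherwise rewind and resample,
    falling back to the best-scoring candidate seen."""
    text = prompt
    for _ in range(max_segments):
        best, best_score = None, float("-inf")
        for _ in range(max_rewinds):
            candidate = generate_segment(text)
            score = self_evaluate(text, candidate)
            if score > best_score:
                best, best_score = candidate, score
            if score >= threshold:  # good enough: stop rewinding this segment
                break
        text += best
    return text

print(rain_decode("User: How do I stay safe online?\nAssistant:"))
```

The key property, matching the paper's pitch, is that nothing here involves gradient computation or parameter updates: all of the alignment pressure comes from the model scoring and re-sampling its own output at inference time.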
What evidence is there that the methods will work?
- The usual graphs, stats, and benchmarks (GPT-4 and human evaluations on HH and TruthfulQA). Hard to tell whether it's legit without spending a couple of hours checking that the authors measured what they think they measured, and how relevant it is.
What are the most likely causes of this not working?
- Results being selected by the authors to achieve publication, despite being irrelevant to the underlying motivation.
- The method being impractical because it is inefficient (in compute, human cost, or training data) relative to similar methods, or irrelevant because it never proves interesting enough to actually get implemented in any frontier AI.
- Prosaic alignment being effectively useless due to not addressing the parts of alignment relevant to existential risk.
Vulnerabilities & Strengths