RLAIF: Scaling Reinforcement Learning from Human Feedback with AI Feedback

A key part of AI training is RLHF (Reinforcement Learning from Human Feedback), in which humans rate the quality of AI output, making AI results less harmful and more useful. The process is expensive and time-consuming, and it is also controversial: some of it relies on low-paid workers who are exposed to disturbing AI content in order to flag that content as inappropriate.

So it is potentially a big deal that this paper found AIs can fill the role of humans in the RLHF process. This new RLAIF approach is much cheaper and avoids harm to human raters. It appears to perform as well as traditional human-feedback methods, and may lower the rate of hallucinations as well. – ETHAN MOLLICK

