West-of-N: Synthetic Preference Generation for Improved Reward Modeling

Authors: Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn

What

This paper introduces West-of-N, a novel method for improving reward models in Reinforcement Learning from Human Feedback (RLHF) by generating synthetic preference data using Best-of-N sampling from a language model.

Why

This work addresses the critical bottleneck of data scarcity in RLHF by proposing a scalable method for generating high-quality, on-policy preference data, potentially reducing the reliance on expensive and time-consuming human annotations.

How

The authors propose a self-training strategy in which a base preference model (trained on the initial human preference data) scores a pool of N candidate responses sampled from the language model for each query, and the most and least preferred responses form a synthetic "West-of-N" pair. These synthetic preference pairs are then added to the training data to learn a more accurate reward model.
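As a rough illustration of this pipeline, the sketch below builds one synthetic preference pair per query. It is not the authors' implementation: the `generate` and `score` callables are hypothetical stand-ins for the policy language model's sampler and the base preference model's scoring function, and the function names are made up for this example.

```python
from typing import Callable, List, Tuple

def west_of_n_pair(
    query: str,
    generate: Callable[[str, int], List[str]],   # assumed: samples N responses from the LM
    score: Callable[[str, str], float],          # assumed: base preference model's scalar score
    n: int = 8,
) -> Tuple[str, str]:
    """Build one synthetic preference pair for `query`.

    Samples N on-policy candidates and uses the base preference model to pick
    the highest-scoring (preferred) and lowest-scoring (rejected) response.
    """
    candidates = generate(query, n)
    ranked = sorted(candidates, key=lambda response: score(query, response))
    worst, best = ranked[0], ranked[-1]
    return best, worst

def build_synthetic_dataset(queries, generate, score, n=8):
    """Collect (query, preferred, rejected) triples to augment reward-model training."""
    return [(q, *west_of_n_pair(q, generate, score, n)) for q in queries]
```

The resulting triples would simply be concatenated with the human preference data when training the final reward model.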

Result

Empirical results show that West-of-N significantly improves reward model accuracy and downstream language model alignment, outperforming baseline methods like RLAIF and RLCD. Notably, the gains from West-of-N are comparable to doubling the amount of human preference data.

Limitations and Future Work

Limitations include the risk that, with very large N, selecting West-of-N pairs can over-optimize against (reward-hack) the base preference model. Future work could address this through reward model uncertainty estimation. Additionally, exploring other self-training techniques from the literature could further enhance West-of-N.

Abstract

The success of reinforcement learning from human feedback (RLHF) in language model alignment is strongly dependent on the quality of the underlying reward model. In this paper, we present a novel approach to improve reward model quality by generating synthetic preference data, thereby augmenting the training dataset with on-policy, high-quality preference pairs. Motivated by the promising results of Best-of-N sampling strategies in language model training, we extend their application to reward model training. This results in a self-training strategy to generate preference pairs by selecting the best and worst candidates in a pool of responses to a given query. Empirically, we find that this approach improves the performance of any reward model, with an effect comparable to the addition of a similar quantity of human preference data. This work opens up new avenues of research for improving RLHF for language model alignment, by offering synthetic preference generation as a solution to reward modeling challenges.