A Minimaximalist Approach to Reinforcement Learning from Human Feedback
Authors: Gokul Swamy, Christoph Dann, Rahul Kidambi, Zhiwei Steven Wu, Alekh Agarwal
What
This paper introduces Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback (RLHF) that optimizes a policy directly from preference comparisons via self-play, without training a separate reward model and without adversarial training.
Why
Standard RLHF pipelines first fit a reward model to pairwise preferences and then optimize against it, which adds a brittle intermediate component and struggles with the non-Markovian, intransitive, and stochastic preferences that arise when aggregating human judgments. A simpler method that handles such preferences while remaining robust to the compounding errors of offline sequential prediction is therefore of significant practical interest.
How
The authors build on the Minimax Winner (MW), a preference-aggregation concept from social choice theory that casts learning from preferences as a zero-sum game between two policies. Exploiting the symmetry of this game, they prove that the MW can be computed by a single agent playing against itself, rather than by dueling two separate policies, while preserving strong convergence guarantees. Practically, SPO samples multiple trajectories from the current policy, queries a rater or preference model for pairwise comparisons, and uses each trajectory's proportion of wins as its reward for a standard RL update (see the sketch below).
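Illustrative sketch (my own code, not from the paper) of the self-play win-rate reward described above, assuming a pairwise preference oracle that returns 1.0 when the first trajectory is preferred, 0.0 when the second is, and 0.5 for a tie:

```python
# Minimal sketch of an SPO-style self-play reward computation. The
# `preference` oracle, trajectory representation, and batch size are
# illustrative assumptions, not the authors' exact implementation.

from itertools import combinations
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def spo_rewards(
    trajectories: Sequence[T],
    preference: Callable[[T, T], float],
) -> list[float]:
    """Assign each trajectory its win rate against the rest of the batch.

    Every trajectory is compared with every other trajectory sampled from
    the *same* policy (self-play); the fraction of comparisons it wins
    becomes its scalar reward for a standard RL update.
    """
    n = len(trajectories)
    wins = [0.0] * n
    for i, j in combinations(range(n), 2):
        p = preference(trajectories[i], trajectories[j])  # in [0, 1]
        wins[i] += p
        wins[j] += 1.0 - p
    # Normalize by the number of opponents each trajectory faced.
    return [w / (n - 1) for w in wins]


if __name__ == "__main__":
    # Toy example: each trajectory is summarized by a single score, and the
    # rater simply prefers the higher-scoring one.
    trajs = [0.1, 0.7, 0.4]
    rate = lambda a, b: 1.0 if a > b else (0.0 if a < b else 0.5)
    print(spo_rewards(trajs, rate))  # -> [0.0, 1.0, 0.5]
```

The resulting per-trajectory rewards can then be fed to any standard policy-gradient or actor-critic update.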
Result
On a suite of continuous control tasks, SPO learns significantly more efficiently than reward-model-based approaches while remaining robust to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments. The theoretical analysis shows that replacing two dueling policies with a single self-playing agent maintains strong convergence guarantees.
LF
Open directions include scaling SPO beyond the continuous control benchmarks studied here (for example, to language-model fine-tuning with real human raters) and reducing the cost of the online preference queries the method requires, since comparisons are made on freshly sampled trajectories rather than on a fixed offline preference dataset.
Abstract
We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback. Our approach is minimalist in that it does not require training a reward model nor unstable adversarial training and is therefore rather simple to implement. Our approach is maximalist in that it provably handles non-Markovian, intransitive, and stochastic preferences while being robust to the compounding errors that plague offline approaches to sequential prediction. To achieve the preceding qualities, we build upon the concept of a Minimax Winner (MW), a notion of preference aggregation from the social choice theory literature that frames learning from preferences as a zero-sum game between two policies. By leveraging the symmetry of this game, we prove that rather than using the traditional technique of dueling two policies to compute the MW, we can simply have a single agent play against itself while maintaining strong convergence guarantees. Practically, this corresponds to sampling multiple trajectories from a policy, asking a rater or preference model to compare them, and then using the proportion of wins as the reward for a particular trajectory. We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches while maintaining robustness to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments.
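For reference, a formal sketch of the Minimax Winner objective mentioned in the abstract, with notation assumed here rather than copied from the paper: P(ξ ≻ ξ') denotes the probability that a rater prefers trajectory ξ over ξ'.

```latex
% Minimax Winner (MW): the optimal strategy of the symmetric zero-sum game
% induced by the (possibly intransitive, stochastic) preference function P.
\[
  \pi^{\star} \in \arg\max_{\pi}\; \min_{\pi'}\;
    \mathbb{E}_{\xi \sim \pi,\; \xi' \sim \pi'}\!\left[ P(\xi \succ \xi') \right]
\]
% Because P(\xi \succ \xi') = 1 - P(\xi' \succ \xi), the game is symmetric and
% admits a symmetric equilibrium, which is what lets a single self-playing
% policy stand in for the two dueling players.
```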