Preference Fine-Tuning of LLMs Should Leverage Suboptimal, On-Policy Data

Authors: Fahim Tajwar, Anikait Singh, Archit Sharma, Rafael Rafailov, Jeff Schneider, Tengyang Xie, Stefano Ermon, Chelsea Finn, Aviral Kumar

What

This paper investigates the effectiveness of different fine-tuning methods for large language models (LLMs) on problems with binary preference data, focusing in particular on the roles of on-policy sampling and negative gradients.

Why

This paper clarifies the effectiveness and trade-offs of various LLM fine-tuning methods, guiding practitioners toward the approach best suited to their preference optimization problem. It unifies the seemingly distinct notions of on-policy sampling and negative gradients under the concept of mode-seeking objectives, which helps explain the behavior of different algorithms.

How

The authors conduct a rigorous empirical study spanning didactic bandit problems, synthetic LLM problems with hand-crafted reward functions, and full-scale LLM fine-tuning on real human preference data from AlpacaFarm and UltraFeedback. They compare a range of algorithms (PPO, REINFORCE, DPO, IPO, RWR, Pref-FT, Best-of-N) while varying the degree of on-policy sampling and the use of negative gradients.
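
To ground the "negative gradient" terminology, here is a minimal sketch of the DPO objective, one of the contrastive methods in the comparison. The tensor names are placeholders for per-sequence log-probabilities (summed over tokens) under the trained policy and the frozen reference model; this is an illustrative sketch, not the authors' implementation.

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected,
             logp_chosen_ref, logp_rejected_ref, beta=0.1):
    """Minimal DPO objective on a batch of preference pairs.

    Minimizing the loss increases log pi(chosen) and decreases
    log pi(rejected); the latter term is the "negative gradient"
    that pushes down likelihood on dispreferred responses.
    """
    chosen_logratio = logp_chosen - logp_chosen_ref
    rejected_logratio = logp_rejected - logp_rejected_ref
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```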

Result

The key findings are that on-policy sampling significantly improves performance and efficiency, especially when the peak of the reward function lies in regions assigned low probability by the reference policy. Negative gradients are also beneficial, leading to faster convergence, and they complement on-policy sampling. The study unifies both techniques under the notion of mode-seeking divergences, which prioritize sharpening probability mass on high-reward regions, in contrast to mode-covering objectives such as maximum likelihood.
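
To make the mode-seeking versus mode-covering distinction concrete, below is a toy, self-contained sketch (ours, not the paper's experiment) that fits a categorical distribution to a peaked target under each direction of the KL divergence.

```python
import torch

# Bimodal target over 3 bins: a dominant "high-reward" mode at bin 0,
# a minor mode at bin 2, and near-zero mass at bin 1.
p = torch.tensor([0.79, 0.01, 0.20])

def fit_categorical(mode_seeking: bool, steps: int = 50, lr: float = 0.1):
    """Fit q = softmax(logits) to p under one of the two divergences."""
    logits = torch.zeros(3, requires_grad=True)       # start from a uniform q
    opt = torch.optim.SGD([logits], lr=lr)
    for _ in range(steps):
        q = torch.softmax(logits, dim=0)
        if mode_seeking:
            loss = (q * (q.log() - p.log())).sum()    # reverse KL(q || p)
        else:
            loss = (p * (p.log() - q.log())).sum()    # forward KL(p || q), i.e. maximum likelihood
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits, dim=0).detach()

print("mode-covering fit:", fit_categorical(mode_seeking=False))
print("mode-seeking  fit:", fit_categorical(mode_seeking=True))
# For the same step budget, the reverse-KL (mode-seeking) fit typically moves
# mass onto the dominant bin faster and drains the near-zero bin more
# aggressively, mirroring the paper's argument about faster mass reallocation.
```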

Limitations and Future Work

The paper acknowledges that it does not provide rigorous statistical guarantees for the observed benefits of on-policy sampling and negative gradients; formalizing these benefits is left to future work. Further directions include incorporating the role of pre-training distribution coverage, reward model quality, and recent minimax formulations of preference optimization.

Abstract

Learning from preference labels plays a crucial role in fine-tuning large language models. There are several distinct approaches for preference fine-tuning, including supervised learning, on-policy reinforcement learning (RL), and contrastive learning. Different methods come with different implementation tradeoffs and performance differences, and existing empirical findings present different conclusions, for instance, some results show that online RL is quite important to attain good fine-tuning results, while others find (offline) contrastive or even purely supervised methods sufficient. This raises a natural question: what kind of approaches are important for fine-tuning with preference data and why? In this paper, we answer this question by performing a rigorous analysis of a number of fine-tuning techniques on didactic and full-scale LLM problems. Our main finding is that, in general, approaches that use on-policy sampling or attempt to push down the likelihood on certain responses (i.e., employ a “negative gradient”) outperform offline and maximum likelihood objectives. We conceptualize our insights and unify methods that use on-policy sampling or negative gradient under a notion of mode-seeking objectives for categorical distributions. Mode-seeking objectives are able to alter probability mass on specific bins of a categorical distribution at a fast rate compared to maximum likelihood, allowing them to relocate masses across bins more effectively. Our analysis prescribes actionable insights for preference fine-tuning of LLMs and informs how data should be collected for maximal improvement.
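
For reference, the mode-covering and mode-seeking objectives mentioned above correspond to the two directions of the KL divergence; the notation below (policy $\pi_\theta$, target distribution $p^*$) is ours and only summarizes the standard definitions.

$$
\underbrace{\mathrm{KL}\!\left(p^{*} \,\|\, \pi_{\theta}\right)}_{\text{mode-covering / maximum likelihood}}
= \mathbb{E}_{y \sim p^{*}}\!\left[\log \frac{p^{*}(y)}{\pi_{\theta}(y)}\right],
\qquad
\underbrace{\mathrm{KL}\!\left(\pi_{\theta} \,\|\, p^{*}\right)}_{\text{mode-seeking / reverse KL}}
= \mathbb{E}_{y \sim \pi_{\theta}}\!\left[\log \frac{\pi_{\theta}(y)}{p^{*}(y)}\right].
$$

Minimizing the forward direction only needs samples from $p^*$ (e.g., preferred responses), whereas minimizing the reverse direction requires sampling from $\pi_\theta$ itself (on-policy) and explicitly penalizes mass placed where $p^*$ is small, which is where the negative-gradient effect enters.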