Self-Rewarding Language Models

Authors: Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston

What

The paper introduces Self-Rewarding Language Models: a single language model that both follows instructions and, via LLM-as-a-Judge prompting, assigns rewards to its own candidate responses. Those self-assigned rewards are turned into preference data, and the model is repeatedly improved on that data with Iterative DPO, so the same model supplies both the policy and the training signal.

Why

Reward models trained on human preferences are bottlenecked by human performance level, and once trained they are typically frozen, so they cannot improve alongside the language model they supervise. If superhuman agents will require superhuman feedback, the training signal itself must be able to keep improving. Letting the model reward its own outputs removes the separate frozen reward model and allows instruction following and reward modeling to improve together during training.

How

A Llama 2 70B seed model is first fine-tuned on a small amount of human-authored instruction-following and evaluation data. Each subsequent iteration then has the current model (1) generate new prompts and sample several candidate responses per prompt, (2) score each candidate with an LLM-as-a-Judge prompt using an additive 0-5 rubric, and (3) form preference pairs from the highest- and lowest-scoring candidates. The next model is obtained by DPO training on these pairs, producing a sequence of models M1, M2, M3, each of which creates the training data for its successor.
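Below is a minimal sketch of one such self-rewarding data-creation iteration. It is not the paper's code: the `model.generate` helper, the judge prompt wording, the score-parsing regex, and the sampling parameters are illustrative assumptions; only the overall shape (sample several candidates, self-score with an LLM-as-a-Judge rubric, keep best-versus-worst pairs for DPO) follows the method described above.

```python
import re

N_CANDIDATES = 4  # illustrative; the paper samples several candidates per prompt

# Paraphrase of an additive LLM-as-a-Judge rubric; the exact wording here is an
# assumption, not the paper's prompt template.
JUDGE_TEMPLATE = (
    "Review the user's question and the response below. Award points additively, "
    "up to 5, for relevance, coverage, helpfulness, clarity, and expert quality. "
    "End with the line 'Score: <total>'.\n\n"
    "Question: {prompt}\n\nResponse: {response}\n"
)

def score_with_judge(model, prompt: str, response: str) -> float:
    """Ask the model to grade its own response and parse the trailing score."""
    judgement = model.generate(JUDGE_TEMPLATE.format(prompt=prompt, response=response))
    match = re.search(r"Score:\s*([0-5](?:\.\d+)?)", judgement)
    return float(match.group(1)) if match else float("nan")

def build_preference_pairs(model, new_prompts):
    """Create (prompt, chosen, rejected) triples for the next round of DPO."""
    pairs = []
    for prompt in new_prompts:
        candidates = [model.generate(prompt, temperature=0.7) for _ in range(N_CANDIDATES)]
        scored = [(score_with_judge(model, prompt, c), c) for c in candidates]
        scored = [s for s in scored if s[0] == s[0]]   # drop unparsable (NaN) judgements
        if len(scored) < 2:
            continue
        best, worst = max(scored), min(scored)
        if best[0] > worst[0]:                         # ties carry no preference signal
            pairs.append({"prompt": prompt, "chosen": best[1], "rejected": worst[1]})
    return pairs
```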

Result

Instruction-following ability improves with each iteration, and so does the model's ability to judge responses, i.e., the quality of the rewards it assigns to itself. After three iterations, the fine-tuned Llama 2 70B model outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613.

LF

This is a preliminary study: only three iterations are run, on a single model and scale, so it is unknown whether the gains continue or saturate with further iterations. Evaluation relies largely on automatic GPT-4-based judgments, response length grows across iterations and may inflate those scores, and issues such as reward hacking and safety evaluation are left for future work.

Abstract

We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level; moreover, these separate frozen reward models cannot learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training, not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.
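Since the abstract centers on Iterative DPO, here is a minimal sketch of the standard DPO objective applied to the self-generated preference pairs. The function signature, the beta value, and the assumption that per-sequence log-probabilities are precomputed are illustrative; the paper's exact training configuration is not reproduced here, and the reference model is assumed to be the previous iteration's model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO objective: push the policy's log-ratio of chosen over rejected
    responses above that of a frozen reference model, scaled by beta and passed
    through -log(sigmoid(.)). beta=0.1 is a common default, not necessarily the
    paper's setting."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```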