Vision-Language Models as a Source of Rewards

Authors: Kate Baumli, Satinder Baveja, Feryal Behbahani, Harris Chan, Gheorghe Comanici, Sebastian Flennerhag, Maxime Gazeau, Kristian Holsheimer, Dan Horgan, Michael Laskin, Clare Lyle, Hussain Masoom, Kay McKinney, Volodymyr Mnih, Alexander Neitz, Fabio Pardo, Jack Parker-Holder, John Quan, Tim Rocktäschel, Himanshu Sahni, Tom Schaul, Yannick Schroecker, Stephen Spencer, Richie Steigerwald, Luyu Wang, Lei Zhang

What

This paper investigates the use of off-the-shelf vision-language models (VLMs), specifically CLIP, as reward functions for reinforcement learning agents in visual environments, enabling them to achieve language-specified goals.

Why

This paper is important because it addresses a key challenge in building generalist RL agents: the need for a large number of manually designed reward functions. Using VLMs as reward generators has the potential to significantly improve the scalability and efficiency of training agents that can perform diverse tasks in complex environments.

How

The authors derive a binary reward signal from CLIP by (1) computing the probability that a language-specified goal has been achieved, based on the cosine similarities between the image embedding and the text embeddings of the goal prompt and a set of negative prompts, and (2) thresholding this probability. They then use this reward to train RL agents in two visual domains, Playhouse and AndroidEnv, and evaluate the agents on achieving a variety of language-specified goals. A sketch of the reward computation is given below.
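The following is a minimal sketch of such a CLIP-derived binary reward, not the paper's exact implementation: the checkpoint name, prompts, logit scale, and threshold value are illustrative assumptions.

```python
# Sketch of a CLIP-based binary reward for a language-specified goal.
# Assumes the open_clip package; model/prompt/threshold choices are illustrative.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")


def clip_binary_reward(frame: Image.Image, goal_text: str,
                       negative_texts: list[str],
                       threshold: float = 0.5) -> float:
    """Return 1.0 if the frame is judged to show the goal, else 0.0.

    The probability of goal achievement is a softmax over cosine
    similarities between the image embedding and the text embeddings
    of the goal prompt and a set of negative (distractor) prompts.
    """
    with torch.no_grad():
        image_emb = model.encode_image(preprocess(frame).unsqueeze(0))
        text_emb = model.encode_text(tokenizer([goal_text] + negative_texts))

        # Cosine similarity = dot product of L2-normalized embeddings.
        image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
        logits = 100.0 * image_emb @ text_emb.T  # typical CLIP logit scale

        # Index 0 is the goal prompt; the remaining entries are negatives.
        p_goal = logits.softmax(dim=-1)[0, 0].item()

    return 1.0 if p_goal > threshold else 0.0
```

In an RL loop, this function would be called on each environment frame (or at episode end) with the current goal string, and its output used in place of a hand-designed reward.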

Result

The key findings suggest that maximizing the VLM-derived reward leads to an improvement in ground truth reward, indicating the effectiveness of VLMs as reward functions. The authors also show that larger VLMs lead to more accurate rewards and subsequently better agent performance. Furthermore, they demonstrate the importance of prompt engineering in improving the performance of the VLM reward model.

Limitations & Future Work

The paper acknowledges the potential for reward hacking as a limitation, although no hacking was observed within the scope of their experiments. Future work could generalize negative sampling by drawing negatives from generative distributions, such as LLMs. The authors also suggest exploring how continued advances in VLMs could enable training generalist agents without domain-specific fine-tuning.

Abstract

Building generalist agents that can accomplish many goals in rich open-ended environments is one of the research frontiers for reinforcement learning. A key limiting factor for building generalist agents with RL has been the need for a large number of reward functions for achieving different goals. We investigate the feasibility of using off-the-shelf vision-language models, or VLMs, as sources of rewards for reinforcement learning agents. We show how rewards for visual achievement of a variety of language goals can be derived from the CLIP family of models, and used to train RL agents that can achieve a variety of language goals. We showcase this approach in two distinct visual domains and present a scaling trend showing how larger VLMs lead to more accurate rewards for visual goal achievement, which in turn produces more capable RL agents.