Testing Language Model Agents Safely in the Wild

Authors: Silen Naihin, David Atkinson, Marc Green, Merwane Hamadi, Craig Swift, Douglas Schonholtz, Adam Tauman Kalai, David Bau

What

This paper proposes a framework for conducting safe tests of autonomous language model agents on the open internet by introducing a context-sensitive safety monitor that can identify and stop unsafe agent actions.

Why

As language model agents become increasingly capable and prevalent, it’s crucial to ensure they are tested safely in real-world environments to prevent potential harm and build trust in their deployment.

How

The authors developed a dataset of agent outputs, including manually crafted unsafe examples, and designed a safety monitor (AgentMonitor) based on GPT-3.5-turbo. They trained and evaluated the monitor’s ability to identify and stop unsafe actions using various parameters like task context, previous actions, and whitelists.

Result

The AgentMonitor achieved promising results on a test set, with an F1 score of 89.4%. Ablation studies revealed that access to the agent’s previous context was crucial for the monitor’s performance. The authors also highlighted the need for well-specified threat models and comprehensive example sets for few-shot learning in the monitor.

LF

The authors identify limitations such as the need for larger, better-categorized datasets of attacks and a clearer distinction between off-task and unsafe outputs. Future work will focus on improving the AgentMonitor’s ability to make this distinction, minimizing the need for human intervention in safe testing.

Abstract

A prerequisite for safe autonomy-in-the-wild is safe testing-in-the-wild. Yet real-world autonomous tests face several unique safety challenges, both due to the possibility of causing harm during a test, as well as the risk of encountering new unsafe agent behavior through interactions with real-world and potentially malicious actors. We propose a framework for conducting safe autonomous agent tests on the open internet: agent actions are audited by a context-sensitive monitor that enforces a stringent safety boundary to stop an unsafe test, with suspect behavior ranked and logged to be examined by humans. We design a basic safety monitor (AgentMonitor) that is flexible enough to monitor existing LLM agents, and, using an adversarial simulated agent, we measure its ability to identify and stop unsafe situations. Then we apply the AgentMonitor on a battery of real-world tests of AutoGPT, and we identify several limitations and challenges that will face the creation of safe in-the-wild tests as autonomous agents grow more capable.