FIND: A Function Description Benchmark for Evaluating Interpretability Methods

Authors: Sarah Schwettmann, Tamar Rott Shaham, Joanna Materzynska, Neil Chowdhury, Shuang Li, Jacob Andreas, David Bau, Antonio Torralba

What

This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the ability of automated methods to interpret and describe the behavior of black-box functions.
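To make the benchmark's shape concrete, here is a minimal sketch, under my own assumptions, of what a FIND-style entry could look like: a procedurally generated black-box function paired with the ground-truth description an interpretation method is expected to recover. The function, field names, and data layout below are hypothetical illustrations, not the authors' actual code or data format.

```python
# Hypothetical sketch of a FIND-style benchmark entry: a black-box function
# plus the ground-truth description a method should recover. The structure
# is illustrative only, not the paper's actual format.

def mystery_function(x: float) -> float:
    """Black-box numeric function; only input-output access is exposed."""
    # Ground truth: doubles the input, except on the interval [10, 20],
    # where the behavior is locally "corrupted" and returns 0.
    if 10 <= x <= 20:
        return 0.0
    return 2 * x

entry = {
    "domain": "numeric",
    "function": mystery_function,
    "ground_truth_description": (
        "Returns 2*x for most inputs, but returns 0 on the interval [10, 20]."
    ),
}

if __name__ == "__main__":
    f = entry["function"]
    print(f(3), f(15), f(25))  # 6 0.0 50
```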

Why

This paper addresses the growing need for automated interpretability methods for increasingly complex AI models by introducing a benchmark to evaluate and compare these methods on functions with known structures.

How

The authors construct FIND, a benchmark suite of over 2000 procedurally generated functions of varying complexity, spanning numeric functions, string functions, and synthetic neural modules. They evaluate both non-interactive interpretation methods (MILAN-like description from precomputed exemplars) and interactive ones (Automated Interpretability Agents, or AIAs, which choose their own inputs to query), built on off-the-shelf LMs such as GPT-4, GPT-3.5, and Llama-2. Generated descriptions are scored against ground-truth explanations using code execution accuracy and a novel unit-testing protocol that uses a fine-tuned Vicuna-13b model as an evaluator.
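As a rough illustration of the interactive setup, the sketch below shows one way an AIA-style loop could be wired up: an LM repeatedly proposes inputs to try, observes the black-box function's outputs, and refines a candidate description. The helper names (`query_lm`, `parse_proposed_inputs`, the prompt format) are placeholders I invented for the sketch, not the paper's API; in the actual benchmark the agent's final answer also includes a code implementation of its hypothesis, which is what the execution-accuracy metric runs.

```python
from typing import Callable, List, Tuple

# Hypothetical sketch of an AIA-style interaction loop. The LM is abstracted
# behind `query_lm`, which you would replace with a real API call (e.g. to
# GPT-4); the prompt format and helpers are illustrative only.

Observation = Tuple[float, float]

def query_lm(prompt: str) -> str:
    """Placeholder for an LM call; returns a canned reply for demonstration."""
    return ("TRY: 0, 5, 15, 30\n"
            "DESCRIPTION: roughly linear (2*x), possible anomaly near [10, 20]")

def parse_proposed_inputs(reply: str) -> List[float]:
    """Extract the inputs the agent wants to try from its reply (toy parser)."""
    line = next(l for l in reply.splitlines() if l.startswith("TRY:"))
    return [float(tok) for tok in line[len("TRY:"):].split(",")]

def interpret(black_box: Callable[[float], float], rounds: int = 3) -> str:
    """Run a few rounds of experiment -> observation -> description update."""
    observations: List[Observation] = []
    description = "unknown"
    for _ in range(rounds):
        prompt = (
            "You are interpreting a black-box function.\n"
            f"Observations so far: {observations}\n"
            f"Current description: {description}\n"
            "Propose inputs to try (TRY: ...) and an updated description "
            "(DESCRIPTION: ...)."
        )
        reply = query_lm(prompt)
        for x in parse_proposed_inputs(reply):
            observations.append((x, black_box(x)))
        description = reply.split("DESCRIPTION:")[-1].strip()
    return description

if __name__ == "__main__":
    f = lambda x: 0.0 if 10 <= x <= 20 else 2 * x
    print(interpret(f))
```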

Result

GPT-4 consistently outperforms other LMs as an interpretability agent, demonstrating the potential of LMs for automated interpretability. However, even GPT-4 struggles with complex functions, highlighting the need for additional tools and techniques beyond current LMs. Initializing the AIA with exemplars dramatically improves performance, suggesting the importance of strategic data selection. The unit-testing protocol with the fine-tuned Vicuna evaluator demonstrates strong agreement with human judgments.

Limitations and Future Work

The authors acknowledge that FIND focuses solely on black-box interpretation and does not yet evaluate methods on real-world models. Future work will extend FIND to white-box interpretation problems, including descriptions of individual components within neural circuits. The authors also aim to explore tools that support more strategic sampling of function behavior, and to fine-tune LMs specifically for interpretability.

Abstract

Labeling neural network submodules with human-legible descriptions is useful for many downstream tasks: such descriptions can surface failures, guide interventions, and perhaps even explain important model behaviors. To date, most mechanistic descriptions of trained networks have involved small models, narrowly delimited phenomena, and large amounts of human labor. Labeling all human-interpretable sub-computations in models of increasing size and complexity will almost certainly require tools that can generate and validate descriptions automatically. Recently, techniques that use learned models in-the-loop for labeling have begun to gain traction, but methods for evaluating their efficacy are limited and ad-hoc. How should we validate and compare open-ended labeling tools? This paper introduces FIND (Function INterpretation and Description), a benchmark suite for evaluating the building blocks of automated interpretability methods. FIND contains functions that resemble components of trained neural networks, and accompanying descriptions of the kind we seek to generate. The functions span textual and numeric domains, and involve a range of real-world complexities. We evaluate methods that use pretrained language models (LMs) to produce descriptions of function behavior in natural language and code. Additionally, we introduce a new interactive method in which an Automated Interpretability Agent (AIA) generates function descriptions. We find that an AIA, built from an LM with black-box access to functions, can infer function structure, acting as a scientist by forming hypotheses, proposing experiments, and updating descriptions in light of new data. However, AIA descriptions tend to capture global function behavior and miss local details. These results suggest that FIND will be useful for evaluating more sophisticated interpretability methods before they are applied to real-world models.