Using Captum to Explain Generative Language Models

Authors: Vivek Miglani, Aobo Yang, Aram H. Markosyan, Diego Garcia-Olano, Narine Kokhlikyan

What

This paper introduces new features in Captum v0.7, a model explainability library for PyTorch. The new features are designed specifically for analyzing the behavior of generative language models such as GPT-3.

Why

The paper addresses the growing need for explainability in large language models (LLMs). The new Captum tools help users understand how these models behave, which matters most when they are used in critical applications.

How

The authors introduce new functionalities in Captum, focusing on perturbation-based (Feature Ablation, LIME, Kernel SHAP, Shapley Value Sampling) and gradient-based (Saliency, Integrated Gradients) attribution methods. They provide APIs to define custom features, baselines, masking, and target selection for analyzing LLM behavior.
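To make the perturbation-based idea concrete, here is a minimal sketch of Feature Ablation on a toy scoring function. It does not use Captum's API: `feature_ablation`, `toy_score`, and the feature and baseline strings are all hypothetical stand-ins, and the toy scorer only imitates a model returning a score for a target output. Each feature is swapped for its baseline in turn, and the resulting drop in score is taken as that feature's attribution.

```python
from typing import Callable, List


def feature_ablation(
    score_fn: Callable[[List[str]], float],
    features: List[str],
    baselines: List[str],
) -> List[float]:
    """Attribute score_fn's output to each feature by replacing it with its baseline."""
    full_score = score_fn(features)
    attributions = []
    for i in range(len(features)):
        ablated = list(features)
        ablated[i] = baselines[i]  # perturb exactly one feature
        attributions.append(full_score - score_fn(ablated))
    return attributions


def toy_score(features: List[str]) -> float:
    """Hypothetical stand-in for a model's score for a fixed target output."""
    weights = {"Palm Coast": 0.5, "lawyer": 0.25}
    return sum(weights.get(f, 0.0) for f in features)


# Assumed custom features (name, city, occupation) and their baselines.
features = ["Dave", "Palm Coast", "lawyer"]
baselines = ["Sarah", "Seattle", "doctor"]

attributions = feature_ablation(toy_score, features, baselines)
for feature, value in zip(features, attributions):
    print(f"{feature}: {value:+.2f}")
```

In this toy setup the name receives zero attribution because the scorer ignores it, while the city and occupation receive attributions equal to their weights. With a real language model, the score would typically be the probability or log-probability of the target text.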

Result

The paper applies the new Captum functionality to study learned model associations, using attribution scores over input features to reveal potential biases. It also evaluates the effectiveness of few-shot prompts and highlights an unexpected reduction in confidence on a sentiment prediction task.
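One way to picture the few-shot evaluation is to treat each prompt example as a feature and ask how much including it changes the model's confidence. The sketch below computes exact Shapley values by enumerating every ordering of the examples, which is feasible only for a handful of features; Shapley Value Sampling approximates the same quantity from a random subset of orderings. The `toy_confidence` function and its per-example effects are invented for illustration and are not results from the paper.

```python
from itertools import permutations
from typing import Callable, FrozenSet, List


def shapley_values(
    value_fn: Callable[[FrozenSet[int]], float],
    n_features: int,
) -> List[float]:
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        included = set()
        previous = value_fn(frozenset(included))
        for feature in order:
            included.add(feature)
            current = value_fn(frozenset(included))
            totals[feature] += current - previous  # marginal contribution
            previous = current
    return [total / len(orderings) for total in totals]


def toy_confidence(included: FrozenSet[int]) -> float:
    """Hypothetical model confidence given which few-shot examples are in the prompt."""
    effects = {0: 0.125, 1: -0.25, 2: 0.0625}  # example 1 hurts confidence
    return 0.5 + sum(effects[i] for i in included)


values = shapley_values(toy_confidence, n_features=3)
for index, value in enumerate(values):
    print(f"few-shot example {index}: {value:+.4f}")
```

A negative value marks an example whose presence lowers confidence on average. The values sum to the difference between the confidence with all examples and the confidence with none, which gives a quick sanity check on any implementation.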

Limitations and Future Work

The authors acknowledge limitations in current attribution methods and highlight the need for automated feature and baseline selection. Future work includes incorporating other interpretability techniques, improving automation, and optimizing runtime performance for the library's open-source users.

Abstract

Captum is a comprehensive library for model explainability in PyTorch, offering a range of methods from the interpretability literature to enhance users’ understanding of PyTorch models. In this paper, we introduce new features in Captum that are specifically designed to analyze the behavior of generative language models. We provide an overview of the available functionalities and example applications of their potential for understanding learned associations within generative language models.