Localized Symbolic Knowledge Distillation for Visual Commonsense Models

Authors: Jae Sung Park, Jack Hessel, Khyathi Raghavi Chandu, Paul Pu Liang, Ximing Lu, Peter West, Youngjae Yu, Qiuyuan Huang, Jianfeng Gao, Ali Farhadi, Yejin Choi

What

This paper introduces Localized Symbolic Knowledge Distillation (LSKD), a method for generating a localized visual commonsense corpus by prompting a large language model (LLM) with global and local image descriptions. This corpus is then used to train vision-language models that can accept region references as input, enabling more precise and context-aware reasoning within images.

Why

This paper is important because it addresses the limitations of existing vision-language models in performing localized reasoning within images. By enabling users to specify regions of interest, it paves the way for more intuitive and precise multimodal interactions. Furthermore, the paper demonstrates that machine-generated, localized visual commonsense corpora can be as effective as human-annotated datasets, opening new avenues for scalable and cost-effective model training.

How

The authors propose a multi-stage approach:
1) Image-to-text verbalization of global image content, local region descriptions, and dynamic question-answer pairs.
2) Prompting an LLM (ChatGPT) to generate localized commonsense knowledge in a question-answer-rationale format.
3) Training a supervised critic model to filter out erroneous or low-quality generated instances.
4) Fine-tuning vision-language models (e.g., BLIP-2) on the filtered corpus for both discriminative and generative localized visual reasoning tasks.
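The pipeline above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the prompt template, the `QAR` record, and the `distill`/`critic` interfaces are hypothetical stand-ins for the actual verbalizers, ChatGPT sampling, and the trained critic, and the 0.8 threshold is an arbitrary example value.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class QAR:
    """One generated instance: question, answer, rationale, plus the
    indices of the image regions it refers to."""
    question: str
    answer: str
    rationale: str
    regions: List[int]

def build_prompt(global_desc: str, region_descs: List[str]) -> str:
    """Combine the global and per-region verbalizations into one LLM
    prompt (format is illustrative, not the paper's exact template)."""
    regions = "\n".join(f"[{i}] {d}" for i, d in enumerate(region_descs))
    return (
        f"Image: {global_desc}\n"
        f"Regions:\n{regions}\n"
        "Generate commonsense question-answer-rationale triples that "
        "reference regions by their [index]."
    )

def distill(
    global_desc: str,
    region_descs: List[str],
    llm: Callable[[str], List[QAR]],   # stands in for sampling from ChatGPT
    critic: Callable[[QAR], float],    # stands in for the trained critic model
    threshold: float = 0.8,            # example cutoff; tuned in practice
) -> List[QAR]:
    """Sample localized commonsense candidates from the LLM, then keep
    only those the critic scores above the threshold."""
    candidates = llm(build_prompt(global_desc, region_descs))
    return [c for c in candidates if critic(c) >= threshold]
```

The filtered output of `distill` plays the role of the training corpus for the downstream fine-tuning stage; raising the threshold trades corpus size for quality, which is the knob the critic-model experiments vary.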

Results

Key findings:
1) Training on the LSKD corpus significantly improves the performance of vision-language models on localized visual reasoning benchmarks, outperforming baselines and even surpassing models trained on human-annotated data in some cases.
2) A supervised critic model effectively filters out erroneous instances, leading to improved downstream task performance.
3) Generative models fine-tuned with LSKD show promising results in localized question-answering, demonstrating the potential for more interactive and human-like multimodal communication.

Limitations and Future Work

The authors acknowledge limitations such as the potential for verbalizer errors and the coverage of question types in the generated corpus. Future work could focus on developing more robust verbalization techniques, expanding the diversity of question types, and exploring more sophisticated critic models to further enhance the quality and coverage of the generated knowledge.

Abstract

Instruction-following vision-language (VL) models offer a flexible interface that supports a broad range of multimodal tasks in a zero-shot fashion. However, interfaces that operate on full images do not directly enable the user to “point to” and access specific regions within images. This capability is important not only to support reference-grounded VL benchmarks, but also for practical applications that require precise within-image reasoning. We build Localized Visual Commonsense models, which allow users to specify (multiple) regions as input. We train our model by sampling localized commonsense knowledge from a large language model (LLM): specifically, we prompt an LLM to collect commonsense knowledge given a global literal image description and a local literal region description automatically generated by a set of VL models. With a separately trained critic model that selects high-quality examples, we find that training on the localized commonsense corpus can successfully distill existing VL models to support a reference-as-input interface. Empirical results and human evaluations in a zero-shot setup demonstrate that our distillation method results in more precise VL models of reasoning compared to a baseline of passing a generated referring expression to an LLM.