Mismatch Quest: Visual and Textual Feedback for Image-Text Misalignment

Authors: Brian Gordon, Yonatan Bitton, Yonatan Shafir, Roopal Garg, Xi Chen, Dani Lischinski, Daniel Cohen-Or, Idan Szpektor

What

This paper presents a method that enables image-text alignment models to explain the misalignment between text and images. It leverages LLMs and visual grounding models to generate plausible misaligned captions together with their corresponding textual and visual explanations.

Why

This paper is important because it addresses a key limitation of existing image-text alignment models, which provide only a binary assessment of alignment and fail to pinpoint the source of misalignment. The proposed method enables a detailed understanding of misalignment causes and facilitates the development of better image-text alignment models.

How

The authors propose a method called Mismatch-Quest, which first collects aligned image-text pairs from various datasets and then uses LLMs to generate misaligned captions along with their textual and visual explanations. To ensure quality, the generated captions and feedback are validated with entailment models, and a visual grounding model annotates each misalignment with a bounding box.
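
Below is a minimal sketch of this data-generation pipeline, assuming hypothetical helpers `generate_with_llm`, `nli_entails`, and `ground_phrase` as stand-ins for the paper's LLM, entailment filter, and visual grounding model; it illustrates the flow, not the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FeedbackExample:
    image_path: str
    misaligned_caption: str
    textual_feedback: str                       # e.g. "the dog is brown, not black"
    visual_feedback: Tuple[int, int, int, int]  # bounding box (x1, y1, x2, y2)


def generate_with_llm(prompt: str) -> Tuple[str, str, str]:
    """Placeholder: call an LLM and parse (misaligned caption, explanation, grounded phrase)."""
    raise NotImplementedError


def nli_entails(premise: str, hypothesis: str) -> bool:
    """Placeholder: return True if an entailment (NLI) model says premise entails hypothesis."""
    raise NotImplementedError


def ground_phrase(image_path: str, phrase: str) -> Optional[Tuple[int, int, int, int]]:
    """Placeholder: return a bounding box for `phrase` in the image, or None if not found."""
    raise NotImplementedError


def build_example(image_path: str, aligned_caption: str) -> Optional[FeedbackExample]:
    # 1) Ask the LLM to corrupt one detail of the aligned caption and explain the change.
    prompt = (
        f"Caption: '{aligned_caption}'.\n"
        "Change exactly one detail (object, attribute, relation, or count) so the caption "
        "no longer matches the image. Return the new caption, a one-sentence explanation "
        "of the mismatch, and the image phrase the change refers to."
    )
    misaligned_caption, explanation, grounded_phrase = generate_with_llm(prompt)

    # 2) Entailment filter: a genuinely misaligned caption should NOT be
    #    entailed by the original aligned caption.
    if nli_entails(premise=aligned_caption, hypothesis=misaligned_caption):
        return None  # still consistent with the original caption -> discard

    # 3) Visual feedback: localize the contradicted phrase with a grounding model.
    box = ground_phrase(image_path, grounded_phrase)
    if box is None:
        return None  # grounding failed -> discard

    return FeedbackExample(image_path, misaligned_caption, explanation, box)
```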

Result

The authors create a comprehensive training set named TV-Feedback with 3 million instances and introduce a human-annotated test set, the Mismatch-Quest Benchmark, with 2,008 instances. Fine-tuning PaLI vision-language models on TV-Feedback outperforms other baselines on both the binary alignment classification and explanation generation tasks, achieving over 10% improvement in alignment accuracy and 20% in textual feedback entailment.
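
The two reported metrics can be computed roughly as sketched below; this is an illustrative approximation, not the authors' evaluation code, and `nli_entails` is again a hypothetical stand-in for an entailment model.

```python
from typing import Callable, List


def alignment_accuracy(predictions: List[bool], labels: List[bool]) -> float:
    """Fraction of image-text pairs whose aligned/misaligned label is predicted correctly."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def feedback_entailment_rate(
    predicted_feedback: List[str],
    gold_feedback: List[str],
    nli_entails: Callable[[str, str], bool],
) -> float:
    """Fraction of generated textual explanations entailed by the ground-truth explanation."""
    hits = sum(nli_entails(gold, pred) for gold, pred in zip(gold_feedback, predicted_feedback))
    return hits / len(gold_feedback)
```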

Limitations and Future Work

The authors identify limitations such as failing to handle scenarios where no visual feedback is expected and struggling with instances that require identifying multiple misalignments. Future work includes enriching the training set with such scenarios to improve the model's ability to address diverse misalignment types.

Abstract

While existing image-text alignment models reach high quality binary assessments, they fall short of pinpointing the exact source of misalignment. In this paper, we present a method to provide detailed textual and visual explanation of detected misalignments between text-image pairs. We leverage large language models and visual grounding models to automatically construct a training set that holds plausible misaligned captions for a given image and corresponding textual explanations and visual indicators. We also publish a new human curated test set comprising ground-truth textual and visual misalignment annotations. Empirical results show that fine-tuning vision language models on our training set enables them to articulate misalignments and visually indicate them within images, outperforming strong baselines both on the binary alignment classification and the explanation generation tasks. Our method code and human curated test set are available at: https://mismatch-quest.github.io/