Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation

Authors: Yunhao Ge, Xiaohui Zeng, Jacob Samuel Huffman, Tsung-Yi Lin, Ming-Yu Liu, Yin Cui

What

This paper introduces VisualFactChecker (VFC), a training-free pipeline that leverages large language models (LLMs) and existing computer vision models to generate detailed and factually grounded captions for both 2D images and 3D objects.

Why

This work addresses limitations of current open-source image captioning models, which often produce captions that are too concise or contain hallucinations (i.e., descriptions of elements not present in the image). VFC fact-checks proposed captions using object detection and visual question answering (VQA) models, yielding captions with higher fidelity and accuracy than existing open-source models.

How

VFC operates through a three-step process: 1) Proposal: Multiple image captioning models generate initial captions. 2) Verification: An LLM uses object detection or VQA models to verify elements described in the captions. 3) Captioning: The LLM summarizes the initial captions and verification results to produce a final, factually grounded caption. The authors evaluate VFC on COCO (2D images) and Objaverse (3D objects) datasets using CLIP-Score, a novel CLIP-Image-Score, human evaluation via AMT, and GPT-4V for fine-grained analysis.
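A minimal sketch of this propose-verify-summarize flow, assuming the three stages are wired together roughly as described. The callables passed in (propose_captions, detect_objects, answer_question, call_llm) are hypothetical placeholders for the captioning models, detection/VQA tools, and LLM that the pipeline composes; they are not APIs from the paper.

```python
# Minimal sketch of the VFC propose-verify-summarize loop.
# The tool callables are hypothetical placeholders: any captioner,
# open-vocabulary detector, VQA model, and LLM with these rough
# signatures would fit.

def visual_fact_checker(image, propose_captions, detect_objects,
                        answer_question, call_llm,
                        instruction="Describe the image in detail."):
    # 1) Proposal: several image-to-text models each draft an initial caption.
    proposals = propose_captions(image)  # e.g. ["a dog on a couch", ...]

    # 2) Verification: the LLM turns the proposals into checkable queries,
    #    then an object detector and a VQA model answer each one.
    queries = call_llm(
        "List the objects and claims that should be verified in these "
        f"captions, one per line: {proposals}"
    ).splitlines()
    verification = []
    for query in queries:
        verification.append({
            "query": query,
            "detected": detect_objects(image, query),    # grounding check
            "vqa_answer": answer_question(image, query)  # attribute/relation check
        })

    # 3) Captioning: the LLM summarizes proposals + verification results
    #    into one final caption, following the requested style/instruction.
    return call_llm(
        f"Instruction: {instruction}\n"
        f"Proposed captions: {proposals}\n"
        f"Verification results: {verification}\n"
        "Write a single detailed caption that keeps only verified content."
    )
```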

Result

VFC outperforms state-of-the-art open-source captioning methods on both 2D and 3D captioning tasks. Notably, it achieves captioning quality comparable to proprietary models such as GPT-4V despite being over 10x smaller in model size. The CLIP-Image-Score introduced in this work proves effective at detecting hallucinations: it compares the original image with an image reconstructed from the generated caption by a text-to-image model, so hallucinated content shows up as a mismatch between the two images.
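A hedged sketch of how such a CLIP-Image-Score could be computed, following the description above: reconstruct an image from the caption with a text-to-image model, then compare CLIP image embeddings of the original and reconstructed images. The specific checkpoints below (openai/clip-vit-base-patch32, runwayml/stable-diffusion-v1-5) are illustrative assumptions, not necessarily the models used in the paper.

```python
# Sketch of CLIP-Image-Score: caption -> reconstructed image -> cosine
# similarity between CLIP embeddings of the original and reconstruction.
# Checkpoints are illustrative choices, not the paper's exact models.
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
t2i = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

def clip_image_score(original_image, caption):
    # Reconstruct the scene described by the caption.
    reconstructed = t2i(caption).images[0]

    # Embed both images with the CLIP image encoder and L2-normalize.
    inputs = processor(images=[original_image, reconstructed], return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)

    # Cosine similarity between the two image embeddings; hallucinated
    # caption content tends to lower this score.
    return (feats[0] @ feats[1]).item()
```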

Limitations and Future Work

The authors acknowledge that the current VFC implementation could better automate the choice of which components to use in a given scenario. Future work aims to address this limitation and to explore additional fact-checking components to further improve caption accuracy and detail.

Abstract

Existing automatic captioning methods for visual content face challenges such as lack of detail, content hallucination, and poor instruction following. In this work, we propose VisualFactChecker (VFC), a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal, where image-to-text captioning models propose multiple initial captions; 2) verification, where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning, where an LLM generates the final caption by summarizing caption proposals and the fact check verification results. In this step, VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original and the reconstructed image generated by a text-to-image model using the caption; 3) human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-sourced captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline, we can attain captioning capability comparable to proprietary models such as GPT-4V, despite being over 10x smaller in model size.
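For reference, the two CLIP-based metrics can be written compactly as cosine similarities in CLIP embedding space. This is a common formulation consistent with the abstract's description; the exact scaling or clipping used in the paper may differ.

```latex
% E_I, E_T: CLIP image and text encoders; G: text-to-image model.
% Common cosine-similarity form; the paper's exact scaling may differ.
\[
\mathrm{CLIPScore}(I, c) = \cos\big(E_I(I),\, E_T(c)\big),
\qquad
\mathrm{CLIPImageScore}(I, c) = \cos\big(E_I(I),\, E_I(G(c))\big),
\]
% where I is the original image, c the generated caption, and G(c) the
% image reconstructed from the caption by the text-to-image model.
```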