Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Authors: Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

What

This paper introduces a novel framework called Locate-Anything to enhance the spatial awareness of Visual-LLMs (V-LLMs) by incorporating textual image-space coordinates into both the input prompts and the LLM-generated outputs.
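
To make the core idea concrete, here is a minimal sketch (not the paper's exact format) of how an image-space bounding box can be serialized as plain text so that it passes through the LLM's tokenizer like ordinary words; the normalization and precision choices below are illustrative assumptions.

```python
# Illustrative sketch only: turning a pixel-space bounding box into a textual
# coordinate string that can be placed inside a V-LLM prompt or answer.
# The paper's exact coordinate representation may differ.

def box_to_text(box, image_w, image_h, precision=2):
    """Normalize pixel-space [x1, y1, x2, y2] to [0, 1] and render as text."""
    x1, y1, x2, y2 = box
    norm = [x1 / image_w, y1 / image_h, x2 / image_w, y2 / image_h]
    return "[" + ", ".join(f"{v:.{precision}f}" for v in norm) + "]"

# Example: a 640x480 image with an object at pixel box (96, 168, 384, 422)
print(box_to_text((96, 168, 384, 422), 640, 480))
# -> [0.15, 0.35, 0.60, 0.88]
```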

Why

This research is important because it addresses a critical limitation of current V-LLMs: their weak spatial reasoning and localization abilities. By improving the spatial awareness of V-LLMs, this work enables more grounded visual understanding and benefits localization-dependent vision-language tasks such as VQA and region description.

How

The authors propose three novel instruction fine-tuning objectives that leverage textual coordinate representations: Location Prediction, Negative Prediction, and Reverse-Location Prediction. They explore different coordinate representation schemes and introduce pseudo-data generation strategies to enhance data efficiency and extend the framework to video domains.
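
As a rough illustration of how these objectives might translate into instruction-response pairs, the hypothetical templates below pair a question with a textual-coordinate answer; the wording, coordinate format, and helper names are assumptions for exposition, not the paper's actual templates.

```python
# Hypothetical instruction/response templates for the three fine-tuning
# objectives. The paper's actual prompt wording and data pipeline may differ.

def location_prediction(obj, box_text):
    # Predict the textual coordinates of an object known to be in the image.
    return (f"Where is the {obj} located in the image?",
            f"The {obj} is located at {box_text}.")

def negative_prediction(absent_obj):
    # Ask about an object that is absent; the target teaches the model to
    # refuse rather than hallucinate a location.
    return (f"Where is the {absent_obj} located in the image?",
            f"There is no {absent_obj} in the image.")

def reverse_location_prediction(box_text, obj):
    # Given textual coordinates, describe what occupies that region.
    return (f"What object is located at {box_text}?",
            f"A {obj} is located at {box_text}.")

# Example usage with the serialized box from the earlier sketch
question, answer = location_prediction("dog", "[0.15, 0.35, 0.60, 0.88]")
print(question)
print(answer)
```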

Results

The proposed Locate-Anything model demonstrates significant improvements in spatial reasoning, outperforming existing V-LLMs on tasks such as distinguishing left from right object locations. It achieves state-of-the-art results on Image VQA, Video VQA, and Region Description benchmarks while effectively reducing object hallucination.

Limitations and Future Work

The paper identifies limitations in understanding temporal locations for video-based tasks, suggesting future work on incorporating time coordinates. Additionally, potential biases within training datasets are acknowledged, highlighting the need for careful consideration during model deployment.

Abstract

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs right location. In this work, we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.