Probing the 3D Awareness of Visual Foundation Models

Authors: Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani

What

This paper investigates the 3D awareness of visual foundation models, examining how well these models represent the 3D structure of scenes and objects from single and multiple views.

Why

This paper is important because it addresses the lack of understanding regarding how well visual foundation models, despite being trained on 2D data, represent the 3D world. This understanding is crucial as these models are increasingly used as backbones for 3D vision tasks.

How

The authors evaluate a range of visual foundation models, including those trained with classification, language supervision, self-supervision, and dense supervision, on their ability to estimate depth, surface normals, and 3D correspondence. They probe the frozen representations of these models using task-specific probes and zero-shot inference methods to assess the inherent 3D awareness of the learned features.
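The probing setup can be pictured as training a small read-out head on top of a frozen backbone. Below is a minimal, hypothetical PyTorch sketch using a DINOv2 ViT-S/14 backbone and a 1x1-convolution ("linear") depth probe with a binned-depth readout. The probe design, bin count, and shapes here are illustrative assumptions; the paper's actual probes (e.g., DPT-style dense probes) and training details differ, so treat this purely as an illustration of the frozen-feature probing idea and see the linked probe3d repository for the real implementation.

```python
import torch
import torch.nn as nn

class LinearDepthProbe(nn.Module):
    """Illustrative probe: predicts a per-patch depth map from frozen backbone features."""

    def __init__(self, feat_dim: int, n_bins: int = 256):
        super().__init__()
        # A 1x1 convolution acts as a per-location linear probe over the feature channels.
        self.head = nn.Conv2d(feat_dim, n_bins, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H_patches, W_patches) frozen features from the backbone
        logits = self.head(feats)                                # (B, n_bins, H, W)
        probs = logits.softmax(dim=1)
        # Expected depth over evenly spaced bin centers (one common binned-depth readout).
        centers = torch.linspace(0.0, 1.0, logits.shape[1], device=feats.device)
        depth = (probs * centers.view(1, -1, 1, 1)).sum(dim=1)  # (B, H, W), normalized depth
        return depth

# Usage sketch with a frozen DINOv2 ViT-S/14 backbone (available via torch.hub).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
for p in backbone.parameters():
    p.requires_grad_(False)                                      # keep the representation frozen

image = torch.randn(1, 3, 224, 224)                              # dummy input
with torch.no_grad():
    tokens = backbone.forward_features(image)["x_norm_patchtokens"]  # (1, 256, 384)
feats = tokens.transpose(1, 2).reshape(1, 384, 16, 16)           # back to the 16x16 patch grid

probe = LinearDepthProbe(feat_dim=384)
pred = probe(feats)                                              # only the probe's weights are trained
```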

Result

The analysis shows that self-supervised models capture surface properties such as depth and normals best, followed by text-conditioned generative models. However, all models struggle with multiview consistency, particularly under large viewpoint changes, suggesting they learn view-dependent rather than truly 3D-consistent representations. Semantic correspondence performance correlates more strongly with the single-view tasks than with the multiview ones, indicating that it may not be a reliable proxy for 3D consistency.
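The multiview-consistency evaluation can be illustrated with a zero-shot matching sketch: descriptors from two views of the same surface are matched by nearest neighbor in feature space, and recall is the fraction of matches that land within a pixel threshold of the ground-truth reprojection. The function names, shapes, and the 10-pixel threshold below are assumptions made for illustration, not the paper's exact protocol.

```python
import torch

def match_features(feats_a: torch.Tensor, feats_b: torch.Tensor) -> torch.Tensor:
    """For each descriptor in view A, return the index of its nearest neighbor in view B.

    feats_a: (N, C) L2-normalized descriptors from view A
    feats_b: (M, C) L2-normalized descriptors from view B
    """
    sim = feats_a @ feats_b.t()          # cosine similarity, since descriptors are normalized
    return sim.argmax(dim=1)             # (N,) best-match index in view B

def correspondence_recall(pred_xy_b: torch.Tensor,
                          gt_xy_b: torch.Tensor,
                          pix_thresh: float = 10.0) -> float:
    """Fraction of matches whose predicted 2D location in view B falls within
    pix_thresh pixels of the ground-truth reprojected location."""
    err = (pred_xy_b - gt_xy_b).norm(dim=1)   # per-match pixel error, shapes (N, 2)
    return (err < pix_thresh).float().mean().item()

# Example with random descriptors: 512 per view, 384 channels.
f_a = torch.nn.functional.normalize(torch.randn(512, 384), dim=1)
f_b = torch.nn.functional.normalize(torch.randn(512, 384), dim=1)
idx = match_features(f_a, f_b)           # (512,) nearest-neighbor indices in view B
```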

Limitations & Future Work

The authors acknowledge limitations, including the reliance on publicly available checkpoints trained on different datasets and with varying compute budgets, which may confound the comparisons. They suggest that future work use more controlled experiments to isolate the impact of the training signal and explore aspects of 3D understanding beyond surface reconstruction and multiview consistency.

Abstract

Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, their intermediate representations are useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask whether they also represent their 3D structure? In this work, we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models. Our code and analysis can be found at https://github.com/mbanani/probe3d.