View Selection for 3D Captioning via Diffusion Ranking

Authors: Tiange Luo, Justin Johnson, Honglak Lee

What

This paper tackles hallucination in 3D object captioning, particularly in the Cap3D pipeline. It introduces DiffuRank, a technique that uses a pre-trained text-to-3D diffusion model to rank rendered 2D views of a 3D object by how well they align with the object’s characteristics, yielding more accurate and detailed captions.

Why

This work is important because it addresses a key challenge in building large-scale 3D-text datasets: the generation of inaccurate captions due to the limitations of existing captioning models when presented with atypical or challenging views of 3D objects. By improving the accuracy and richness of 3D captions, this work can significantly benefit various 3D-related applications, including text-to-3D generation, image-to-3D conversion, robot learning, and 3D language model pre-training.

How

The authors developed DiffuRank, an algorithm that uses a pre-trained text-to-3D diffusion model to estimate how well each rendered 2D view aligns with the underlying 3D object. For each view, they generated multiple captions with an image captioning model and fed these captions, together with the 3D object’s features, into the diffusion model. Each view was scored by the average diffusion loss of its captions: views whose captions yielded lower loss (i.e., better explained the object) were ranked higher, identifying the views that best convey the object’s 3D information. The top-ranked views were then passed to GPT4-Vision to generate the final captions.
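
As a rough illustration, here is a minimal Python sketch of this ranking loop. The helper callables `caption_fn` and `loss_fn` (standing in for the image captioning model and the text-to-3D diffusion model’s denoising loss), as well as the default numbers of captions and selected views, are assumptions for illustration, not the authors’ exact implementation or settings.

```python
import numpy as np

def diffurank_select_views(obj_3d, views, caption_fn, loss_fn,
                           captions_per_view=5, top_k=6):
    """Rank pre-rendered 2D views of a 3D object and return the top-k.

    caption_fn(view) -> str           : image captioning model (placeholder)
    loss_fn(obj_3d, caption) -> float : average denoising loss of a pre-trained
                                        text-to-3D diffusion model conditioned
                                        on the caption (lower = better aligned)
    """
    avg_losses = []
    for view in views:
        # Sample several candidate captions per view; averaging over them
        # makes the score less sensitive to any single bad caption.
        captions = [caption_fn(view) for _ in range(captions_per_view)]
        losses = [loss_fn(obj_3d, c) for c in captions]
        avg_losses.append(np.mean(losses))
    # Views whose captions best "explain" the 3D object get the lowest loss.
    order = np.argsort(avg_losses)
    return [views[i] for i in order[:top_k]]
```

The selected views are what gets forwarded to GPT4-Vision; averaging the loss over several sampled captions (and, in practice, over noise samples inside the diffusion loss) keeps the ranking stable.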

Result

The authors demonstrate that DiffuRank, in conjunction with GPT4-Vision, significantly improves the quality of captions for 3D objects. Key findings include: (1) DiffuRank effectively reduces hallucinations in captions, as evidenced by human studies and automated metrics. (2) Captions generated using DiffuRank are richer in detail and more accurate compared to those produced using all rendered views or a fixed set of horizontally placed views. (3) Using fewer but more informative views selected by DiffuRank can lead to better captions than using a large number of views indiscriminately. (4) DiffuRank can be extended to 2D domains and has shown promising results in Visual Question Answering tasks, outperforming CLIP on a challenging benchmark.
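
For context on finding (4), the sketch below shows one way the same scoring idea transfers to 2D VQA: each candidate answer is appended to the question and scored by a pre-trained text-to-image diffusion model’s denoising loss on the image, and the lowest-loss answer is selected. The prompt construction and the `t2i_loss_fn` callable are illustrative assumptions, not the authors’ exact setup.

```python
import numpy as np

def diffurank_vqa(image, question, candidate_answers, t2i_loss_fn):
    """Select the answer whose text best explains the image.

    t2i_loss_fn(image, text) -> float : average denoising loss of a pre-trained
    text-to-image diffusion model on `image` conditioned on `text`
    (lower loss = better image-text alignment).
    """
    losses = []
    for answer in candidate_answers:
        prompt = f"{question} {answer}"  # naive prompt construction (assumption)
        losses.append(t2i_loss_fn(image, prompt))
    return candidate_answers[int(np.argmin(losses))]
```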

Limitations and Future Work

The authors acknowledge the limitations of DiffuRank, particularly its computational cost: it requires rendering multiple views, generating captions for each view, and running inference through a diffusion model. This runtime becomes a bottleneck for tasks involving many candidate options, such as classification or image-text retrieval. Future work could focus on improving the efficiency of DiffuRank to make it more scalable for such tasks. The authors also suggest adopting more powerful text-to-3D and captioning models to further improve the accuracy and detail of the generated captions, and expanding the dataset to cover all of Objaverse-XL.

Abstract

Scalable annotation approaches are crucial for constructing extensive 3D-text datasets, facilitating a broader range of applications. However, existing methods sometimes lead to the generation of hallucinated captions, compromising caption quality. This paper explores the issue of hallucination in 3D object captioning, with a focus on the Cap3D method, which renders 3D objects into 2D views for captioning using pre-trained models. We pinpoint a major challenge: certain rendered views of 3D objects are atypical, deviating from the training data of standard image captioning models and causing hallucinations. To tackle this, we present DiffuRank, a method that leverages a pre-trained text-to-3D model to assess the alignment between 3D objects and their 2D rendered views, where views with high alignment closely represent the object’s characteristics. By ranking all rendered views and feeding the top-ranked ones into GPT4-Vision, we enhance the accuracy and detail of captions, enabling the correction of 200k captions in the Cap3D dataset and extending it to 1 million captions across the Objaverse and Objaverse-XL datasets. Additionally, we showcase the adaptability of DiffuRank by applying it to pre-trained text-to-image models for a Visual Question Answering task, where it outperforms the CLIP model.