Object Recognition as Next Token Prediction

Authors: Kaiyu Yue, Bor-Chun Chen, Jonas Geiping, Hengduo Li, Tom Goldstein, Ser-Nam Lim

What

This paper frames object recognition as a next token prediction problem: a language decoder auto-regressively predicts object labels from image embeddings.

Why

The paper offers an open-vocabulary recognition method that needs no predefined object labels or descriptions, unlike traditional linear classifiers and contrastive frameworks. It also introduces a one-shot sampling method for generating label tokens in parallel and a compact decoder that improves efficiency.

How

The authors use a pretrained CLIP image encoder to produce image embeddings and a truncated language decoder (derived from LLaMA) to predict labels auto-regressively. A customized non-causal attention mask makes tokens from different labels independent of one another and treats image tokens as a prefix (a sketch follows below). The one-shot sampling method generates the tokens of multiple labels in parallel, and the compact decoder further improves efficiency. The model is trained on large-scale image-caption pairs and evaluated with a semantic-similarity-based metric.
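A minimal sketch of how such a mask could be built, assuming a boolean mask where True marks an allowed attention edge; the function name and token layout are illustrative assumptions, not the authors' code:

```python
import torch

def build_noncausal_mask(num_image_tokens: int, label_lengths: list[int]) -> torch.Tensor:
    """Boolean mask (True = may attend). Image tokens form a fully visible
    prefix; each label's tokens attend causally within their own label and
    never to tokens of other labels."""
    total = num_image_tokens + sum(label_lengths)
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Image tokens: bidirectional attention among themselves (the prefix).
    mask[:num_image_tokens, :num_image_tokens] = True

    # Every label token can see the entire image prefix.
    mask[num_image_tokens:, :num_image_tokens] = True

    # Within each label: causal (lower-triangular); across labels: blocked.
    start = num_image_tokens
    for length in label_lengths:
        causal = torch.tril(torch.ones(length, length)).bool()
        mask[start:start + length, start:start + length] = causal
        start += length
    return mask

# e.g. 4 image tokens plus two labels of 2 and 3 tokens -> a 9x9 mask
mask = build_noncausal_mask(4, [2, 3])
```

Decoupling labels this way is what makes sampling multiple labels in parallel possible, since no label's tokens depend on another label's tokens.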

Result

One-shot sampling generates diverse labels in parallel and outperforms greedy decoding and beam search. The truncated language decoder matches the full model's performance while being significantly faster. The method surpasses existing open-vocabulary recognition approaches in recall and is competitive in precision, indicating that the generated labels are highly relevant.
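A schematic of the one-shot sampling idea, under stated assumptions: the decoder is abstracted as a first-token distribution plus a batched step function, the k labels are started from the top-k first tokens, and each label is then continued greedily. The interface and the greedy continuation are our simplifying assumptions, not the paper's exact procedure:

```python
import torch

@torch.no_grad()
def one_shot_sample(first_token_logits, step_fn, k=5, max_len=3):
    """Sample k labels in parallel and rank them by joint probability.
    `first_token_logits` has shape (vocab,); `step_fn(seqs)` returns
    next-token logits of shape (k, vocab). Both stand in for the decoder."""
    probs = first_token_logits.softmax(-1)
    top_p, top_ids = probs.topk(k)           # k parallel starting tokens
    seqs = top_ids.unsqueeze(-1)              # (k, 1)
    seq_p = top_p.clone()                     # running probability per label

    # Labels are independent under the non-causal mask, so all k sequences
    # can be extended together with one batched pass per step.
    for _ in range(max_len - 1):
        step_probs = step_fn(seqs).softmax(-1)        # (k, vocab)
        p, ids = step_probs.max(-1)                   # greedy within each label
        seqs = torch.cat([seqs, ids.unsqueeze(-1)], dim=-1)
        seq_p = seq_p * p

    order = seq_p.argsort(descending=True)            # rank labels
    return seqs[order], seq_p[order]

# Toy usage with a random stand-in "decoder":
vocab = 100
labels, scores = one_shot_sample(torch.randn(vocab),
                                 lambda s: torch.randn(s.size(0), vocab))
```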

Limitations & Future Work

The authors acknowledge limitations in training data quality and evaluation metrics. Suggested future directions include training models with fewer labels, refining how a label is defined, developing better evaluation metrics, and adapting the approach to fine-grained recognition tasks.

Abstract

We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in auto-regression, we customize a non-causal attention mask for the decoder, incorporating two key features: modeling tokens from different labels to be independent, and treating image tokens as a prefix. This masking mechanism inspires an efficient method - one-shot sampling - to simultaneously sample tokens of multiple labels in parallel and rank generated labels by their probabilities during inference. To further enhance the efficiency, we propose a simple strategy to construct a compact decoder by simply discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model’s performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp
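The truncation strategy described in the abstract can be pictured with a short sketch. It assumes the decoder keeps its transformer blocks in a `model.layers` ModuleList, as LLaMA-style implementations commonly do; the attribute name and split sizes are illustrative assumptions, not the paper's exact configuration:

```python
import torch.nn as nn

def truncate_decoder(model, keep_first=3, keep_last=1):
    """Discard the intermediate transformer blocks of a pretrained decoder,
    keeping the first `keep_first` and last `keep_last` blocks."""
    blocks = list(model.layers)
    model.layers = nn.ModuleList(blocks[:keep_first] + blocks[-keep_last:])
    return model

# Toy usage: a stand-in "decoder" with 8 identity blocks -> 4 blocks kept.
toy = nn.Module()
toy.layers = nn.ModuleList(nn.Identity() for _ in range(8))
toy = truncate_decoder(toy, keep_first=3, keep_last=1)
assert len(toy.layers) == 4
```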