RLIPv2: Fast Scaling of Relational Language-Image Pre-training

Authors: Hangjie Yuan, Shiwei Zhang, Xiang Wang, Samuel Albanie, Yining Pan, Tao Feng, Jianwen Jiang, Dong Ni, Yingya Zhang, Deli Zhao

What

This paper introduces RLIPv2, a fast-converging model for Relational Language-Image Pre-training (RLIP) that scales relational pre-training to large-scale pseudo-labelled scene graph data.

Why

This research addresses the limitations of RLIPv1, the previous iteration, which converged slowly and was constrained by the limited amount of annotated scene graph data. By enabling efficient scaling, RLIPv2 advances relational reasoning in computer vision, achieving state-of-the-art results on tasks such as Human-Object Interaction (HOI) detection and scene graph generation.

How

The authors introduce Asymmetric Language-Image Fusion (ALIF), which pairs sparsified language encoding layers with early, gated cross-modal fusion to speed up convergence. To generate large-scale pseudo-labelled scene graph data, they extend object detection datasets with BLIP-generated captions and a Relation Tagger built on the RLIPv2 architecture itself, which assigns caption-derived relation texts to region pairs. They conduct extensive experiments on HICO-DET, V-COCO, and Open Images v6 under zero-shot, few-shot, and fully fine-tuned settings.
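To make the asymmetric fusion idea more concrete, below is a minimal PyTorch sketch, assuming a design in which vision tokens pass through more encoder layers than language tokens and language context is injected into every vision layer via gated cross-attention. The class names (GatedCrossAttention, AsymmetricFusionEncoder), layer counts, hidden size, and the tanh-gated residual are illustrative assumptions, not the authors' implementation.

# A minimal sketch of asymmetric, gated cross-modal fusion (illustrative only).
import torch
import torch.nn as nn

class GatedCrossAttention(nn.Module):
    """Vision tokens attend to language tokens; a zero-initialised gate lets
    the fused path start as an identity mapping and open during training."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.gate = nn.Parameter(torch.zeros(1))  # gate starts closed

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query=self.norm(vis), key=lang, value=lang)
        return vis + torch.tanh(self.gate) * fused

class AsymmetricFusionEncoder(nn.Module):
    """Asymmetry: more vision layers than ("sparsified") language layers,
    with gated cross-modal fusion injected at every vision layer."""
    def __init__(self, dim: int = 256, vis_layers: int = 6, lang_layers: int = 2):
        super().__init__()
        self.vis_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(vis_layers))
        self.lang_layers = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            for _ in range(lang_layers))
        self.fusion = nn.ModuleList(
            GatedCrossAttention(dim) for _ in range(vis_layers))

    def forward(self, vis: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        for layer in self.lang_layers:          # few language encoding layers
            lang = layer(lang)
        for layer, fuse in zip(self.vis_layers, self.fusion):
            vis = fuse(layer(vis), lang)        # early fusion at every layer
        return vis

# Toy usage: 2 images, 100 vision tokens, 20 relation-text tokens, dim 256.
vis, lang = torch.randn(2, 100, 256), torch.randn(2, 20, 256)
print(AsymmetricFusionEncoder()(vis, lang).shape)  # torch.Size([2, 100, 256])

The zero-initialised gate is one common way to keep training stable when fusion is introduced early; the paper's exact gating form may differ.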

Result

RLIPv2 delivers superior performance across HOI detection and Scene Graph Generation (SGG) benchmarks. Notably, it achieves state-of-the-art results on Open Images v6 for SGG and strong zero-shot, few-shot, and fully fine-tuned results on HICO-DET, exceeding previous methods with substantially better data efficiency. For example, the largest RLIPv2 achieves 23.29 mAP on HICO-DET without fine-tuning, 32.22 mAP with just 1% of the data, and 45.09 mAP with the full data.

Limitations & Future Work

The authors acknowledge reliance on the quality of the external captioner as a limitation, since noisy captions can degrade the pseudo-labels and hence downstream performance. Future work includes exploring more advanced captioning techniques to obtain higher-quality pseudo-labels and investigating methods to handle complex scenes containing multiple similar objects.

Abstract

Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29mAP on HICO-DET without any fine-tuning, yields 32.22mAP with just 1% data and yields 45.09mAP with 100% data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2.
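To illustrate the pseudo-labelling step described above, the following sketch shows how caption-derived relation texts might be attached to detected region pairs. The names (Detection, RelationLabel, parse_triplets, tag_relations), the hard-coded relation list, and the string matching are simplified stand-ins: in the paper, triplets are parsed from BLIP captions and a learned Relation Tagger built on RLIPv2 performs the assignment, not string matching.

# A simplified sketch of relation pseudo-labelling from a caption (illustrative only).
from dataclasses import dataclass

@dataclass
class Detection:
    label: str                 # detector category, e.g. "person"
    box: tuple                 # (x1, y1, x2, y2)

@dataclass
class RelationLabel:
    subject: Detection
    object: Detection
    relation: str              # free-form relation text taken from the caption

def parse_triplets(caption):
    """Toy caption parser: finds one (subject, relation, object) triplet around
    a known relation word; real parsers return many free-form triplets."""
    words = [w for w in caption.lower().rstrip(".").split()
             if w not in {"a", "an", "the"}]
    for rel in ("riding", "holding", "wearing"):
        for i in range(1, len(words) - 1):
            if words[i] == rel:
                return [(words[i - 1], rel, words[i + 1])]
    return []

def tag_relations(dets, caption):
    """Attach each caption-derived relation text to every compatible box pair."""
    labels = []
    for subj_text, rel, obj_text in parse_triplets(caption):
        subjects = [d for d in dets if d.label == subj_text]
        objects = [d for d in dets if d.label == obj_text]
        labels += [RelationLabel(s, o, rel)
                   for s in subjects for o in objects if s is not o]
    return labels

# Toy usage: two detected boxes plus a BLIP-style caption.
dets = [Detection("person", (10, 10, 60, 120)), Detection("horse", (50, 30, 160, 130))]
for lab in tag_relations(dets, "A person riding a horse."):
    print(lab.subject.label, lab.relation, lab.object.label)   # person riding horse

In the actual pipeline, a learned tagger scores which region pair a relation text belongs to rather than relying on exact label matches, which is what allows it to scale to free-form relation descriptions.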