LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
Authors: Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapados, Siva Reddy
What
This paper introduces LLM2Vec, an unsupervised approach for converting large decoder-only language models (LLMs) into effective text encoders by enabling bidirectional attention, incorporating masked next token prediction training, and applying unsupervised contrastive learning.
Why
This paper is important because causal attention limits decoder-only LLMs on text embedding tasks, which need rich contextualized representations. LLM2Vec offers a simple and efficient way to overcome this limitation, allowing such models to match or surpass encoder-only models.
How
The authors develop LLM2Vec, a three-step approach consisting of: 1) enabling bidirectional attention in decoder-only LLMs, 2) adapting the models using masked next token prediction (MNTP) training, and 3) enhancing sequence representation learning through unsupervised contrastive learning with SimCSE. They apply LLM2Vec to three LLMs (Sheared-LLaMA-1.3B, Llama-2-7B-chat, and Mistral-7B-Instruct-v0.2) and evaluate their performance on word- and sequence-level tasks using benchmarks such as CoNLL-2003 and MTEB.
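For concreteness, here is a minimal PyTorch sketch of the three training objectives. The function names, masking rate, and toy tensors are illustrative assumptions on my part, not the authors' implementation; a real setup would plug these losses into a decoder-only LLM fine-tuned with LoRA, as described in the paper.

```python
# Sketch of the three LLM2Vec objectives, assuming a decoder-only model
# that returns per-token logits and hidden states. Toy tensors stand in
# for real model outputs.
import torch
import torch.nn.functional as F

def bidirectional_attention_mask(pad_mask: torch.Tensor) -> torch.Tensor:
    """Step 1: drop the causal mask -- every non-pad token may attend to
    every non-pad token (returns a batch x seq x seq boolean mask)."""
    return pad_mask.unsqueeze(1) & pad_mask.unsqueeze(2)

def mntp_loss(logits: torch.Tensor, orig_ids: torch.Tensor,
              masked_pos: torch.Tensor) -> torch.Tensor:
    """Step 2: masked next token prediction. The model is run on inputs
    whose masked positions were replaced by a mask token; the label for a
    masked token at position i is predicted from the logits at position
    i-1, matching the next-token head of a decoder-only LLM."""
    pred = logits[:, :-1, :]      # logits at positions 0..L-2
    target = orig_ids[:, 1:]      # original tokens at positions 1..L-1
    keep = masked_pos[:, 1:]      # only masked positions contribute
    return F.cross_entropy(pred[keep], target[keep])

def simcse_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                temperature: float = 0.05) -> torch.Tensor:
    """Step 3: unsupervised SimCSE. emb_a / emb_b are pooled embeddings of
    the same sentences under two independent dropout masks; the other
    sentences in the batch act as in-batch negatives."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    sim = emb_a @ emb_b.t() / temperature          # batch x batch cosines
    labels = torch.arange(sim.size(0), device=sim.device)  # diagonal = positives
    return F.cross_entropy(sim, labels)

# Toy usage with random tensors in place of real model outputs.
B, L, V, H = 4, 16, 100, 32
pad_mask = torch.ones(B, L, dtype=torch.bool)
attn = bidirectional_attention_mask(pad_mask)              # step 1
logits = torch.randn(B, L, V)
orig_ids = torch.randint(0, V, (B, L))
masked = torch.rand(B, L) < 0.2                            # ~20% masking
print(mntp_loss(logits, orig_ids, masked))                 # step 2
print(simcse_loss(torch.randn(B, H), torch.randn(B, H)))   # step 3
```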
Result
LLM2Vec-transformed models demonstrate substantial improvements on both word- and sequence-level tasks. Notably, they outperform strong encoder-only baselines on word-level tasks and achieve state-of-the-art results among unsupervised models on the MTEB benchmark. The authors also find that Mistral-7B performs surprisingly well with bidirectional attention even before any adaptation, suggesting it may have been pre-trained with some form of bidirectional attention.
Limitations and Future Work
The authors acknowledge two main limitations: the computational demands of training and running billion-parameter LLMs, and possible contamination of evaluation benchmarks by the models' pre-training data. Future work could explore more efficient training and inference, evaluation on newer benchmarks that rule out contamination, and extending LLM2Vec to languages beyond English.
Abstract
Large decoder-only language models (LLMs) are the state-of-the-art models on most of today’s NLP tasks and benchmarks. Yet, the community is only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 3 popular LLMs ranging from 1.3B to 7B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data. Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.
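For reference, step 3 follows the standard unsupervised SimCSE objective: each sequence in a batch of N is encoded twice with independent dropout masks to obtain pooled embeddings h_i and h_i^+, and the per-example loss is the in-batch InfoNCE loss (the notation below is mine, not the paper's):

$$
\ell_i = -\log \frac{\exp\big(\cos(h_i, h_i^{+}) / \tau\big)}{\sum_{j=1}^{N} \exp\big(\cos(h_i, h_j^{+}) / \tau\big)}
$$

where cos denotes cosine similarity, τ is a temperature hyperparameter, and the other sequences in the batch serve as negatives.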