NEFTune: Noisy Embeddings Improve Instruction Finetuning

Authors: Neel Jain, Ping-yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartoldson, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, Tom Goldstein

What

This paper introduces NEFTune, a simple yet effective technique for improving instruction fine-tuning of large language models (LLMs) by adding noise to embedding vectors during training.

Why

This paper is important because it presents a novel approach to enhance the performance of instruction-tuned LLMs, addressing the critical need for efficient use of limited instruction datasets in LLM training.

How

The authors employ NEFTune, which involves adding scaled uniform noise to the embedding vectors during the forward pass of fine-tuning. They evaluate NEFTune’s impact on various LLM architectures, including LLaMA-1, LLaMA-2, and OPT, using different instruction-tuning datasets like Alpaca, Evol-Instruct, ShareGPT, and OpenPlatypus. The evaluation leverages AlpacaEval and OpenLLM Leaderboard tasks to assess the conversational quality and factual accuracy of the models.
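In practice, the noise injection can be implemented as a forward hook on the model's input embedding layer: during training, uniform noise scaled by alpha / sqrt(L * d) (L = sequence length, d = embedding dimension) is added to the embedding output. Below is a minimal PyTorch sketch, assuming a Hugging Face style model exposing get_input_embeddings(); the attribute name neftune_alpha and the fixed alpha = 5.0 are illustrative choices (the paper experiments with alpha in {5, 10, 15}), and padding tokens are ignored for simplicity.

    import math
    import torch
    from torch import nn

    def neftune_forward_hook(module: nn.Module, inputs, output: torch.Tensor) -> torch.Tensor:
        """Add scaled uniform noise to token embeddings during training (NEFTune-style).

        Noise is sampled uniformly and scaled by alpha / sqrt(L * d), where L is the
        sequence length and d is the embedding dimension. At eval time the hook is a no-op.
        """
        if module.training:
            batch, seq_len, dim = output.shape
            scale = module.neftune_alpha / math.sqrt(seq_len * dim)
            noise = torch.zeros_like(output).uniform_(-scale, scale)
            output = output + noise
        return output

    # Usage sketch: attach the hook to the model's token embedding layer.
    # `neftune_alpha` is an attribute name chosen here for illustration.
    # embed = model.get_input_embeddings()
    # embed.neftune_alpha = 5.0
    # embed.register_forward_hook(neftune_forward_hook)

Because the noise is applied only in the forward pass of training, inference is unchanged and the method adds no cost at deployment time.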

Result

NEFTune significantly improves the performance of LLMs across different model sizes and datasets, leading to more fluent and informative responses. Notably, NEFTune yields an average improvement of 15% in AlpacaEval win rate. Additionally, the authors find that NEFTune helps mitigate overfitting to the instruction datasets, allowing the models to generalize better and generate more human-like responses.

Limitations & Future Work

The authors acknowledge limitations such as reliance on AlpacaEval and limited computational resources for evaluating larger models. Future work includes exploring the impact of NEFTune on model safety and reliability, investigating its effectiveness with larger model variants (e.g., 70B parameters) across multiple datasets, and gaining a deeper understanding of the underlying mechanisms by which NEFTune improves performance.

Abstract

We show that language model finetuning can be improved, sometimes dramatically, with a simple augmentation. NEFTune adds noise to the embedding vectors during training. Standard finetuning of LLaMA-2-7B using Alpaca achieves 29.79% on AlpacaEval, which rises to 64.69% using noisy embeddings. NEFTune also improves over strong baselines on modern instruction datasets. Models trained with Evol-Instruct see a 10% improvement, with ShareGPT an 8% improvement, and with OpenPlatypus an 8% improvement. Even powerful models further refined with RLHF such as LLaMA-2-Chat benefit from additional training with NEFTune.