First Tragedy, then Parse: History Repeats Itself in the New Era of Large Language Models
Authors: Naomi Saphra, Eve Fleisig, Kyunghyun Cho, Adam Lopez
What
This paper examines the “scale crisis” in NLP research, in which the dominance of large language models (LLMs) trained on massive datasets calls into question the relevance of research from smaller groups. By reflecting on the history of statistical machine translation (SMT), which had its own era of large models in the form of massive n-gram language models, the authors argue that the current crisis is transient and propose research directions for making meaningful contributions even in the age of massive models.
Why
The paper addresses the widespread anxiety among NLP researchers about the impact of LLMs on the field. It provides historical context and practical guidance for navigating the challenges and opportunities of the current research landscape.
How
The authors analyze the trajectory of SMT, particularly the rise and fall of large n-gram models, and draw parallels to the current era of LLMs. From this historical analysis they distill durable lessons and identify evergreen research problems that remain relevant today.
Result
The paper argues that scale disparities are often temporary, as demonstrated by the eventual accessibility of large-scale SMT systems. It contends that data, rather than hardware, remains a significant bottleneck, especially for low-resource languages, and emphasizes the need for evaluation methods that capture model performance beyond simple benchmarks.
Limitations and Future Work
The paper acknowledges that it cannot predict the future of NLP research. It suggests that future work should focus on improving evaluation metrics, developing algorithms suited to future hardware, exploring new paradigms that might supersede current LLMs, and addressing ethical considerations around data bias and human evaluation.
Abstract
Many NLP researchers are experiencing an existential crisis triggered by the astonishing success of ChatGPT and other systems based on large language models (LLMs). After such a disruptive change to our understanding of the field, what is left to do? Taking a historical lens, we look for guidance from the first era of LLMs, which began in 2005 with large n-gram models for machine translation (MT). We identify durable lessons from the first era, and more importantly, we identify evergreen problems where NLP researchers can continue to make meaningful contributions in areas where LLMs are ascendant. We argue that disparities in scale are transient and researchers can work to reduce them; that data, rather than hardware, is still a bottleneck for many applications; that meaningful realistic evaluation is still an open problem; and that there is still room for speculative approaches.