Du et al. write: “The shortcut learning behavior has significantly affected the robustness of [large language models, LLMs].” Predictions in these models “rely on dataset artifacts and biases within the hypothesis sentence.” LLMs such as GPT-3 and T5 use a prompt-based training paradigm, in which a snippet of text is provided to the LLM as input and the model is expected to produce a relevant completion; for example, given the snippet “Hello,” the model might complete it as “Hello, how can I help you?”
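To make the prompt-completion setup concrete, here is a minimal sketch of the idea, assuming the Hugging Face transformers library and GPT-2 as a small, freely available stand-in for the much larger models discussed in the article:

```python
# A minimal sketch of prompt-based completion (an illustration, not the
# article's code), using GPT-2 via Hugging Face transformers as a stand-in
# for models such as GPT-3 or T5.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# The snippet ("prompt") is the input; the model produces a plausible continuation.
result = generator("Hello,", max_new_tokens=10, num_return_sequences=1)
print(result[0]["generated_text"])  # prints the prompt plus a model-generated continuation
```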
This article presents a comprehensive review of “the shortcut learning problem in the pre-training and fine-tuning training paradigm of medium-sized language models (typically with less than a billion parameters).” A model’s lack of robustness is attributed to biases in the training data: “shortcut” refers to a model relying on non-robust features rather than capturing robust features and high-level semantics. Such non-robust features help the model generalize to development and test sets only as long as the patterns in the new data resemble those in the data the model was trained on.
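As a toy illustration of how a non-robust feature can look useful in distribution and fail out of distribution, consider a hypothesis-only “model” that predicts contradiction whenever the hypothesis contains a negation word. The rule and the examples below are invented for illustration and are not taken from the article:

```python
# Toy sketch of a shortcut: a hypothesis-only rule that treats the word "not"
# as evidence of contradiction. The artifact holds on the IID-style examples
# below but breaks on the OOD-style examples, so accuracy collapses.

def shortcut_predict(hypothesis: str) -> str:
    # Non-robust feature: presence of a negation word in the hypothesis only
    return "contradiction" if "not" in hypothesis.lower().split() else "entailment"

# (premise, hypothesis, gold label); in the IID set, "not" co-occurs with contradiction
iid_examples = [
    ("A dog runs in the park.", "The dog is not moving.", "contradiction"),
    ("A woman plays guitar.", "A woman plays an instrument.", "entailment"),
]

# In the OOD set, the spurious correlation between "not" and the label is broken
ood_examples = [
    ("A man is sleeping.", "The man is not awake.", "entailment"),
    ("A child eats an apple.", "The child eats a banana.", "contradiction"),
]

def accuracy(examples):
    correct = sum(shortcut_predict(h) == gold for _, h, gold in examples)
    return correct / len(examples)

print(f"IID accuracy: {accuracy(iid_examples):.2f}")  # 1.00 -- the shortcut looks good
print(f"OOD accuracy: {accuracy(ood_examples):.2f}")  # 0.00 -- the shortcut fails
```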
Comparisons between LLMs of similar architecture but different sizes, for example, BERT-base with BERT-large and RoBERTa-base with RoBERTa-large, show that the large versions consistently generalize better than the base versions, with a smaller accuracy gap between out-of-distribution (OOD) and independent and identically distributed (IID) test data. This shows that “smaller models are more prone to capture spurious patterns and are more dependent on data artifacts for prediction.” Standard training procedures are biased toward learning simple features, referred to as simplicity bias, and remain invariant to complex predictive features. The authors explain:
Models tend to learn non-robust and easy-to-learn features at the early stage of training. For example, reading comprehension models have learned the shortcut in the first few training iterations, which has influenced further exploration of the models for more robust features.
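This early preference for easy features can be mimicked in a toy experiment. The sketch below (an invented numpy illustration, not the authors’ setup) trains a logistic regression by gradient descent on synthetic data with one strong, clean feature and one weak, noisy feature; the weight on the easy feature grows almost immediately, while the weight on the harder feature lags behind:

```python
# Toy numpy sketch of early-training dynamics: gradient descent latches onto
# the "easy" feature (strong, clean correlate of the label) within a few
# updates, while the weak, noisy feature is learned much more slowly.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.choice([-1.0, 1.0], size=n)               # labels in {-1, +1}
easy = 3.0 * y + rng.normal(0.0, 0.5, size=n)     # easy, shortcut-like feature
hard = 0.3 * y + rng.normal(0.0, 1.0, size=n)     # weak "robust" feature
X = np.stack([easy, hard], axis=1)

w = np.zeros(2)
lr = 0.1
for step in range(1, 21):
    margins = y * (X @ w)
    # Gradient of the mean logistic loss: -E[ y * x * sigmoid(-y * w.x) ]
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= lr * grad
    if step in (1, 5, 20):
        print(f"step {step:2d}: w_easy = {w[0]:.3f}, w_hard = {w[1]:.3f}")
```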
Additionally, “the present LLM training methods can be considered as data-driven, corpus-based, statistical, and machine-learning approaches.” While a data-driven approach may be good for certain natural language processing (NLP) tasks, “it falls short in relevance to the challenging NLU tasks that necessitate a deeper understanding of natural language.” IID performance is on par with human performance, but OOD performance is far below both; “debiased algorithms are thought to achieve better generalization because they can learn more robust features than biased models.” A study on the robustness of large prompt-based language models such as GPT-3 and GPT-2 found that LLMs are susceptible to majority label bias and position bias, that is, “they tend to predict answers based on the frequency or position of the answers in the training data.”
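A rough way to probe such prompt-level biases is sketched below. It builds a few-shot prompt whose demonstrations are skewed three-to-one toward the “positive” label and compares the scores the model assigns to each label token for a deliberately neutral query. The prompt, labels, and use of GPT-2 via Hugging Face transformers are illustrative assumptions, not the cited study’s actual protocol:

```python
# A minimal sketch (not the study's code) of probing majority label bias in
# few-shot prompting: the demonstrations are skewed toward "positive", and we
# check which label token the model favors for a neutral query.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# Few-shot prompt with a 3-to-1 majority of "positive" demonstrations.
prompt = (
    "Review: I loved this film. Sentiment: positive\n"
    "Review: A delightful surprise. Sentiment: positive\n"
    "Review: Wonderful acting throughout. Sentiment: positive\n"
    "Review: It was a waste of time. Sentiment: negative\n"
    "Review: The movie was okay, nothing special. Sentiment:"
)

with torch.no_grad():
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    logits = model(input_ids).logits[0, -1]  # next-token logits

# Compare the logits assigned to the first token of each label string.
for label in [" positive", " negative"]:
    token_id = tokenizer.encode(label)[0]
    print(f"{label.strip():>8}: logit = {logits[token_id].item():.2f}")
```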
The article concludes that the current standard of data-driven training results in models that perform low-level pattern recognition, which is useful for low-level NLP tasks. For more difficult natural language understanding (NLU) tasks, it is necessary to introduce “more inductive bias into the model architecture to improve robustness and generalization beyond IID benchmark datasets,” as well as “more human-like common sense knowledge into the model training.” Furthermore, “the current pure data-driven training paradigm for LLMs is insufficient for high-level natural language understanding.” To achieve that, a future data-driven paradigm “should be combined with domain knowledge at every stage of model design and evaluation.”
Overall, the article provides a unique and precise review of the current state of research in NLU and LLMs.