If you work in NLP, it's important to keep up to date with the latest research. In this post, we look at some of the best papers on NLP that were published in November 2022!
- This roundup highlights some interesting NLP papers from November 2022 around language model capabilities.
This article’s title and TL;DR have been generated with Cohere.
Language models are evolving at a rapid pace, and every month we discover new capabilities. Large language models, like those built by Cohere, are being applied to use cases that we couldn’t have imagined even just a few months ago.
In this roundup, we highlight some exciting papers on natural language processing, our work from Cohere For AI and Cohere’s technical staff, along with our involvement at NeurIPS 2022. Topics for this month include different prompting methods for understanding dialogue and humor, use cases like summarization and essay scoring, and what language models learn beyond language.
Have fun reading these! For feedback, please let us know on our Discord community—we’d love to hear from you.
Exciting Papers of the Month
Authors: Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, Colin Raffel
There is a wealth of information on the internet, from birthdays of historical figures to tutorials on how to code, that language models can learn. However, the number of times a given piece of information appears online varies widely. In this paper, the authors examine the relationship between the knowledge that language models memorize and the information in their pre-training datasets.
Relevant documents are identified by entity linking pre-training datasets and counting documents containing the same entities as a given question-answer pair. In the study, they found strong correlations between accuracy and document count for numerous question-answering datasets.
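The counting step can be illustrated with a small sketch. It assumes entity linking has already been run, so every document and every question-answer pair is represented as a set of entity IDs. The function name, the matching rule (at least one question entity and one answer entity in the same document), and the toy data are all assumptions for illustration, not details taken from the paper.

```python
from typing import Iterable, Set


def count_relevant_documents(
    doc_entities: Iterable[Set[str]],
    question_entities: Set[str],
    answer_entities: Set[str],
) -> int:
    """Count documents that mention a question entity and an answer entity together."""
    count = 0
    for entities in doc_entities:
        # Assumed rule: a document is relevant if both sets overlap with it.
        if entities & question_entities and entities & answer_entities:
            count += 1
    return count


# Toy corpus: each document is the set of entities a linker found in it.
corpus = [
    {"Dante_Alighieri", "Florence"},
    {"Dante_Alighieri", "Divine_Comedy"},
    {"Florence", "Italy"},
    {"Paris", "France"},
]

# Hypothetical QA pair: "In what city was Dante born?" with answer "Florence".
n_docs = count_relevant_documents(corpus, {"Dante_Alighieri"}, {"Florence"})
print(n_docs)
```

With a count like this for every question, accuracy can then be plotted or correlated against document count, which is the kind of relationship the study reports.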
Authors: Leshem Choshen, Elad Venezian, Shachar Don-Yehia, Noam Slonim, Yoav Katz
In previous studies, finetuned models were sometimes found to be better base models than vanilla pretrained ones. By first finetuning a model on a source dataset, one may get a better starting point for finetuning on a target dataset. In this paper, the authors analyze this intertraining scheme over a wide range of English classification tasks.
The authors find that the potential intertraining gain can be analyzed independently for each target dataset under consideration and for each base model used as a starting point. This contrasts with the common perception that alignment between the target dataset and the source dataset used to generate the base model is a major determinant of intertraining success.
Authors: Haihao Shen, Ofir Zafrir, Bo Dong, Hengyu Meng, Xinyu Ye, Zhe Wang, Yi Ding, Hanwen Chang, Guy Boudoukh, Moshe Wasserblat
The standard approach to solving natural language processing tasks is Transformer-based language models. However, industry adoption usually requires maximum throughput under certain latency constraints, which often prevents Transformer models from being used in production. Model compression techniques such as quantization and pruning may be used to fill this gap.
Even so, applying and deploying these compression techniques at scale requires specialized software.
The authors propose a new pipeline for creating and running Fast Transformer models on CPUs, using hardware-aware pruning, knowledge distillation, quantization, and a Transformer runtime engine.
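To make two of these compression ideas concrete, here is a toy sketch of unstructured magnitude pruning and symmetric 8-bit quantization on a plain list of weights. It is not the authors' pipeline, which relies on hardware-aware pruning, distillation, and a dedicated runtime. The function names and weight values are made up for illustration.

```python
from typing import List, Tuple


def magnitude_prune(weights: List[float], sparsity: float) -> List[float]:
    """Zero out the smallest-magnitude weights until `sparsity` of them are removed."""
    n_prune = int(len(weights) * sparsity)
    # Indices ordered by absolute value, smallest first.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = list(weights)
    for i in order[:n_prune]:
        pruned[i] = 0.0
    return pruned


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric quantization: integers in [-127, 127] plus one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map the integers back to approximate float weights."""
    return [v * scale for v in quantized]


weights = [0.8, -0.05, 0.3, 0.01, -1.27, 0.6]
sparse = magnitude_prune(weights, sparsity=0.5)  # half of the weights become 0.0
quantized, scale = quantize_int8(sparse)
restored = dequantize(quantized, scale)
print(sparse, quantized, restored)
```

Pruning lets a runtime skip work on zeroed weights, and quantization shrinks each weight to one byte. Both trade a small amount of precision for speed and memory, which is why real pipelines usually pair them with distillation or further finetuning.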
Authors: Hattie Zhou, Azade Nova, Hugo Larochelle, Aaron Courville, Behnam Neyshabur, Hanie Sedghi
Through scaling up model and data size, large language models have demonstrated increasing in-context learning capabilities. However, algorithmic reasoning problems are still difficult for LLMs to solve. Although providing a rationale with the final answer has improved performance on multi-step reasoning problems, even simple algorithmic reasoning tasks such as parity are far from being solved.
This work identifies and studies a four-stage approach to teaching algorithmic reasoning to LLMs: (1) formulating algorithms as skills, (2) teaching multiple skills simultaneously, (3) teaching skill composition, and (4) teaching skill utilization. Teaching through in-context learning, which the authors call algorithmic prompting, is shown to be an effective way to teach algorithmic reasoning to LLMs.
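As a rough illustration of the idea, the sketch below builds a few-shot prompt for the parity task in which every worked example spells out each intermediate step. The prompt wording and function names are invented for this post and are not the format used in the paper.

```python
from typing import List


def parity_rationale(bits: List[int]) -> str:
    """Write out every step of computing parity, as a worked example for a prompt."""
    lines = [f"Input: {' '.join(map(str, bits))}", "Start with parity 0."]
    parity = 0
    for i, bit in enumerate(bits, start=1):
        new_parity = parity ^ bit  # XOR flips the running parity on every 1
        lines.append(f"Step {i}: bit is {bit}, parity {parity} -> {new_parity}")
        parity = new_parity
    lines.append(f"Answer: {parity}")
    return "\n".join(lines)


def build_prompt(examples: List[List[int]], query: List[int]) -> str:
    """Join the worked examples, then leave the query open for the model to complete."""
    shots = [parity_rationale(example) for example in examples]
    shots.append(f"Input: {' '.join(map(str, query))}\nStart with parity 0.")
    return "\n\n".join(shots)


prompt = build_prompt([[1, 0, 1], [1, 1, 1, 0]], query=[0, 1, 1, 0, 1])
print(prompt)
```

The resulting string can be sent to any text generation model. The point of the explicit steps is to leave no ambiguity about the procedure, so the model can carry it over to longer inputs than the ones shown.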
Authors: Rishi Bommasani, Percy Liang, Tony Lee
Language models generate both excitement and fear, so we must be measured in how we respond to them. To better understand the technology and its societal impact, we need to know what it can and can't do, as well as what risks it poses. A vital first step towards these two goals is transparency.
A new benchmarking method, Holistic Evaluation of Language Models (HELM), has been developed at the Center for Research on Foundation Models to help provide transparency in language modeling. Through collaboration with the broader community, HELM intends to serve as a map of the world of language models that is continuously updated over time.
Recent Work from Cohere For AI and Cohere Technical Staff
Authors: Minqi Jiang, Tim Rocktäschel, Edward Grefenstette
The field of artificial intelligence (AI) is poised to shift from learning from data to learning what data to use. Although the problem of learning from data is not completely solved, large models under unified architectures, such as transformers, have moved the learning bottleneck from training our models to acquiring and using relevant data.
In open-ended domains, such as the real world, exploration is a universal problem in learning. The authors argue that exploration is integral to all learning systems, including supervised learning, despite the fact that the study of exploration in AI has mostly focused on reinforcement learning.
By presenting the generalized exploration problem, which highlights key similarities across learning settings and research challenges, the authors conceptually link exploration-driven learning to reinforcement learning and supervised learning. The process of generalized exploration also provides a promising path to general intelligence by maintaining open-ended learning processes that constantly learn to solve new problems and discover new things.
Authors: Kelechi Ogueji, Orevaoghene Ahia, Gbemileke Onilude, Sebastian Gehrmann, Sara Hooker, Julia Kreutzer
In order to generalize to an increasing number of languages, multilingual models often rely heavily on scaling. Compression techniques are used to reconcile model size growth with real world resource constraints, but compression can adversely affect the performance of low-resource languages. Therefore, understanding the tradeoffs between multilingualism, scale, and compression is crucial.
This study finds that compression confers several interesting and previously unknown generalization properties on mBERT named entity recognition models across 40 languages.
Contrary to prior findings, the authors found that compression can improve model robustness and, under certain sparsification regimes, can enhance low-resource language performance rather than harm it.
Cohere at NeurIPS 2022
We are thrilled to be part of NeurIPS this year! Make sure to visit us at booth #615. We’d love to meet you.
Authors: Jesse Mu, Victor Zhong, Roberta Raileanu, Minqi Jiang, Noah Goodman, Tim Rocktäschel, Edward Grefenstette
Reinforcement learning (RL) agents are particularly hard to train when rewards are sparse. One common solution is to use intrinsic rewards to encourage agents to explore their environment. However, recent intrinsic exploration methods often use state-based novelty measures which reward low-level exploration and may not scale to domains requiring more abstract skills. Instead, we explore natural language as a general medium for highlighting relevant abstractions in an environment. Unlike previous work, we evaluate whether language can improve over existing exploration methods by directly extending (and comparing to) competitive intrinsic exploration baselines: AMIGo (Campero et al., 2021) and NovelD (Zhang et al., 2021). These language-based variants outperform their non-linguistic forms by 47-85% across 13 challenging tasks from the MiniGrid and MiniHack environment suites.
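To give a feel for what rewarding novelty over language looks like, here is a minimal count-based bonus keyed on text descriptions of the environment rather than on raw states. This is a generic count-based scheme chosen for simplicity, not the AMIGo or NovelD variants from the paper, and the class name and messages are hypothetical.

```python
import math
from collections import Counter


class MessageNoveltyBonus:
    """Count-based intrinsic reward over language descriptions instead of raw states."""

    def __init__(self, scale: float = 1.0) -> None:
        self.scale = scale
        self.counts: Counter = Counter()

    def reward(self, message: str) -> float:
        """Return a bonus that shrinks as the same description is seen more often."""
        self.counts[message] += 1
        return self.scale / math.sqrt(self.counts[message])


bonus = MessageNoveltyBonus()
messages = ["you see a key", "you see a key", "you open the door"]
rewards = [bonus.reward(m) for m in messages]
print(rewards)
```

Because many different low-level states can map to the same description, a bonus like this rewards reaching new kinds of events rather than new pixel or grid configurations. In practice it would be added to the sparse environment reward during training.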
Authors: Victor Zhong, Jesse Mu, Luke Zettlemoyer, Edward Grefenstette, Tim Rocktäschel
Recent work has shown that augmenting environments with language descriptions improves policy learning. However, for environments with complex language abstractions, learning how to ground language to observations is difficult due to sparse, delayed rewards. We propose Language Dynamics Distillation (LDD), which pretrains a model to predict environment dynamics given demonstrations with language descriptions, and then fine-tunes these language-aware pretrained representations via reinforcement learning (RL).
In this way, the model is trained to both maximize expected reward and retain knowledge about how language relates to environment dynamics. On SILG, a benchmark of five tasks with language descriptions that evaluate distinct generalization challenges on unseen environments (NetHack, ALFWorld, RTFM, Messenger, and Touchdown), LDD outperforms tabula-rasa RL, VAE pretraining, and methods that learn from unlabeled demonstrations in inverse RL and reward shaping with pretrained experts. In our analyses, we show that language descriptions in demonstrations improve sample-efficiency and generalization across environments, and that dynamics modeling with expert demonstrations is more effective than with non-experts.
Authors: Minqi Jiang, Michael Dennis, Jack Parker-Holder, Andrei Lupu, Heinrich Küttler, Edward Grefenstette, Tim Rocktäschel, Jakob Foerster
Adaptive curricula in reinforcement learning (RL) have proven effective for producing policies robust to discrepancies between the train and test environment. Recently, the Unsupervised Environment Design (UED) framework generalized RL curricula to generating sequences of entire environments, leading to new methods with robust minimax regret properties. Problematically, in partially-observable or stochastic settings, optimal policies may depend on the ground-truth distribution over aleatoric parameters of the environment in the intended deployment setting, while curriculum learning necessarily shifts the training distribution. We formalize this phenomenon as curriculum-induced covariate shift (CICS), and describe how its occurrence in aleatoric parameters can lead to suboptimal policies. Directly sampling these parameters from the ground-truth distribution avoids the issue, but thwarts curriculum learning. We propose SAMPLR, a minimax regret UED method that optimizes the ground-truth utility function, even when the underlying training data is biased due to CICS. We prove, and validate on challenging domains, that our approach preserves optimality under the ground-truth distribution, while promoting robustness across the full range of environment settings.
Authors: Yingchen Xu, Jack Parker-Holder, Aldo Pacchiano, Philip J. Ball, Oleh Rybkin, Stephen J. Roberts, Tim Rocktäschel, Edward Grefenstette
Building generally capable agents is a grand challenge for deep reinforcement learning (RL). To approach this challenge practically, we outline two key desiderata: 1) to facilitate generalization, exploration should be task agnostic; 2) to facilitate scalability, exploration policies should collect large quantities of data without costly centralized retraining. Combining these two properties, we introduce the reward-free deployment efficiency setting, a new paradigm for RL research. We then present CASCADE, a novel approach for self-supervised exploration in this new setting. CASCADE seeks to learn a world model by collecting data with a population of agents, using an information theoretic objective inspired by Bayesian Active Learning. CASCADE achieves this by specifically maximizing the diversity of trajectories sampled by the population through a novel cascading objective. We provide theoretical intuition for CASCADE which we show in a tabular setting improves upon naïve approaches that do not account for population diversity. We then demonstrate that CASCADE collects diverse task-agnostic datasets and learns agents that generalize zero-shot to novel, unseen downstream tasks on Atari, MiniGrid, Crafter and the DM Control Suite. Code and videos are available on the project website.
If you’re working with large volumes of text, you may benefit greatly from incorporating large language models into your workflow. It may take some experimentation and tweaking to get the model to do exactly what you want, but these papers should give you an idea of how others go about it.