Emerging Trends in Generative AI Research: A Selection of Recent Papers

Stay at the forefront of NLP advances with Cohere For AI's recent community-curated research 🔍🧠


TL;DR: Explore some of the top recent NLP papers, curated by Cohere For AI, covering topics like data pruning, mixture of experts, software portability, chain of thought reasoning, and more. Stay updated in the fast-evolving NLP field, and consider joining Cohere's research community.

C4AI research Discord community members highlighted these papers. We thank @alon, @mohamdy, @domenicrosati, @EIFY, Sara Hooker, and the rest of the Cohere For AI research community for participating!

The Data Provenance Initiative: A Large Scale Audit of Dataset Licensing & Attribution in AI

Authors: Shayne Longpre, Robert Mahari, Anthony Chen, Naana Obeng-Marnu, Damien Sileo, William Brannon, Niklas Muennighoff, Nathan Khazam, Jad Kabbara, Kartik Perisetla, Xinyi Wu, Enrico Shippole, Kurt Bollacker, Tongshuang Wu, Luis Villa, Sandy Pentland, Sara Hooker

TL;DR: This research audited over 1,800 AI datasets, revealing widespread miscategorization and omission of licenses, as well as a divide in the data available under different licensing conditions. The paper introduces tools like the Data Provenance Explorer to enhance transparency and legal compliance, addressing challenges in dataset attribution and responsible AI development.


This paper addresses the critical issue of data transparency and legality in the training of language models. The authors conducted a comprehensive audit of over 1,800 text datasets and developed tools and standards to trace their lineage, including source creators, license conditions, properties, and usage. Called the Data Provenance Initiative, this audit reveals a significant divide in the data types available under different licensing conditions, particularly between commercially open and closed datasets. Closed datasets tend to monopolize more diverse and creative sources, including lower-resource languages and newer, synthetic training data.

A key issue identified is the frequent miscategorization of licenses on popular dataset hosting sites, with omissions of over 70% and error rates above 50%. This data attribution and licensing crisis complicates the responsible use of datasets in AI development. The paper introduces tools like the Data Provenance Explorer and Data Provenance Cards to assist practitioners in navigating these complexities, enhancing dataset transparency and legal compliance.

The authors’ empirical analysis shows a wide diversity in license types, with a high prevalence of “unspecified” licenses on crowdsourced aggregators. They also highlight the challenges posed by restrictive licenses requiring attribution and share-alike clauses, often leading to misattribution in practice. 

The paper underscores the importance of accurate data documentation and attribution in AI development, promoting responsible and legally sound practices.

The DPCollection annotation pipeline uses human and human-assisted procedures to annotate dataset Identifiers, Characteristics, and Provenance. (source:

Which Prompts Make the Difference? Data Prioritization for Efficient Human LLM Evaluation

Authors: Meriem Boubdir, Edward Kim, Beyza Ermis, Marzieh Fadaee, Sara Hooker

TL;DR: This paper presents a method to enhance the efficiency of human evaluations of large language models (LLMs) by using metrics like KL divergence and Cross-Entropy to prioritize prompts, resulting in up to 54 percent fewer tie outcomes and improved Elo score robustness, thus reducing time and cost.


The paper addresses the challenge of efficiently evaluating large language models (LLMs) through human annotation. Traditional evaluation metrics often fail to capture the nuances of natural language, necessitating human evaluation. However, this process is resource-intensive in terms of time and cost. The study focuses on reducing the required human annotations by prioritizing prompts that effectively differentiate between models. Prioritization occurs by employing metrics like KL divergence and Cross-Entropy to rank prompts based on their ability to generate decisive preference outcomes, thus reducing indecisive (or “tie”) outcomes in evaluations.

The methodology involves comparing model completions for each prompt pairwise, assessing their dissimilarity using these metrics. Then, the prompts are arranged in order of increasing tie likelihood. The paper details the experimental setup, including prompt and model selection, completion generation, and human annotation collection.
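
As a rough sketch of the ranking step described above, the snippet below scores each prompt by the KL divergence between two models' output distributions and sorts prompts so that the most dissimilar pairs come first. It assumes each model exposes a probability distribution over a shared set of outcomes for every prompt; the function names and the toy distributions are hypothetical and are not taken from the paper.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same support."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def prioritize_prompts(prompt_dists):
    """Return prompt ids sorted from most to least dissimilar model outputs.

    prompt_dists maps a prompt id to a pair (dist_model_a, dist_model_b).
    A higher divergence is treated here as a lower likelihood of a tie.
    """
    scores = {pid: kl_divergence(a, b) for pid, (a, b) in prompt_dists.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy distributions (made up for illustration only).
toy = {
    "p1": ([0.5, 0.5], [0.5, 0.5]),  # identical outputs, likely tie
    "p2": ([0.9, 0.1], [0.1, 0.9]),  # very different outputs
    "p3": ([0.6, 0.4], [0.5, 0.5]),  # slightly different outputs
}
ranking = prioritize_prompts(toy)
```

Annotators would then work through `ranking` from the front, so the prompts most likely to yield a decisive preference are labeled first.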

Results show that this method effectively reduces tie outcomes by up to 54% in the top 20% of prioritized prompts compared to random selection. It also enhances the stability of Elo scores, a metric for evaluating performance in zero-sum games, thus reducing reliance on extensive human annotations. 
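
To see why tie outcomes carry little signal for Elo-based rankings, here is a minimal sketch of the standard Elo update. The summary above does not give the formulas used in the paper; the K-factor of 32 and the starting ratings are assumptions chosen for illustration.

```python
def expected_score(rating_a, rating_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Update both ratings after one comparison.

    score_a is 1.0 if A is preferred, 0.5 for a tie, 0.0 if B is preferred.
    """
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


# A tie between equally rated models leaves both ratings unchanged,
# while a decisive preference moves them apart.
tie = elo_update(1000.0, 1000.0, 0.5)
win = elo_update(1000.0, 1000.0, 1.0)
```

Under these assumptions, an annotation that ends in a tie between closely matched models costs the same as a decisive one but barely changes the ranking.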

The study demonstrates significant efficiency gains, particularly when comparing models within the same family. However, it acknowledges limitations, such as the potential for bias by over-representing certain challenges while under-representing consistent model outputs.

Comparison of a prompt, used to rank completions from models A and B. (source:

Goodtriever: Adaptive Toxicity Mitigation with Retrieval-Augmented Models

Authors: Luiza Pozzobon, Beyza Ermis, Patrick Lewis, Sara Hooker

TL;DR: GOODTRIEVER is a method for mitigating toxicity in language models using retrieval-augmented techniques. It adapts efficiently to evolving language and toxicity, significantly reducing inference time without compromising performance. The approach remains effective across various models and in continual toxicity mitigation, with future work aimed at multilingual contexts.


This paper introduces a novel method, GOODTRIEVER, for mitigating toxicity in language models (LMs). This approach addresses the limitations of existing methods, which are computationally intensive and require significant modifications to model parameters. GOODTRIEVER stands out due to its adaptability to the evolving nature of language and toxicity, leveraging retrieval-augmented models to control text generation based on desired attributes.

GOODTRIEVER incorporates two external datastores containing toxic and non-toxic examples. During inference, it combines the next-token probabilities from the language model with those from the datastores, using a Product of Experts (PoE) method to control the impact of datastores on the final output. This approach ensures that a token is selected based on its probability of being non-toxic according to both the language model and the datastores.
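
The combination step can be sketched as follows. This is an illustrative product-of-experts formulation that assumes the final score of each token is the language model's log-probability plus a weighted difference between the non-toxic and toxic datastore log-probabilities; the exact form, the `alpha` parameter, and the toy numbers are assumptions and are not confirmed by the summary above.

```python
import math


def product_of_experts(p_lm, p_nontoxic, p_toxic, alpha=1.0, eps=1e-9):
    """Combine next-token distributions from the LM and two datastores.

    alpha controls how strongly the datastores influence the result;
    alpha = 0 recovers the language model's own distribution.
    """
    scores = [
        math.log(lm + eps) + alpha * (math.log(good + eps) - math.log(bad + eps))
        for lm, good, bad in zip(p_lm, p_nontoxic, p_toxic)
    ]
    # Softmax over the combined scores, shifted for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy three-token vocabulary; token 0 is favored by the toxic datastore.
p_lm = [0.5, 0.3, 0.2]
p_nontoxic = [0.1, 0.6, 0.3]
p_toxic = [0.8, 0.1, 0.1]
combined = product_of_experts(p_lm, p_nontoxic, p_toxic, alpha=1.0)
```

In this toy case, the token favored by the toxic datastore loses most of its probability mass, while tokens favored by the non-toxic datastore gain it.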

The paper evaluates GOODTRIEVER’s effectiveness across various model sizes and families, demonstrating its efficiency in mitigating toxicity without significant computational overhead. It shows a 43% reduction in inference time compared to state-of-the-art methods while maintaining comparable performance in toxicity mitigation. GOODTRIEVER is tested against various baselines and in different settings, including continual toxicity mitigation, where it adapts to new types of toxicity while maintaining effectiveness for previously encountered domains.

Despite its effectiveness, the paper acknowledges limitations, including reliance on the subjective definitions of toxicity and the potential for bias amplification. The method’s adaptability to multilingual and multicultural contexts is also highlighted as an area for future development.

An illustration of GOODTRIEVER. (source:

Locally Differentially Private Document Generation Using Zero Shot Prompting

Authors: Saiteja Utpala, Sara Hooker, Pin Yu Chen

TL;DR: While LLMs pose numerous privacy risks, this paper develops a method for using LLMs to preserve privacy. The authors introduce a mechanism called DP-Prompt that mitigates the problem of author identification. In short, DP-Prompt generates a sanitized version of the original article that closely resembles it but offers stronger differential privacy protection. In other words, the generated article carries the same meaning, but its author is harder to identify through a deanonymization attack.


Given the large amounts of available data and the power of LLMs, machine learning models can identify the authors of articles from linguistic patterns characteristic of each individual author. Datasets have repeatedly been anonymized by removing personally identifiable data, yet attackers have still been able to identify several individuals from their writing style. In this paper, the authors address this problem by proposing a framework called DP-Prompt, whose privacy protection is quantified using Differential Privacy (DP), the de facto standard for quantifying the privacy of a dataset.

DP-Prompt works as follows. An LLM is prompted to generate paraphrases, which are then released as sanitized documents. There are two reasons for picking this procedure.

  1. Paraphrasing has been shown to be a robust defense mechanism against deanonymization attacks.
  2. Pretrained large language models have shown strength in tackling complex tasks without the need for task-specific and expensive fine-tuning.
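
The summary above does not spell out where the differential privacy guarantee comes from. One standard way to obtain such a guarantee when sampling tokens is the exponential mechanism: clip the logits to a bounded range and sample at a fixed temperature. The sketch below illustrates that idea only, under the assumption that each paraphrase token is sampled this way; the function names, clipping bounds, and temperature are hypothetical.

```python
import math
import random


def clipped_temperature_sample(logits, low, high, temperature, rng):
    """Sample a token index from clipped logits at the given temperature."""
    clipped = [min(max(u, low), high) for u in logits]
    scaled = [u / temperature for u in clipped]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def per_token_epsilon(low, high, temperature):
    """Privacy cost of one sampled token under the exponential mechanism.

    With utilities bounded in [low, high], the sensitivity is (high - low),
    and sampling proportionally to exp(u / T) gives epsilon = 2 * (high - low) / T.
    """
    return 2.0 * (high - low) / temperature


rng = random.Random(0)  # fixed seed so the example is repeatable
token = clipped_temperature_sample(
    [5.0, -3.0, 0.5], low=-1.0, high=1.0, temperature=2.0, rng=rng
)
eps = per_token_epsilon(low=-1.0, high=1.0, temperature=2.0)
```

Under this assumption, a higher temperature or a narrower clipping range lowers the per-token privacy cost, at the price of paraphrases that drift further from the original text.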

DP-Prompt preserves the sentiment of the original document (measured using F1-score) while vastly reducing the accuracy of author deanonymization attacks. The authors show this through experiments with six open source models, ranging up to seven billion parameters.

The DP-Prompt pipeline. (source:

Sheared LLaMA: Accelerating Language Model Pre-Training Via Structured Pruning

Authors: Mengzhou Xia, Tianyu Gao, Zhiyuan Zeng, Danqi Chen

TL;DR: This paper presents LLM-shearing, a method for efficiently creating smaller, competitive large language models (LLMs) using structured pruning and dynamic batch loading. It successfully prunes a 7B parameter model into smaller versions that outperform similar-sized models, demonstrating cost-effectiveness and potential scalability for future model development.


This paper introduces an innovative approach to developing smaller yet powerful large language models (LLMs) through structured pruning, a cost-effective alternative to training models from scratch. The method, called LLM-shearing, employs two key strategies.

  • Targeted Structured Pruning: This novel algorithm prunes a larger pre-trained model to a specified target architecture, efficiently removing redundant components like layers, heads, and dimensions. This targeted pruning preserves model performance while achieving a compact, efficient structure.
  • Dynamic Batch Loading: To address the imbalanced learning rates across different domains in the pruned models, dynamic batch loading adjusts the training data composition based on the loss reduction rate in each domain. This technique improves training efficiency and accelerates overall performance improvement.
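
The second strategy can be illustrated with a small sketch. It assumes that each domain's sampling weight is scaled up exponentially by how far its current loss sits above a reference loss and then renormalized; this update rule, the function name, and the toy numbers are assumptions for illustration rather than the paper's exact procedure.

```python
import math


def update_domain_weights(weights, current_loss, reference_loss):
    """Shift sampling weight toward domains that lag behind their reference loss."""
    # Only domains whose loss is still above the reference get a boost.
    gaps = [max(cur - ref, 0.0) for cur, ref in zip(current_loss, reference_loss)]
    unnormalized = [w * math.exp(g) for w, g in zip(weights, gaps)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Three hypothetical domains; only the second one lags its reference loss.
weights = [0.5, 0.3, 0.2]
current = [2.0, 2.5, 1.8]
reference = [2.0, 2.0, 1.9]
new_weights = update_domain_weights(weights, current, reference)
```

The next batches would then draw more data from the lagging domain and less from the domains that have already reached their reference loss.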

The effectiveness of this approach is demonstrated with the Sheared-LLaMA series, where a LLaMA2-7B model is pruned down to 1.3B and 2.7B parameters. These pruned models outperform state-of-the-art models of similar sizes on various tasks while requiring only 3% of the computational resources needed to train equivalent models from scratch.

Dynamic batch loading balances loss reduction across domains, leading to more efficient data usage and improved downstream performance. The Sheared-LLaMA models also exhibit higher inference throughput than models with non-uniform layer configurations, such as those produced by CoFiPruning.

The work acknowledges limitations, such as reliance on available pre-training datasets and large models and the current restriction to a 7B parameter model. However, the method’s scalability promises applicability to larger models in future research. The paper highlights structured pruning as a viable path for producing competitive, small LLMs at a lower cost.

Targeted structured pruning produces a compact and dense model of a pre-specified shape. (source:

Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection

Authors: Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, Hannaneh Hajishirzi

TL;DR: SELF-RAG is a new framework that enhances LLMs’ quality and factuality through on-demand retrieval and self-reflection. It uses reflection tokens for self-evaluation, significantly outperforming conventional LLMs across various tasks. This approach offers tailored model behavior and improved factuality without additional training, marking a significant advancement in LLM capabilities.


This paper discusses self-reflective retrieval-augmented generation (SELF-RAG), a novel framework designed to enhance the quality and factuality of large language models (LLMs). Traditional LLMs struggle with factual inaccuracies, partly due to their sole reliance on parametric knowledge. Existing retrieval-augmented generation (RAG) methods help reduce errors, but they often retrieve irrelevant information, compromising the versatility and output quality of LLMs.

SELF-RAG addresses these issues by enabling LLMs to adaptively retrieve relevant information and reflect on it through specially designed reflection tokens. These tokens allow the model to self-evaluate its output for relevance, support, and completeness. Unlike conventional RAG methods, SELF-RAG does not indiscriminately retrieve information but tailors its behavior according to the task requirements, improving the factual accuracy and quality of the generated content.

The framework involves two components: a generator and a critic model. The generator model, trained on a dataset augmented with reflection tokens and retrieved passages, predicts the next output tokens, including reflection tokens. The critic model assesses the relevance and support of retrieved passages and the quality of the generated output. Because the generator learns to produce reflection tokens itself, SELF-RAG can function without the critic model during inference, reducing computational overhead.

Empirical evaluations across six tasks demonstrate that SELF-RAG outperforms pre-trained and instruction-tuned LLMs, including those with more parameters. It performed better in open-domain question answering, reasoning, fact verification, and long-form generation tasks, achieving higher factuality and citation accuracy. Additionally, SELF-RAG’s framework allows for customized model behavior at test time without additional training.

Overview of SELF-RAG, a framework that learns to retrieve, critique, and generate text passages to enhance overall generation quality, factuality, and verifiability. (source:

BitNet: Scaling 1-Bit Transformers for Large Language Models

Authors: Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei

TL;DR: BitNet, a 1-bit Transformer architecture, efficiently scales large language models (LLMs) by reducing energy consumption and memory footprint. It achieves competitive performance compared to conventional models, demonstrating higher training stability and scaling efficiency. BitNet’s potential for larger model applications offers a sustainable and efficient path for LLM development.


This paper introduces BitNet, a novel 1-bit Transformer architecture optimized for large language models (LLMs). Addressing the challenges of high energy consumption and large memory footprint associated with LLMs, BitNet offers a scalable and efficient solution. It employs BitLinear, a replacement for the standard linear layer in Transformers, enabling the training of 1-bit weights from scratch. This approach markedly reduces memory use and energy requirements while maintaining competitive performance compared to 8-bit quantization methods and FP16 Transformer baselines.

BitNet follows the conventional Transformer layout with self-attention and feed-forward networks, using binarized model weights for matrix multiplication. The weights are binarized using the signum function, and activations are quantized to 8-bit precision. Group Quantization and Normalization techniques are incorporated for efficient model parallelism, optimizing computational efficiency in energy and memory usage.
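
A minimal sketch of the binarized matrix multiplication described above, in plain Python. It assumes weights are centered before taking the sign, that a single scale (the mean absolute weight) restores magnitude, and that activations use absmax scaling to a signed 8-bit range; these details, along with the function names, are assumptions for illustration, and the sketch omits the normalization and grouping steps.

```python
def binarize_weights(weights):
    """Map a weight matrix to +1/-1 signs plus one scaling factor."""
    flat = [w for row in weights for w in row]
    mean = sum(flat) / len(flat)
    scale = sum(abs(w) for w in flat) / len(flat)
    # Values at or below the mean map to -1, values above it to +1.
    signs = [[1 if w - mean > 0 else -1 for w in row] for row in weights]
    return signs, scale


def quantize_activations(x, bits=8):
    """Absmax-quantize activations to signed integers of the given width."""
    q_max = 2 ** (bits - 1) - 1  # 127 for 8 bits
    gamma = max(abs(v) for v in x) or 1.0  # avoid dividing by zero
    quantized = [max(-q_max, min(q_max, round(v * q_max / gamma))) for v in x]
    return quantized, gamma / q_max


def bit_linear(weights, x):
    """Approximate weights @ x using 1-bit weights and 8-bit activations."""
    signs, w_scale = binarize_weights(weights)
    q_x, x_scale = quantize_activations(x)
    # The inner sum needs only integer additions and subtractions.
    return [w_scale * x_scale * sum(s * q for s, q in zip(row, q_x)) for row in signs]


W = [[0.4, -0.2], [-0.7, 0.9]]
x = [1.0, -0.5]
signs, w_scale = binarize_weights(W)
q_x, _ = quantize_activations(x)
y = bit_linear(W, x)
```

Because the inner product involves only signs and small integers, the expensive floating-point multiplications reduce to two rescaling factors per output.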

The training of BitNet utilizes the straight-through estimator for backpropagation and employs mixed-precision training to ensure stability and accuracy. BitNet demonstrates higher training stability than FP16 Transformers, effectively utilizing larger learning rates.

Comparative analyses against other quantization methods reveal that BitNet consistently outperforms baselines across various tasks and model sizes, particularly at lower bit levels. It achieves superior zero-shot and few-shot performance on standard datasets like Winogrande, Winograd, Storycloze, and Hellaswag. Significantly, BitNet follows a scaling law similar to full-precision Transformers, suggesting its potential for effective scaling to even larger models while retaining efficiency and performance benefits. Future work aims to scale BitNet further and explore its application in other architectures like RetNet.

The computation flow of BitLinear and the architecture of BitNet. (source:

Controlled Decoding from Language Models

Authors: Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, Ahmad Beirami

TL;DR: Controlled Decoding (CD) aligns language models with specific goals using off-policy reinforcement learning and a prefix scorer. It enables fine-tuned, efficient control in token-wise or blockwise manners, improving outcomes like dialog safety and length. CD’s flexibility and effectiveness make it a promising approach for responsible AI alignment.


This paper explains Controlled Decoding (CD), a novel method for aligning language models (LMs) with specific objectives, such as safety and factuality. CD utilizes off-policy reinforcement learning to control the autoregressive generation of LMs towards high-reward outcomes. A key component of CD is a prefix scorer — a value function used at inference time to steer the generation towards desired results.

CD operates under a KL-regularized reinforcement learning framework, balancing the objective of achieving higher rewards with the need to minimize divergence from the base LM’s policy. The prefix scorer is trained to predict expected rewards for partially decoded responses. It uses off-policy data and the Bellman update, differing significantly from on-policy reinforcement learning methods.

The paper demonstrates two operational modes of CD: token-wise sampling and a novel blockwise sampling strategy. Token-wise sampling involves adjusting the generation at each token. In contrast, blockwise sampling involves sampling and selecting entire blocks of text based on the prefix scorer’s evaluation, bridging the gap between best-of-K strategy and token-level control.
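
Both modes can be sketched in a few lines. The snippet assumes the token-wise mode reweights the base model's next-token probabilities by the exponentiated prefix score, which is the usual form of a KL-regularized optimum, and that the blockwise mode keeps the highest-scoring of several sampled blocks. The `strength` parameter, the toy scorer, and the candidate strings are hypothetical.

```python
import math


def tokenwise_distribution(base_probs, prefix_scores, strength=1.0):
    """Tilt the base next-token distribution toward high prefix scores."""
    weights = [p * math.exp(strength * v) for p, v in zip(base_probs, prefix_scores)]
    total = sum(weights)
    return [w / total for w in weights]


def blockwise_select(candidate_blocks, prefix_scorer):
    """Pick the candidate block that the prefix scorer rates highest."""
    return max(candidate_blocks, key=prefix_scorer)


# Token-wise mode on a toy three-token vocabulary.
base = [0.6, 0.3, 0.1]
scores = [-1.0, 0.5, 2.0]  # hypothetical prefix-scorer values per token
steered = tokenwise_distribution(base, scores, strength=1.0)


# Blockwise mode with a toy scorer standing in for the learned prefix scorer.
def toy_scorer(block):
    return block.count("good") - block.count("bad")


best = blockwise_select(["bad start", "good and good", "neutral"], toy_scorer)
```

Setting `strength` to zero recovers the base model, which reflects the trade-off between reward and divergence from the base policy.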

Experiments on the DSTC8 Reddit conversations corpus show CD’s effectiveness in improving dialog safety and length. CD offers flexibility in handling multiple objectives, allowing real-time adjustment of reward priorities during inference. The blockwise CD variant consistently achieves a better trade-off between reward and KL divergence, indicating its potential for practical LM alignment.

In summary, CD provides a robust framework for aligning LMs with complex objectives in an efficient, modular manner. However, applying these alignment techniques requires careful consideration, especially in sensitive areas like safety.

An illustration of token-wise sampling using CD prefix scorer, where the alignment goal is to decode sequences with positive sentiment. (source: 

Representation Engineering: A Top-Down Approach to AI Transparency

Authors: Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren, Alexander Pan, Xuwang Yin, Mantas Mazeika, Ann-Kathrin Dombrowski, Shashwat Goel, Nathaniel Li, Michael J. Byun, Zifan Wang, Alex Mallen, Steven Basart, Sanmi Koyejo, Dawn Song, Matt Fredrikson, J. Zico Kolter, Dan Hendrycks

TL;DR: Representation Engineering (RepE) enhances AI transparency by focusing on high-level cognitive phenomena in neural networks. It involves two main components: reading and controlling representations. Applied to large language models, RepE effectively enhances honesty, showcasing its potential in addressing safety-relevant issues and advancing AI transparency and control.


This paper discusses Representation Engineering (RepE), an innovative approach to enhancing AI transparency by focusing on high-level cognitive phenomena in deep neural networks (DNNs). RepE, drawing from cognitive neuroscience insights, emphasizes the importance of representations over individual neurons or circuits in understanding and controlling AI systems. The methodology consists of two main components: Representation Reading and Representation Control.

Representation Reading aims to identify emergent representations of high-level concepts and functions within a network, making models more amenable to concept extraction, knowledge discovery, and monitoring. This process involves a baseline technique, Linear Artificial Tomography (LAT), which comprises designing stimuli and tasks, collecting neural activity, and constructing a linear model.

Building on insights from Representation Reading, Representation Control seeks to modify or control these internal representations. It introduces baseline transformations, such as the Contrast Vector, which involves running the same input through the model with different prompts and using the resultant representation differences. The methods include linear combination, piece-wise operation, and projection to control representations.
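
As an illustration of the operations named above, the sketch below builds a contrast vector from two representations of the same input and applies two of the control operations, linear combination and projection. Hidden states are modeled as plain lists of floats; the vectors, the function names, and the `alpha` coefficient are hypothetical.

```python
def contrast_vector(rep_positive, rep_negative):
    """Difference between representations obtained under contrasting prompts."""
    return [p - n for p, n in zip(rep_positive, rep_negative)]


def add_direction(hidden, direction, alpha):
    """Linear combination: push a hidden state along the direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def project_out(hidden, direction):
    """Projection: remove the component of a hidden state along the direction."""
    dot = sum(h * d for h, d in zip(hidden, direction))
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        return list(hidden)
    return [h - (dot / norm_sq) * d for h, d in zip(hidden, direction)]


# Toy representations of one input under "honest" and "dishonest" prompts.
honest = [1.0, 2.0, 0.0]
dishonest = [0.0, 2.0, 1.0]
direction = contrast_vector(honest, dishonest)

hidden = [2.0, 1.0, 0.0]
steered = add_direction(hidden, direction, alpha=0.5)
removed = project_out(hidden, direction)
```

After projection, the hidden state has no remaining component along the contrast direction, while the linear combination moves it further along that direction.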

One application of RepE is enhancing honesty in large language models (LLMs). By applying LAT to datasets of true and false statements, a consistent internal concept of truthfulness is extracted and evaluated across tasks, including standard benchmarks and datasets containing imitative falsehoods. This approach leads to state-of-the-art results in detecting and controlling honesty in LLMs, demonstrating the potential of RepE in addressing a range of safety-relevant issues in AI systems.

An example of the LAT baseline aimed to extract neural activity related to the target concept or function. (source:

GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling

Author: Tobias Katsch

TL;DR: GateLoop, a new sequence model, advances linear recurrence by introducing data-controlled state transitions. It outperforms existing models in language modeling tasks, combining efficiency with dynamic memory control. Its innovative approach enhances sequence modeling, offering a promising direction for future deep learning and natural language processing (NLP) research.


The paper introduces GateLoop, a foundational sequence model that enhances the capabilities of linear recurrence for modeling long sequences. GateLoop generalizes existing linear recurrent models, such as S4, S5, LRU, and RetNet, by incorporating data-controlled state transitions. This innovation allows for efficient and effective auto-regressive language modeling, demonstrating superiority over state-of-the-art models like Transformer and Hyena.

GateLoop addresses the limitations of Recurrent Neural Networks (RNNs) and their derivatives, which struggle with long-range dependencies due to the vanishing and exploding gradient problem. Unlike Transformers, which eliminate recurrence in favor of attention mechanisms, GateLoop retains linear recurrence’s benefits while overcoming its limitations.

The core of GateLoop involves data-controlled gating of inputs, hidden states, and outputs, enabling dynamic control over memory retention and forgetting. This approach differs significantly from existing models, which typically lack such data-controlled gating mechanisms. GateLoop can operate in low-cost O(l) recurrent mode, efficient O(l log l) parallel mode, and an O(l^2) surrogate attention mode, providing flexibility and efficiency in different contexts.
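
The equivalence between the recurrent and surrogate attention modes can be checked with a small sketch. It assumes a simplified recurrence in which the state is updated as `state = a_t * state + outer(k_t, v_t)` and read out with `q_t`, using a real scalar gate `a_t` per step; the gate form and the toy inputs are assumptions for illustration, and the parallel-scan mode is omitted.

```python
def gateloop_recurrent(q, k, v, a):
    """O(l) mode: carry a (d_k x d_v) state through the sequence."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q_t, k_t, v_t, a_t in zip(q, k, v, a):
        # Data-controlled transition: decay the old state, then write k_t^T v_t.
        state = [
            [a_t * state[i][j] + k_t[i] * v_t[j] for j in range(d_v)]
            for i in range(d_k)
        ]
        outputs.append([sum(q_t[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs


def gateloop_attention(q, k, v, a):
    """O(l^2) mode: attention-like sum with cumulative gate products as weights."""
    d_v = len(v[0])
    outputs = []
    for t in range(len(q)):
        y = [0.0] * d_v
        decay = 1.0  # product of a[m+1..t], built up while walking backwards
        for m in range(t, -1, -1):
            score = decay * sum(qi * ki for qi, ki in zip(q[t], k[m]))
            for j in range(d_v):
                y[j] += score * v[m][j]
            decay *= a[m]
        outputs.append(y)
    return outputs


q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]]
v = [[1.0], [2.0], [3.0]]
a = [0.9, 0.5, 0.8]  # in GateLoop these gates would depend on the input
rec = gateloop_recurrent(q, k, v, a)
att = gateloop_attention(q, k, v, a)
```

Both functions produce the same outputs; the recurrent form trades the quadratic pairwise sum for a fixed-size state, which is what makes the low-cost mode possible.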

Experimental results confirm the efficacy of GateLoop. On the WikiText-103 benchmark, a standard for autoregressive natural language modeling, GateLoop significantly outperforms existing models. The paper also includes a synthetic task, Memory Horizon, designed to demonstrate the advantages of data-controlled state transitions. This task highlights GateLoop’s ability to effectively manage memory based on input, a crucial factor in practical sequence modeling.

The GateLoop framework takes input-dependent values V, keys K, queries Q, and state transitions A. (source: