Context by Cohere
Our Research Discord Community Highlights the Top Papers of September 2023

Stay at the forefront of NLP advances with Cohere For AI's community-curated research in September 2023 🔍🧠



TL;DR: Dive into the top NLP papers from September 2023, curated by Cohere For AI, which cover topics such as data pruning for pretraining, parameter-efficient mixture of experts, software portability, speculative decoding, prompt optimization, hallucination reduction, and interpretability. Stay up to date in the fast-evolving NLP field, and consider joining Cohere’s research community.

Generative AI enthusiasts and practitioners, get ready for a thrilling ride as we delve into the latest breakthroughs in natural language processing! Our team at Cohere has worked tirelessly to research and collaborate with our research community to bring you the most up-to-date developments in the Generative AI domain. In this post, we’re excited to give you an overview of some of the latest progress in this fast-evolving field, so you can stay well informed and ahead of the curve.

Cohere is dedicated to making LLMs readily available to both developers and enterprises, so they can unleash their true potential. In pursuit of this mission, we continually seek passionate individuals to join our research community and contribute to the advancement of this innovative technology. By participating in Cohere For AI, you can actively help shape the future of NLP and be a part of a collaborative and groundbreaking journey. We invite you to apply and become an integral member of our thriving research community.

Our Research Discord Community Highlights Some of the Top Recent Papers

C4AI research Discord community members highlighted these papers. We thank @Herumb Shandilya, @mohamdy, Sara Hooker, and the rest of the Cohere For AI research community for participating!

When Less is More: Investigating Data Pruning for Pretraining LLMs at Scale

Authors: Max Marion, Ahmet Üstün, Luiza Pozzobon, Alex Wang, Marzieh Fadaee, Sara Hooker

TL;DR: This paper explores various data pruning methods during large language model (LLM) pretraining. The research finds that perplexity-based pruning, especially while keeping the middle data subset, consistently improves model performance across various model scales and downstream tasks.

The paper delves into an extensive investigation of data pruning methods to optimize the performance of large language models (LLMs) during the pretraining phase. It focuses on three pruning methods — perplexity scores, memorization ranking, and Error L2-Norm (EL2N) scores — applied across multiple data subsets (bottom, middle, and top of the pruning score distribution).

The study reveals that pruning based on perplexity, particularly while retaining the middle data subset, yields superior performance compared to other metrics, and even surpasses the models trained on the entire dataset at specific dataset sizes. Additionally, the size and quality of the reference models used to compute perplexity influence the pruning methods’ effectiveness. Larger reference models provide more effective pruning signals but, in contrast with other literature, reference models trained on “high-quality” datasets, like Wikipedia, did not.
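To make the "keep the middle" idea concrete, here is a minimal sketch, not the authors' implementation. It assumes a reference model has already produced per-token log-probabilities for each document; the helper names `perplexity` and `keep_middle`, the toy documents, and the log-probability values are all hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity is exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def keep_middle(examples, scores, keep_fraction):
    """Keep the `keep_fraction` of examples centred on the median score."""
    order = sorted(range(len(examples)), key=lambda i: scores[i])
    n_keep = max(1, round(len(examples) * keep_fraction))
    start = (len(examples) - n_keep) // 2
    kept = sorted(order[start:start + n_keep])  # restore original order
    return [examples[i] for i in kept]

# Hypothetical documents with made-up reference-model log-probabilities.
docs = ["a", "b", "c", "d", "e", "f"]
logprobs = [
    [-0.1, -0.2],
    [-3.0, -2.5],
    [-1.0, -1.2],
    [-0.9, -1.1],
    [-0.05, -0.05],
    [-2.0, -2.2],
]
scores = [perplexity(lp) for lp in logprobs]
subset = keep_middle(docs, scores, keep_fraction=1 / 3)
print(subset)
```

Pruning the bottom or top subset instead would only change which slice of `order` is kept.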

In further exploring the robustness of perplexity-based pruning, the paper demonstrates that the benefits scale well with larger models, showcasing the method’s potential for broader applications. The study also touches on using early reference model checkpoints, some of which offer adequate signals for effective data pruning, showcasing a potential avenue for saving computational resources.

By underlining the improvements in model performance achieved through careful data pruning, the paper suggests that this method provides a promising, potentially more resource-efficient alternative to merely increasing the training data for enhancing LLM performance. Through these findings, the study contributes to the ongoing efforts toward optimizing the LLM training process, with implications for model training effectiveness and computational efficiency.

Demonstration of pruning methodology (source:

Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning

Authors: Ted Zadouri, Ahmet Üstün, Arash Ahmadian, Beyza Ermiş, Acyr Locatelli, Sara Hooker

TL;DR: This paper introduces and evaluates the Mixture of Vectors (MoV) and Mixture of LoRA (MoLORA) techniques in a Mixture-of-Experts (MoE) framework. The aim is to enhance parameter-efficient fine-tuning and zero-shot performance on unseen tasks using a text-to-text large language model. The technique achieved results comparable to full fine-tuning while updating less than 1% of the model parameters.

The paper explores enhancing parameter-efficient fine-tuning in large language models (LLMs) using a Mixture-of-Experts (MoE) framework by introducing two novel methods: Mixture of Vectors (MoV) and Mixture of LoRA (MoLORA). These methodologies address challenges associated with scaling instruction-tuned LLMs, especially in environments that are exceptionally limited computationally.

The experiments involved rigorous ablations to understand the effects of various routing strategies and token versus sentence embeddings for routing input. The research also evaluated the impact of the number of experts on downstream performance across multiple model sizes.

The proposed MoE variants (MoV and MoLORA) significantly boost the zero-shot performance on unseen tasks compared to standard parameter-efficient fine-tuning (PEFT) techniques, like (IA)^3 and LoRA, while requiring only a marginal increase in the number of updated parameters. Interestingly, the performance improved or remained on par with full fine-tuning despite updating less than 1% of the 3B and 11B model parameters.
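As a rough illustration of how a mixture of vectors could work, the sketch below assumes each expert is an (IA)^3-style scaling vector and that a softmax router blends the experts before the blend rescales a frozen activation. The function `mov_layer`, its arguments, and the toy values are hypothetical; the routing details in the paper may differ.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mov_layer(hidden, router_weights, expert_vectors):
    """Rescale `hidden` with a router-weighted mix of expert vectors."""
    # One router logit per expert: dot(router_weights[e], hidden).
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in router_weights]
    gates = softmax(logits)
    # Soft merging: average the expert vectors, weighted by the gates.
    merged = [sum(g * vec[d] for g, vec in zip(gates, expert_vectors))
              for d in range(len(hidden))]
    # Element-wise rescaling of the frozen model's activation.
    return [m * h for m, h in zip(merged, hidden)], gates

hidden = [1.0, -2.0]
router_weights = [[0.0, 0.0], [0.0, 0.0]]   # untrained router: uniform gates
expert_vectors = [[1.0, 1.0], [3.0, 3.0]]   # two trainable scaling vectors
output, gates = mov_layer(hidden, router_weights, expert_vectors)
print(output, gates)
```

Only the router and the expert vectors would be trained in this setup, which is why the number of updated parameters stays small.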

This study underscores the effectiveness of using MoEs in achieving computational efficiency without compromising accuracy on diverse unseen tasks.

The authors also delved into hyperparameter sensitivity and found that smaller batch sizes and learning rates led to higher performance. While focusing on text-to-text models like T5, the authors identified avenues for future research, such as extending the evaluation to other models like GPT, or exploring efficacy during the pre-training phase.

This work sheds light on how data scientists can use MoEs in a parameter-efficient and compute-efficient manner to enhance LLMs’ instruction-following capabilities on various unseen tasks.

MoV architecture and pseudo code (source:

The Grand Illusion: The Myth of Software Portability and Implications for ML Progress

Authors: Fraser Mince, Dzung Dinh, Jonas Kgomo, Neil Thompson, Sara Hooker

TL;DR: This paper explores the portability of ML frameworks across different devices. More specifically, it delves into the challenges faced by machine learning researchers when trying to combine different hardware and software. The study finds that moving popular machine learning software between different computer systems often leads to problems, with over 40% of key functions not working correctly and performance suffering.

The paper focuses on the concept of “portability,” or how easily machine learning software can be moved from one type of computer system to another. The authors found that popular machine learning frameworks, when transferred to different types of computer hardware, frequently experience a lack of portability. This lack of portability is characterized by the loss of essential functions and significant performance degradation.

In their study, the authors created a collection of functions from popular machine learning software. They found that when using software like PyTorch and TensorFlow on different computer systems, significant issues were encountered. For example, some functions stopped working properly, and even when they did work, they were much slower. The experiments showed that a large percentage of PyTorch and TensorFlow functions experienced failures, while JAX functions performed better.
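The kind of measurement described above can be pictured with a small harness that runs each function once and records whether it succeeded and how long it took. This is only a sketch of the general idea: `run_suite`, `failure_rate`, and the two toy "operations" are hypothetical and are not the authors' benchmark.

```python
import time

def run_suite(functions):
    """Call each zero-argument function once; record success and latency."""
    report = {}
    for name, fn in functions.items():
        start = time.perf_counter()
        try:
            fn()
            ok = True
        except Exception:  # e.g. an operation missing on the target device
            ok = False
        report[name] = {"ok": ok, "seconds": time.perf_counter() - start}
    return report

def failure_rate(report):
    return sum(not r["ok"] for r in report.values()) / len(report)

def unsupported():
    raise NotImplementedError("op not available on this device")

# Toy stand-ins for framework functions executed on some device.
suite = {
    "sum_of_squares": lambda: sum(i * i for i in range(1000)),
    "unsupported_op": unsupported,
}
report = run_suite(suite)
print(failure_rate(report))
```

Running the same suite on two devices and comparing the reports would surface both kinds of problem the paper describes: functions that fail outright and functions that run but slow down.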

Currently, slowdowns and broken operations are common, which hinders innovation by discouraging researchers from exploring new ideas. The authors suggest that standardized approaches to machine learning tools that enhance portability between diverse hardware types are needed. By releasing their benchmark dataset, they hope to encourage greater support and visibility for frameworks that require improvement.

Comparison of average execution time on log scale for TensorFlow, PyTorch, and JAX functions on GPU versus TPU (source:

Headless Language Models: Learning without Predicting with Contrastive Weight Tying

Authors: Nathan Godey, Éric de la Clergerie, Benoît Sagot

TL;DR: The paper presents a novel method of self-supervised pre-training of language models via contrastive weight tying (CWT). The research focuses on reconstructing input embeddings rather than predicting probabilities, significantly reducing training computational requirements while improving downstream performance and data efficiency in monolingual and multilingual contexts.

The discussion delves into how headless language modeling (HLM) with larger token vocabularies optimizes computational efficiency during training without sacrificing performance. Unlike traditional models, HLM’s time and memory complexity remain constant even with increased vocabulary size, which is especially beneficial for multilingual models with extensive vocabularies, like XLM-V.

Using multiple vocabulary sizes and architectures on the CC-News dataset, HLM outperforms its standard counterparts across all vocabulary sizes, exhibiting almost no reduction in training speed with larger vocabularies. Also, tweaking batch sizes impacts training complexity and model performance, as larger batches prove advantageous for HLM.

The unique modeling approach of HLM, focusing on discrimination between co-occurring tokens rather than a contextual hierarchy over the entire vocabulary, potentially facilitates better linguistic relevance and synonym identification.
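One way to picture the contrastive objective is an InfoNCE-style loss in which each output vector must match the input embedding of its target token better than the embeddings of the other tokens in the batch. The sketch below assumes that formulation; `contrastive_loss` and the toy vectors are illustrative, not the paper's code.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def contrastive_loss(outputs, target_embeddings):
    """Mean InfoNCE loss: outputs[i] should match target_embeddings[i]
    better than the embeddings of the other tokens in the batch."""
    total = 0.0
    for i, out in enumerate(outputs):
        logits = [dot(out, emb) for emb in target_embeddings]
        m = max(logits)
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_norm - logits[i]
    return total / len(outputs)

outputs = [[4.0, 0.0], [0.0, 4.0]]
embeddings = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(outputs, embeddings)
mismatched = contrastive_loss(outputs, embeddings[::-1])
print(aligned, mismatched)
```

The normalization runs over the tokens in the batch rather than the whole vocabulary, so in this sketch the cost of the loss depends on batch size and not on vocabulary size.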

Despite the promising outcomes, the discussion acknowledges the limitation of not scaling up the experiments due to budget constraints. There is room for future exploration, like integrating HLM with other efficient architectures or evaluating it in encoder-decoder setups.

The paper introduces an HLM method using contrastive weight tying (CWT) to address limitations in traditional language modeling, improving training efficiency and task performance. This method accelerates training and reduces compute requirements by eliminating the conventional language modeling projection head. It is easy to integrate into existing pretraining setups, and various experiments show that HLM demonstrates superior data and compute efficiency, outperforming classical models in monolingual and multilingual scenarios.

Classical weight tying vs. contrastive weight tying. (source:

Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding

Authors: Jun Zhang, Jue Wang, Huan Li, Lidan Shou, Ke Chen, Gang Chen, Sharad Mehrotra

TL;DR: This paper introduces self-speculative decoding as a cost-effective, plug-and-play method to speed up large language models (LLMs) without auxiliary models. The technique generates draft tokens quickly in the drafting stage and then verifies them with the original LLM. Benchmarks show up to a 1.73x acceleration.


The paper introduces a novel method, self-speculative decoding, to accelerate the inference speed of large language models (LLMs) like GPT-3/4, PaLM, and LLaMA without auxiliary models. LLMs face significant inference costs due to their autoregressive decoding process, a considerable efficiency bottleneck.

Existing methods to overcome this challenge include model compression techniques and speculative execution, which either alter the model or require additional models, increasing memory overhead. Self-speculative decoding, in contrast, employs the original LLM in two stages: drafting and verification. 

During drafting, the method selectively skips some LLM layers to generate draft tokens quickly, which the unaltered LLM then verifies in the verification stage. This method aims to balance computational efficiency with output quality, and it's optimized using Bayesian optimization to determine which layers to skip.

The paper introduces an adaptive draft-exiting mechanism to enhance computational efficiency. It stops the model from generating draft tokens once the confidence level drops below a specific threshold, so that computation is not wasted on tokens that are likely to be rejected.
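The draft-then-verify loop with adaptive exit can be sketched with toy next-token functions. Everything here is an assumption made for illustration: `draft_step` stands in for the layer-skipping pass, `full_step` for the unaltered model, and verification is done one token at a time, whereas a real system would check all draft tokens in a single forward pass.

```python
def speculative_decode(prompt, draft_step, full_step, max_new,
                       max_draft=4, min_confidence=0.5):
    """Greedy draft-then-verify loop over toy next-token functions.

    draft_step(tokens) -> (token, confidence); full_step(tokens) -> token.
    """
    tokens = list(prompt)
    target_len = len(prompt) + max_new
    while len(tokens) < target_len:
        # Drafting: propose tokens until confidence drops (adaptive exit).
        draft = []
        while len(draft) < max_draft:
            token, confidence = draft_step(tokens + draft)
            if confidence < min_confidence:
                break
            draft.append(token)
        # Verification: keep drafted tokens while the full model agrees.
        for proposed in draft:
            expected = full_step(tokens)
            tokens.append(expected)
            if expected != proposed or len(tokens) == target_len:
                break
        else:
            # Every draft token was accepted (or none was drafted):
            # the full model contributes one more token.
            tokens.append(full_step(tokens))
    return tokens

def full_step(tokens):
    return (tokens[-1] + 1) % 10

def draft_step(tokens):
    # Toy draft: wrong whenever the last token is a multiple of 3.
    guess = full_step(tokens) if tokens[-1] % 3 else 0
    return guess, 0.9

output = speculative_decode([1], draft_step, full_step, max_new=8)
print(output)
```

Because every appended token comes from `full_step`, the output matches plain greedy decoding with the full model, which is the sense in which the method is lossless; the speedup comes from how many draft tokens are accepted per verification.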

The authors propose the method as a plug-and-play solution, requiring no additional training or memory overhead. They say it achieves up to 1.73x acceleration in end-to-end inference time, based on evaluating text summarization and code generation tasks.

Self-speculative decoding process (source:

Large Language Models as Optimizers

Authors: Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V. Le, Denny Zhou, Xinyun Chen

TL;DR: The paper introduces Optimization by PROmpting (OPRO), which uses large language models (LLMs) to optimize various tasks. This method iteratively generates and evaluates solutions based on natural language prompts, demonstrating enhanced performance over human-designed prompts in multiple problem settings.


The paper delves into the potential of large language models (LLMs) as effective optimizers, especially for instruction following, a critical component of executing higher-order tasks. By examining various optimization setups of initial instructions and temperature settings, the authors show how different starting points and optimization parameters affect the LLMs’ performance.
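The iterative generate-and-evaluate loop behind OPRO can be sketched as follows. This is a simplified illustration under stated assumptions: `propose` stands in for the optimizer LLM, `score` for the scorer, and the meta-prompt format (past candidates with their scores, best last) is made up for the example.

```python
def opro(propose, score, initial, steps=5, keep=3):
    """Keep a scored history, ask `propose` for a new candidate, repeat."""
    history = [(score(initial), initial)]
    for _ in range(steps):
        # The meta-prompt shows the best candidates so far, best last.
        top = sorted(history)[-keep:]
        meta_prompt = "\n".join(f"text: {s}\nscore: {v}" for v, s in top)
        candidate = propose(meta_prompt)
        history.append((score(candidate), candidate))
    return max(history)

def toy_propose(meta_prompt):
    """Stand-in for an LLM: extend the best instruction seen so far."""
    texts = [line[len("text: "):] for line in meta_prompt.splitlines()
             if line.startswith("text: ")]
    return texts[-1] + "!"

def toy_score(text):
    """Stand-in for task accuracy: count exclamation marks."""
    return text.count("!")

best = opro(toy_propose, toy_score, "Solve it")
print(best)
```

With a real model, `propose` would sample several candidates at a chosen temperature, which is where the exploration and exploitation trade-off enters.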

The researchers pay particular attention to the Text-Bison and PaLM 2-L scorer LLMs, which exhibit varying degrees of accuracy based on different initial instructions. This analysis highlights that the choice of initial instructions can significantly impact the optimization trajectory, especially in the early steps, and different temperature settings influence the exploration and exploitation balance in optimization. This, in turn, affects the optimization process’s creativity and steadiness.

In a more extensive analysis, the paper reveals that data scientists can further hone the optimization process by understanding and leveraging the common patterns in high-quality instructions generated through the optimization steps. Additionally, the authors suggest that incorporating richer feedback on error cases and summarizing key features distinguishing high-quality and low-quality prompts could lead to more efficient improvements in generated instructions, indicating a promising direction for future work.

Through these investigations, the paper provides comprehensive insight into the nuanced factors affecting LLM optimization. It also lays a framework for enhancing their performance in instruction-following tasks, pivotal for the practical applicability of open-source LLMs in real-world scenarios.

OPRO framework (source:

Chain-of-Thought Reasoning is a Policy Improvement Operator

Authors: Hugh Zhang, David C. Parkes

TL;DR: This paper explores self-training in large language models, mainly on arithmetic tasks. It demonstrates the potential of Chain of Thought (CoT) reasoning in enhancing the model’s self-learning capabilities, achieving up to 30-digit addition.


In a quest to push the boundaries of self-training in large language models, the paper delves into a structured experiment. It uses ByT5 models of varying sizes: 300M and 582M parameters.

The paper underscores the significance of an initial supervised learning phase before the models can transition into a self-training regime, with the larger model requiring fewer training examples to generalize effectively. The researchers tasked the models with learning arithmetic addition up to varying digit lengths and then meticulously recorded their performance metrics.

Remarkably, despite the 300M model’s halt at 25-digit addition and the 582M model at 29-digit addition, the self-training methodology enabled them to perform beyond their last training checkpoint. This result showcases an intriguing interplay between model size, training data, and self-learning capabilities.

Chain of Thought (CoT) reasoning emerged as a robust policy improvement operator, offering a glimpse into the models’ ability to continue self-learning over numerous iterations. The paper also touches on the broader implications and potential applications of these findings, speculating about the possibilities of extending this self-training methodology to more complex tasks or even eliminating the supervised learning phase in sufficiently large models.

Amidst the promising findings, the paper also candidly discusses the limitations and future directions, hinting at the broader discourse around self-learning, the necessity (or lack thereof) of grounding in real-world signals, and the safety concerns as models inch closer to autonomous learning. This exploration opens a dialogue on the future prospects and challenges for self-training large language models, particularly in objective domains like mathematics and programming.

Chain-of-Thought reasoning visualized (source:

Chain-of-Verification Reduces Hallucination in Large Language Models

Authors: Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, Jason Weston

TL;DR: The paper introduces the Chain-of-Verification (CoVe) method to mitigate hallucinations in large language models. The approach improves LLM accuracy across several tasks by incorporating a verification step in the generation process.

Large language models have evolved rapidly and now perform remarkably well on natural language processing tasks. However, a persistent issue plaguing these models is their tendency to generate hallucinated or factually incorrect information.

The authors propose a novel method to tackle this obstacle. Chain-of-Verification (CoVe) is a structured approach with a verification step to cross-check the generated responses, ensuring they are factually accurate before presenting them to the users.

The process begins with generating a baseline response to a given query. Following this, the model generates a set of verification questions to validate the information contained in the initial response. The model then attempts to answer these verification questions and revises the initial response based on this additional step to ensure accuracy and factual correctness.
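The four steps above can be sketched as a short pipeline around any text-in, text-out model. The prompt wording, the function `chain_of_verification`, and the canned `fake_llm` are hypothetical; the choice to answer each verification question without the draft in the prompt is one possible design, assumed here for simplicity.

```python
def chain_of_verification(llm, query):
    """Draft, plan checks, answer them, then revise. `llm` maps str -> str."""
    baseline = llm(f"Answer the question.\nQuestion: {query}")
    plan = llm(
        "List questions that check the facts in this answer, one per line.\n"
        f"Question: {query}\nAnswer: {baseline}"
    )
    questions = [q.strip() for q in plan.splitlines() if q.strip()]
    # Each check is answered on its own, without the draft in the prompt.
    checks = [(q, llm(q)) for q in questions]
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in checks)
    return llm(
        "Revise the answer so it agrees with the verification results.\n"
        f"Question: {query}\nDraft: {baseline}\nVerification:\n{evidence}"
    )

calls = []

def fake_llm(prompt):
    """Canned stand-in for a real model, used only to exercise the flow."""
    calls.append(prompt)
    if prompt.startswith("Answer the question"):
        return "Paris and Lyon"
    if prompt.startswith("List questions"):
        return "Is Paris the capital of France?\nIs Lyon the capital of France?"
    if prompt.startswith("Revise"):
        return "Paris"
    return "yes" if "Paris" in prompt else "no"

final = chain_of_verification(fake_llm, "What is the capital of France?")
print(final, len(calls))
```

Swapping `fake_llm` for a call to a real model is the only change needed to try the flow end to end.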

In a detailed evaluation, the authors apply the CoVe method across multiple tasks, including list-based questions, closed-book question answering, and long-form text generation. A series of experiments demonstrate that CoVe significantly reduces the rate of hallucinations, improving the accuracy and reliability of LLM-generated responses.

The results suggest a promising pathway toward enhancing the performance of language models, making them more dependable for critical applications where factual accuracy is paramount. The CoVe method embodies a systematic way to self-check and correct the responses, bringing a new level of rigor to the generation process, which is instrumental in advancing the utility and trustworthiness of open-source LLMs.

Chain-of-Verification method (source:

From Sparse to Dense: GPT-4 Summarization with Chain of Density Prompting

Authors: Griffin Adams, Alexander Fabbri, Faisal Ladhak, Eric Lehman, Noémie Elhadad

TL;DR: This paper explores the concept of summary densification, aiming to strike a balance between informativeness and readability in machine-generated summaries.


In recent years, the field of automatic summarization has shifted from supervised fine-tuning to zero-shot prompting using large language models (LLMs). While these LLMs offer great control over summary characteristics, including length, topics, and style, they overlook a crucial aspect: information density.

As compressed versions of source texts, summaries should ideally contain a higher concentration of information. However, increasing information density must not sacrifice readability.

This paper introduces the concept of summary densification, an iterative process of making machine-generated summaries more information-dense while keeping them concise and coherent. The authors use a Chain of Density (CoD) prompt, starting with sparse entity coverage and gradually adding more entities without increasing the summary’s length.
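Density in this sense can be approximated as covered entities per token. The sketch below assumes a known list of entities and whitespace tokenization, both simplifications; `entity_density` and the example summaries are hypothetical.

```python
def entity_density(summary, entities):
    """Entities mentioned in the summary, divided by its token count."""
    tokens = summary.split()
    text = summary.lower()
    covered = [e for e in entities if e.lower() in text]
    return len(covered) / len(tokens)

entities = ["Cohere", "Discord", "September"]
sparse = "This post is about some papers shared on Discord"
dense = "Cohere shared papers on Discord in September this year"

print(entity_density(sparse, entities))
print(entity_density(dense, entities))
```

Both summaries have the same number of tokens, so the second one is denser only because it covers more entities, which mirrors the constraint of adding entities without increasing length.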

Human evaluations suggest there is an optimal density level, with annotators preferring intermediate densification steps. Automatic metrics also reveal that increased densification correlates with higher informativeness, but must be carefully balanced to maintain quality and coherence in the summaries.

This study’s findings provide valuable insights into achieving the correct balance between informativeness and readability in machine-generated summaries.

Chain of Density prompt and example output (source:

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Authors: Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, Lee Sharkey

TL;DR: This paper delves into enhancing AI interpretability by reverse engineering neural networks and understanding the superposition phenomenon using sparse dictionary learning.


The rapid evolution of artificial intelligence, particularly in neural networks, has birthed systems whose inner workings and decision-making processes remain obscure. This obscurity has sparked concerns about the risks and implications of deploying AI systems that are not fully comprehensible, including the potential for AI-driven deception.

To address this concern, the study underscores the importance of mechanistic interpretability, a method aiming to unveil and understand the intricacies of neural networks.

Elhage et al. (2022) shed light on the superposition concept, suggesting that neural networks might represent more features than a layer has dimensions. This paper proposes a solution: sparse dictionary learning. This method represents each data point as a sparse combination of a few elements from a learned dictionary, providing insights into the multifaceted nature of neural networks.
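A sparse autoencoder is one common way to do sparse dictionary learning. The forward pass below is a minimal sketch assuming tied encoder and decoder weights, a ReLU encoder, and a reconstruction loss plus an L1 sparsity penalty; `sae_forward`, the hand-picked dictionary, and the coefficient value are illustrative and not taken from the paper.

```python
def relu(x):
    return max(0.0, x)

def sae_forward(x, dictionary, bias, l1_coeff):
    """Encode x into sparse codes, decode with the same dictionary."""
    # Encoder: one non-negative coefficient per dictionary feature.
    codes = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(dictionary, bias)]
    # Decoder (tied weights): rebuild x as a weighted sum of features.
    recon = [sum(c * row[d] for c, row in zip(codes, dictionary))
             for d in range(len(x))]
    mse = sum((a - r) ** 2 for a, r in zip(x, recon)) / len(x)
    sparsity = sum(codes)  # L1 norm, since the codes are non-negative
    return codes, recon, mse + l1_coeff * sparsity

# Three features in a 2-D activation space: more features than dimensions.
dictionary = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
bias = [0.0, 0.0, 0.0]
codes, recon, loss = sae_forward([2.0, 0.0], dictionary, bias, l1_coeff=0.1)
print(codes, recon, loss)
```

Training would adjust the dictionary and bias to minimize this loss over many activations, and the L1 term is what pushes most codes to zero.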

This research presents tools to understand and dissect AI systems while offering hope for a future where AI system operations are advanced yet transparent.

Overview of the sparse dictionary learning method (source:

Final Thoughts

Are you ready to revolutionize the way you work with large volumes of text? Look no further than incorporating large language models into your workflow. This list of cutting-edge research on NLP serves as your guide to unlocking the full potential of this powerful technology. But don't just take our word for it—experiment and tweak to find the perfect model for your specific needs. And the journey doesn't have to be a solitary one—join our Discord community to share your discoveries and collaborate with like-minded individuals. Ready to dive in? Try out our NLP API on the Cohere playground and start building the future of natural language processing today.
