Intriguing Properties of Quantization at Scale

TL;DR

Are the emergent properties that make quantization difficult truly inherent to scale, or can they be altered and conditioned by optimization choices?


Recent advances in Large Language Models (LLMs) have led to their broad adoption in downstream applications such as copywriting, summarization, and chatbots, to name a few. With these models growing past billions of parameters, efficiency is the name of the game when it comes to deploying them. One of the most effective techniques for making any neural network efficient is Post-Training Quantization (PTQ). The core idea of quantization is to store data at a lower precision: in simple terms, it amounts to rounding the values of the network's parameters, together with the necessary scaling operations. For instance, a weight value of 16.358 stored in 32-bit floating point can be converted to the 8-bit integer 16. This reduces both computation and storage. In neural networks, the time taken to process inputs and generate outputs (latency) is the sum of two components: data movement and arithmetic operations. Quantization improves both facets: lower precision moves data through the GPU faster, and it lets us leverage specialized hardware in modern GPUs that speeds up the matrix multiplications. However, quantizing LLMs has proven to be significantly more challenging as they grow in size.
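
To make the rounding-plus-scaling idea concrete, here is a minimal NumPy sketch of symmetric "absmax" 8-bit quantization, with one scale for the whole tensor. It is illustrative only, not the exact scheme used in our experiments, and the array values are made up.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Absmax quantization: map fp32 values into the int8 range with a single scale."""
    scale = 127.0 / np.max(np.abs(x))
    q = np.clip(np.round(x * scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate fp32 values from the int8 codes."""
    return q.astype(np.float32) / scale

weights = np.array([16.358, -3.2, 0.07, 101.5], dtype=np.float32)  # made-up values
q, scale = quantize_int8(weights)
print(q)                      # small int8 codes, e.g. [ 20  -4   0 127]
print(dequantize(q, scale))   # approximately the original values, up to rounding error
```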

This difficulty in quantizing at scale has commonly been described as an “emergent property”. In the context of LLMs, emergent properties are properties that appear in larger models but not in smaller ones [1]. Figure 1 shows an example of an emergent property: few-shot prompting performance increases significantly after a certain scale. Notably, there has been significant interest in these emergent properties. Previous work discovered that a few outlier dimensions emerge in the network's hidden states after a certain scale and make quantization of large-scale models much harder [2].

Figure 1: The ability to perform a task via few-shot prompting is emergent; beyond a certain model scale, performance significantly increases to well above random [1]

Put simply, in LLMs the values at certain positions in the hidden states are consistently larger in magnitude than the rest, which leads to suboptimal quantization and hampers downstream performance.
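
To see why a handful of large values is a problem, consider a rough sketch: with a single absmax scale, one outlier dimension stretches the scale, so every other value is rounded more coarsely. The numbers below (a standard normal vector plus one artificial outlier of 60) are purely illustrative.

```python
import numpy as np

def int8_roundtrip_error(x: np.ndarray) -> np.ndarray:
    """Quantize with a single absmax scale, dequantize, and return per-element error."""
    scale = 127.0 / np.max(np.abs(x))
    x_hat = np.round(x * scale) / scale
    return np.abs(x - x_hat)

rng = np.random.default_rng(0)
hidden = rng.normal(0.0, 1.0, size=512).astype(np.float32)   # well-behaved hidden state

err_clean = int8_roundtrip_error(hidden)
err_outlier = int8_roundtrip_error(np.append(hidden, 60.0))  # add one outlier dimension

print(err_clean.mean())          # small: the scale fits the bulk of the values
print(err_outlier[:-1].mean())   # roughly an order of magnitude larger for the same values
```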

In our recent work "Intriguing Properties of Quantization at Scale", we ask and attempt to answer the question: are emergent properties due to nature or nurture? More concretely, are outlier dimensions an inherent property of LLMs or are they a result of optimization choices made in pre-training?

We study some of the popular optimization hyperparameters used in LLM pre-training to see how they impact the model’s post-quantization performance (a sketch of where each knob enters a typical training step follows the list):

  • Weight decay: prevents overfitting by penalizing large magnitude weights
  • Gradient clipping: clips large gradients to prevent exploding gradients and accelerate convergence
  • Residual dropout: drops activations in the layers immediately preceding a residual connection
  • Half-precision data type: the data type in which forward and backward computation is performed (mixed-precision training is typically used in LLM pre-training to reduce computation time and memory requirements)
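
For readers who want to see where each of these knobs lives, here is a minimal PyTorch-style training step. The model, learning rate, weight-decay value, clipping threshold, and tensor shapes are hypothetical placeholders, not the configuration used for the models in the paper.

```python
import torch

# Residual dropout: the `dropout` argument controls the dropout applied before the
# residual additions (and inside the feed-forward block); 0.0 disables it.
model = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, dropout=0.0, batch_first=True)

# Weight decay: penalizes large-magnitude weights through the optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.1)

def training_step(x: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    # Half-precision data type: run forward/backward in bfloat16 (on GPU, use
    # device_type="cuda"; float16 would additionally need a GradScaler).
    with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
        loss = model(x).pow(2).mean()  # stand-in loss for illustration
    loss.backward()
    # Gradient clipping: cap the global gradient norm before the optimizer update.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()

print(training_step(torch.randn(4, 16, 512)))  # (batch, sequence, d_model)
```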

We use simple 8-bit integer quantization, which delivers high efficiency in terms of memory and throughput without requiring any tuning or extra decomposition. More technically, we apply vector-wise quantization, where each row of the hidden states and each column of the weights is quantized with its own scale, increasing quantization granularity.
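
As a rough illustration of vector-wise quantization, the NumPy sketch below simulates an int8 matrix multiply with one scale per row of the hidden states and one scale per column of the weights. It mimics the spirit of the scheme rather than the exact kernels used in the paper.

```python
import numpy as np

def vectorwise_int8_matmul(X: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Simulated vector-wise int8 matmul: each row of X and each column of W
    gets its own scale, so every dot product uses its own pair of scales."""
    sx = 127.0 / np.max(np.abs(X), axis=1, keepdims=True)   # (rows, 1)
    sw = 127.0 / np.max(np.abs(W), axis=0, keepdims=True)   # (1, cols)
    Xq = np.round(X * sx).astype(np.int8)
    Wq = np.round(W * sw).astype(np.int8)
    acc = Xq.astype(np.int32) @ Wq.astype(np.int32)          # int8 inputs, int32 accumulation
    return acc / (sx * sw)                                   # rescale back to float

X = np.random.randn(4, 64).astype(np.float32)   # hidden states
W = np.random.randn(64, 32).astype(np.float32)  # weights
err = np.abs(vectorwise_int8_matmul(X, W) - X @ W).max()
print(err)  # small relative to the typical magnitude of the fp32 outputs
```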

Finding the Optimal Pre-training Choices

To explore the impact of each optimization choice, we train multiple GPT-style (decoder-only) language models with 6 billion parameters for 75,000 steps. We report the mean normalized relative performance difference for each variant across 8 tasks in the figures below (negative values denote degradation).
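
For clarity, this is how we read the reported metric: the per-task relative difference between the quantized score and its half-precision baseline, averaged over tasks (the exact normalization in the paper may differ slightly). The scores below are invented for illustration.

```python
def mean_relative_difference(baseline_scores, int8_scores):
    """Mean relative performance difference in percent; negative means the
    quantized model underperforms its higher-precision baseline."""
    diffs = [100.0 * (q - b) / b for b, q in zip(baseline_scores, int8_scores)]
    return sum(diffs) / len(diffs)

# Hypothetical per-task accuracies for three of the eight tasks.
print(mean_relative_difference([0.70, 0.55, 0.62], [0.69, 0.54, 0.62]))  # ≈ -1.1
```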

Figure 2: PTQ degradations when varying optimization choices. For each experimental axis, the other hyperparameters remain fixed.
Figure 3: (Left) PTQ degradation when varying half-precision data types. (Right) PTQ degradation over the course of training.

We observe that a relatively large weight decay value and gradient clipping lead to better PTQ performance. Interestingly, the data type used for half-precision training, float16 (fp16) or bfloat16 (bf16), makes the most significant difference in quantization performance. The float16 variants (colored green in Figure 3, left) show the highest degradation, with the performance drop growing the longer they are trained (green lines in Figure 3, right).

This is a valuable insight, as the choice of half-precision data type is most commonly dictated by the available training hardware, and most competitive LLMs are pre-trained for hundreds of thousands of steps.
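
For context on why the data type can matter at all: fp16 and bf16 occupy the same 16 bits but split them differently, with fp16 favoring precision and bf16 favoring range (it keeps fp32's exponent width). The quick PyTorch check below only illustrates that numeric difference; it is not, by itself, an explanation of the outlier behaviour.

```python
import torch

for dtype in (torch.float16, torch.bfloat16):
    info = torch.finfo(dtype)
    print(dtype, "max:", info.max, "eps:", info.eps)
# float16  max: 65504.0,  eps: ~9.8e-4  (5 exponent bits, 10 mantissa bits)
# bfloat16 max: ~3.39e38, eps: ~7.8e-3  (8 exponent bits, 7 mantissa bits)

# float16 overflows to inf far sooner than bfloat16:
print(torch.tensor(70000.0).to(torch.float16))   # inf
print(torch.tensor(70000.0).to(torch.bfloat16))  # ≈ 70144 (coarser, but in range)
```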

Scaling to 52B

From our experiments with the 6-billion-parameter models, we infer that high weight decay, no residual dropout, gradient clipping, and, surprisingly, bfloat16 training are the most PTQ-friendly pre-training choices.

To validate this at different scales, we fully train models ranging from 410M to 52B parameters with these hyperparameters. Across the same set of tasks, our 52B model shows a 0.08% improvement in PTQ performance, while OPT-66B, the closest OPT model in terms of size, shows roughly 42% mean degradation.

Figure 4: Cohere-int8 models significantly outperform OPT-int8 models

Figure 5 compares the mean zero-shot accuracy on HellaSwag, PIQA, LAMBADA, and WinoGrande between Cohere and OPT models for the float16 baseline and 8-bit integer quantization, with the OPT numbers taken directly from LLM.int8(). Notice that OPT-Int8 (light green) shows a significant drop in performance as the number of parameters increases. In contrast, Cohere-Int8 shows no degradation relative to the float16 baseline even at very large parameter counts.

Conclusion

Our results support the conclusion that optimization choices play a large role in whether emergent outliers appear at scale. This allows us to train very large language models that are robust to post-training quantization. We believe there is more work to be done here. We also hope the insights from our work illustrate the significant impact the underlying hardware can have on PTQ, since support for bfloat16 depends entirely on the hardware.

Acknowledgment

We thank João Araújo, Milad Alizadeh, and other colleagues in Cohere & Cohere For AI for helpful feedback and support. We also thank Tim Dettmers for assisting us in replicating the outlier dimension definition and results from LLM.int8(). We would also like to thank Luis Serrano for providing feedback on this post.


References

[1] Wei et al. “Emergent Abilities of Large Language Models.” TMLR, 2022.

[2] Dettmers et al. “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.” NeurIPS, 2022.
