The demand for applications powered by large language models (LLMs) is rising, from chatbots to virtual assistants to content generation. However, achieving optimal performance and accuracy often requires finetuning these models on specific tasks and domains.
Traditionally, finetuning involved updating the weights of all layers in the model. While this approach generally yields the best quality finetunes, it poses significant challenges in training time and serving costs. Updating all layers during finetuning can be time-consuming and require extensive computational resources, leading to slower iterations. Moreover, serving each finetuned model on a dedicated GPU resource incurs substantial costs, especially when dealing with multiple finetuned models.
At Cohere, we implement the T-Few finetuning technique (Liu et al., 2022) for creating custom LLMs. Unlike traditional finetuning methods, T-Few selectively updates only a small fraction of the model's weights, reducing both training time and computational resources.
In this blog post, we will delve into the concept of T-Few finetuning, explore its benefits in training efficiency and serving scalability, and explain our implementation workflow.
An Overview of T-Few Finetuning
T-Few finetuning is an additive parameter-efficient finetuning technique that inserts additional weights comprising approximately 0.01% of the baseline model's size. Specifically, it adds 1D vectors l_K, l_V, and l_FF that are multiplied elementwise with the key (K), value (V), and feed-forward activations during inference. The dimensionality of each T-Few vector is fixed, determined by the outer dimensions of the K, V, and feed-forward weights in each layer. These additive weights are generally much smaller than those introduced when using adapters.
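The per-layer rescaling described above can be sketched as follows. This is a minimal NumPy illustration of the mechanism, not Cohere's implementation; the names (`l_k`, `l_v`, `l_ff`) and shapes are our own assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq = 8, 32, 4

# Frozen base-model weights for one transformer layer (illustrative shapes)
W_k = rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model))
W_ff = rng.standard_normal((d_model, d_ff))

# T-Few adds one learned 1D vector per rescaled tensor; initializing
# to ones means the finetune starts out as an exact no-op.
l_k = np.ones(d_model)
l_v = np.ones(d_model)
l_ff = np.ones(d_ff)

x = rng.standard_normal((seq, d_model))

# During inference, the vectors rescale the activations elementwise.
k = (x @ W_k) * l_k                  # rescaled keys
v = (x @ W_v) * l_v                  # rescaled values
h = np.maximum(x @ W_ff, 0) * l_ff   # rescaled feed-forward activations
```

Note that each vector's length matches the outer dimension of the tensor it rescales (`d_model` for K and V, `d_ff` for the feed-forward activations), which is why the size of the additive weights is fixed by the base model's architecture.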
The weight updates are localized to the T-Few layers during the finetuning process. Isolating the weight updates to these T-Few layers significantly reduces the overall training time compared to updating all layers. This process also reduces the resource footprint required to finetune the model.
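A rough sketch of why this is cheap: gradients are computed and applied only to the tiny T-Few vectors while the base weights stay frozen. The toy least-squares objective below is ours for illustration, not the actual training loop:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 16
W = rng.standard_normal((d, d))   # frozen base weight: never updated
l = np.ones(d)                    # trainable T-Few vector: d parameters
x = rng.standard_normal((32, d))
y = (x @ W) * 1.5                 # toy target: base output, uniformly scaled

lr = 0.01
for _ in range(500):
    pred = (x @ W) * l
    # Gradient of mean squared error w.r.t. l only; W receives no update.
    grad_l = 2 * ((pred - y) * (x @ W)).mean(axis=0)
    l -= lr * grad_l

# Only d parameters were trained, versus d * d in the frozen base weight.
```

The optimizer state scales with the trainable parameters, so freezing the base weights shrinks both the backward pass and the memory footprint.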
Stacking T-Few Finetunes
The low overhead of T-Few's algorithm allows us to optimize how we serve finetunes. By stacking multiple specialized sets of weights to a single base model, we can efficiently serve many finetunes on a single GPU, enhancing serving scalability.
When we stack multiple T-Few finetunes, the 1D vectors l_K, l_V, and l_FF become 2D matrices, where the additional dimension comes from the number of stacked finetunes. When we receive requests for finetune A and finetune B, we slice the 2D matrix into the corresponding 1D vector for each request and use it in the computation. A unique identifier for each finetune allows a single batch to include requests for multiple finetunes: only the weights corresponding to the requested finetune are used, and this conditional computation ensures the right set of finetuning weights is applied to each request.
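The stacked serving path can be sketched like this: each finetune's vector occupies one row of a 2D matrix, and a per-request finetune ID selects the row. This is a simplified illustration; the identifiers and shapes are assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
num_finetunes, d_model = 3, 8

# Stacked T-Few vectors: one row per finetune served on this GPU.
l_ff_stack = rng.standard_normal((num_finetunes, d_model)) + 1.0

# One batch mixing requests for finetune A (id 0) and finetune B (id 1).
finetune_ids = np.array([0, 1, 0])       # one identifier per request
h = rng.standard_normal((3, d_model))    # activations, one row per request

# Slice the 2D stack into the 1D vector each request needs.
scaled = h * l_ff_stack[finetune_ids]
```

Because each request's output is computed only from its own row of the stack, the results for finetune A are unaffected by the values stored for finetune B, which is the isolation property discussed below.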
This means that the output of requests to a specific finetune is isolated from the values of the other finetune weights in the stack. This isolation is important when serving customers with different use cases: it ensures one customer's dataset or requests cannot influence another customer's results.
This approach unlocks the ability to batch requests for different finetunes and perform concurrent inference. Rather than allocating and managing individual GPU resources per finetune, the stacked model condenses many finetunes into a single deployable unit. This maximizes GPU utilization by allowing multiple finetunes to share GPU resources during inference.
The concurrent inference capability provided by T-Few stacking revolutionizes the scalability of serving multiple finetunes, making it an ideal approach for applications requiring efficient and high-performance language models.
The T-Few Finetuning Workflow: A Deeper Look
In this section, we take a deeper look at how T-Few finetuning and serving are performed.
A finetuning job starts with a user request, made via the dashboard or the Python SDK. See our previous blog post for more information about creating a custom model.
The client request is sent to the backend with all the user-defined information, such as the finetuning dataset and hyperparameters. Specifically, these are sent via the Finetune Pub/Sub, which notifies the Finetune Manager.
The Finetune Manager receives the request and updates the finetuning status in the Finetune Pub/Sub. It then creates the finetuning workflow, which starts the finetuning run.
A finetuning workflow starts with a preprocessing step. The input dataset is preprocessed and prepared to be used for finetuning. This step ensures that the data is in the appropriate format and ready for model training.
Then, the finetuning run is performed on multiple hardware accelerators (TPUs or GPUs), where the model weights are updated using the T-Few strategy. The efficiency of this technique makes the finetune weights highly portable: the finetuned weights (e.g., 2 MB) constitute only a tiny fraction of the baseline model (e.g., 10 GB).
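To make the size claim concrete, a quick back-of-the-envelope calculation using the example figures from the text (the exact sizes vary by model):

```python
# Size of the T-Few finetune weights relative to the base model,
# using the illustrative 2 MB / 10 GB figures from the text.
base_model_bytes = 10 * 1024**3   # e.g., a 10 GB baseline checkpoint
tfew_bytes = 2 * 1024**2          # e.g., 2 MB of finetune vectors

fraction = tfew_bytes / base_model_bytes
print(f"T-Few weights are {fraction:.4%} of the base model")
```

At roughly 0.02% of the base model's size, a set of finetune weights can be uploaded, downloaded, and swapped far faster than a full checkpoint, which is what makes the stacked-serving workflow practical.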
Once finetuning is completed, the weights are uploaded to a cloud storage bucket. This step includes converting the weights into a format that is ready for serving.
For serving, the finetune weights are downloaded to the pod during the Init Container stage before we launch the Triton Inference Server. This is part of the model service component that handles serving the models to end users. The creation and updating of the model service is done via the Kubernetes API and our Cohere Kubernetes Operator.
By following these steps, the T-Few stacked model is set up for efficient serving, enabling concurrent inference on multiple finetunes and maximizing the utilization of the GPU resources.
T-Few finetuning offers an efficient approach to finetuning large language models, addressing the challenges of slow training times and costly serving resources. By updating only a small fraction of the model's weights and enabling model stacking, T-Few finetuning significantly reduces training time while maintaining high-quality finetunes.
Additionally, T-Few stacking allows for the concurrent inference of multiple finetunes, maximizing GPU utilization and improving serving scalability. With these benefits, T-Few finetuning becomes a valuable technique for efficient language model development and deployment.
See our documentation for more information about training custom models.
Thanks to Dwarak Talupuru and Siddhartha Rao Kamalakara for their contributions.