We’re excited to share that we’ve updated our capability for fine-tuning classification models to give customers better performance and enable new options for multilabel and multilingual classification.

Fine-tuning for classification models has been available to Cohere customers for quite some time. Enterprises use classification models to build applications such as spam filtering, sentiment analysis, credit scoring, medical diagnosis, and sentence classifiers. Drawing on customer feedback and recent research in the field, we reworked this capability and made several breakthroughs.

Better Classification with Less Data

We developed a proprietary training method to re-architect our fine-tuning for classification capability. The improvements we made in few-shot settings (i.e., training models on very small datasets) let us lower the data requirements for fine-tuning classification models. Previously, we required at least 250 text-label pairs to fine-tune a model. The minimum is now 32, which means far fewer text-label pairs to annotate. To check whether this came with any trade-off in performance, we compared our new training method with the previous one on the Rotten Tomatoes sentiment detection task:

Our model fine-tuned with 40 examples using our new approach is noticeably better than our model fine-tuned with 8,000+ examples using our old one. We repeated that comparison on a dozen other tasks and saw an average accuracy improvement of 30% for small datasets with fewer than 250 data points.

To get even better results, we still encourage you to provide more examples than the minimum and to make them as diverse as possible. The quality of your data is very important. If you are using a small dataset, double-check that each label is correct. With so few examples to learn from, every labeling error can significantly reduce the model's accuracy.
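The checks described above can be partly automated before uploading a dataset. The following is a minimal sketch, not part of Cohere's tooling: `check_dataset` and its `min_per_label` threshold are hypothetical names and values chosen for illustration, and only the minimum of 32 pairs comes from this post. It flags datasets that are too small, labels with very few examples, and identical texts that were given different labels.

```python
from collections import Counter, defaultdict

MIN_EXAMPLES = 32  # minimum number of text-label pairs stated in this post


def check_dataset(pairs, min_per_label=2):
    """Return a list of warnings for a list of (text, label) pairs.

    min_per_label is an illustrative threshold, not a Cohere requirement.
    """
    warnings = []
    if len(pairs) < MIN_EXAMPLES:
        warnings.append(
            f"only {len(pairs)} examples; at least {MIN_EXAMPLES} are required"
        )

    # Count examples per label to spot under-represented categories.
    per_label = Counter(label for _, label in pairs)
    for label, count in sorted(per_label.items()):
        if count < min_per_label:
            warnings.append(f"label {label!r} has only {count} example(s)")

    # The same text annotated with different labels is a likely annotation error.
    labels_by_text = defaultdict(set)
    for text, label in pairs:
        labels_by_text[text.strip().lower()].add(label)
    for text, labels in labels_by_text.items():
        if len(labels) > 1:
            warnings.append(
                f"conflicting labels {sorted(labels)} for text {text!r}"
            )
    return warnings
```

An empty list means none of these simple checks fired. It does not prove the labels are correct, so a manual review of a small dataset is still worthwhile.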

Lower Latency and Higher Throughput

Speed is critical for many applications as classification is often part of a bigger pipeline that needs to obey strict latency constraints. We made several improvements to our overall serving infrastructure to handle requests more efficiently and improve throughput.

Customers can now configure two settings to improve throughput:

Batch Size - Batch size is the number of examples sent together in the same request. Larger batch sizes make more efficient use of our GPUs and greatly improve throughput. We recommend batching as many examples as possible. The current maximum batch size is 96.
Number of Parallel Requests - If you cannot send requests in batches, or want to send more than 96 examples at once, send requests in parallel. Our infrastructure can combine some of these requests.
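The two settings above can be combined on the client side. The sketch below is an illustration under assumptions rather than SDK code: `classify_batch` is a hypothetical callable standing in for whatever function sends one request and returns one prediction per text, and `max_parallel=4` is an arbitrary default. Only the batch limit of 96 comes from this post.

```python
from concurrent.futures import ThreadPoolExecutor

MAX_BATCH_SIZE = 96  # current maximum batch size stated in this post


def make_batches(texts, batch_size=MAX_BATCH_SIZE):
    """Split texts into consecutive batches of at most batch_size items."""
    if not 1 <= batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be between 1 and {MAX_BATCH_SIZE}")
    return [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]


def classify_all(texts, classify_batch, max_parallel=4):
    """Classify texts with full batches sent in parallel.

    classify_batch is a hypothetical callable: it takes a list of texts
    and returns a list with one prediction per text, in the same order.
    """
    batches = make_batches(texts)
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        # map() yields results in the order the batches were submitted,
        # so predictions line up with the input texts.
        results = pool.map(classify_batch, batches)
        return [prediction for batch in results for prediction in batch]
```

For 200 texts this sends three requests (96, 96, and 8 examples). Threads are used here because the work is network-bound; the right degree of parallelism depends on your rate limits and latency budget.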

Here is a comparison of the number of examples processed each second by our old and new fine-tuning approaches. These numbers are not theoretical: we got them by benchmarking our production system.

Throughput Comparison of Fine-tuning Approaches

As you can see, depending on the setup, our new system is between 7x and 120x faster than the previous one and can process more than 2,000 examples per second. We hope this gain of up to two orders of magnitude helps unlock new use cases for classification.

Multilabel Classification

In some classification setups, the categories are not mutually exclusive. Let’s take the example of movie genres. One movie can be scary and funny; another can qualify as action, science-fiction, and fantasy. To build a classifier on movie genres, one could create binary classifiers for every genre, e.g., create a classifier for “Is this movie a comedy?” and another for “Is this movie scary?” etc. However, that means that when a new movie comes in, you need to run every one of these classification models to get a complete description of the movie. This process is expensive, cumbersome, and slow.

Enter multilabel classifiers. Multilabel classifiers are classification models that can predict several categories at the same time. With multilabel classification, you can now create one multilabel classifier that can categorize a movie into multiple genres in one go.

With our new offering, you can train a single-label or a multilabel classification model. Training follows exactly the same process in both cases. The only difference is in your training data: for single-label, each text piece must correspond to exactly one category, while for multilabel, each text piece corresponds to a list of categories, which can contain zero, one, or more categories.
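To make the difference concrete, here is a minimal sketch of the two data shapes. The in-memory representation (tuples of text and target) and the `is_valid_example` helper are assumptions made for illustration; they are not the upload format or a function of the Cohere SDK.

```python
# Single-label: each text piece maps to exactly one category.
single_label_data = [
    ("A haunted house terrorizes a family.", "horror"),
    ("Two friends bumble through a road trip.", "comedy"),
]

# Multilabel: each text piece maps to a list of categories.
multilabel_data = [
    ("A haunted house full of slapstick ghosts.", ["horror", "comedy"]),
    ("Rebels fight an empire across the galaxy.",
     ["action", "science-fiction", "fantasy"]),
    ("A quiet documentary about bread.", []),  # zero categories is allowed
]


def is_valid_example(text, target, multilabel):
    """Check one training example against the shapes described above."""
    if not isinstance(text, str) or not text.strip():
        return False
    if multilabel:
        # A list of zero, one, or more distinct category names.
        return (
            isinstance(target, list)
            and all(isinstance(c, str) and c for c in target)
            and len(set(target)) == len(target)
        )
    # Single-label: exactly one non-empty category name.
    return isinstance(target, str) and bool(target)
```

Running the same validation over both datasets mirrors the point made above: the process is identical, and only the shape of the target changes.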

This unlocks a new range of use cases that require multilabel classifiers, including document categorization, customer support ticket tagging, healthcare diagnosis, content recommendation, and more.

Fine-tuning English and Multilingual Classifiers

You can now choose between an English-only and a Multilingual base model when training a classification model.

The English model will be the best option if you know that more than 99% of the text pieces you will classify are written in English. The multilingual model will give you the best results on non-English texts. This model is also highly performant with English text, so it is always a good default.

Multilingual fine-tuning will give you better results on texts written in the same language as the text pieces you used for training. However, if you are unsure about what language will be used, or if you can’t get such examples during training, training with examples written in another language (e.g. English) should still give you great results.

MAE (Mean Absolute Error) Across Fine-tuning Methods

The table above compares our new fine-tuning approach with numbers reported in the SetFit paper on the well-known Multilingual Amazon Reviews Corpus. We trained our model on 40 English data points from the training set (8 per label) and tested it on 6 different languages (English, German, Japanese, Chinese, French, Spanish). While SetFit achieved strong results, multilingual classification with Cohere’s fine-tuning method was better.
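For readers unfamiliar with the metric, MAE is the average absolute difference between the predicted and the true rating, so lower is better. The sketch below shows the computation on made-up star ratings; the numbers are purely illustrative and are not results from our evaluation or from the SetFit paper.

```python
def mean_absolute_error(true_ratings, predicted_ratings):
    """Average absolute difference between true and predicted ratings."""
    if len(true_ratings) != len(predicted_ratings):
        raise ValueError("both sequences must have the same length")
    if not true_ratings:
        raise ValueError("at least one rating is required")
    total = sum(abs(t - p) for t, p in zip(true_ratings, predicted_ratings))
    return total / len(true_ratings)


# Illustrative star ratings only (1 to 5), not real evaluation data.
true_stars = [5, 1, 3, 4]
predicted_stars = [4, 1, 5, 4]
print(mean_absolute_error(true_stars, predicted_stars))  # 0.75
```

Unlike plain accuracy, MAE penalizes a prediction that is two stars off more than one that is a single star off, which suits ordered labels such as review ratings.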

Final Thoughts

As shown above, our classification training pipeline went through a full rework to give you better accuracy on few-shot tasks (about 30% on average for small datasets), higher throughput (up to 120x), and more flexibility (you can now train multilabel and multilingual models). It is already available in our SDK and fine-tuning dashboard, and we cannot wait to see what you build with it!

To learn more, follow our developer guide and start fine-tuning classification models.

Fine-Tuning for Classification: Unlocking Multilabel and Multilingual Use Cases