The adoption of large language models (LLMs) is on the rise. Previously, many natural language processing (NLP) use cases required deploying several different models. With LLMs, one general-purpose model can support a wide variety of NLP use cases, greatly simplifying the integration of language-based machine learning capabilities, such as text generation, classification, semantic search, topic modeling, and entity extraction, into applications and systems.
At Cohere, our mission is to reduce the complexity of integrating NLP even further by exposing the capabilities of LLMs through a simple API. Our platform enables developers and teams to leverage the versatility and performance of LLMs without needing the resources and expertise to build and deploy these models themselves.
The Cohere platform currently offers three types of endpoints:
- Classify: performs text classification with just a few labeled examples. It powers NLP functionality such as content moderation, customer intent classification, and sentiment analysis.
- Generate: creates text for narrative content, such as articles, marketing copy, and summaries. It can also extract key information from documents.
- Embed: represents a piece of text as an embedding vector. This is useful for semantic search, recommendation engines, clustering, topic modeling, and more.
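To make the three endpoint types concrete, here is a minimal sketch of how requests to them might be assembled. The endpoint paths, field names, and base URL are assumptions modeled on typical REST conventions, not a drop-in client; consult the Cohere API reference for the exact schema.

```python
# Illustrative only: endpoint paths and request fields are assumptions,
# not the authoritative Cohere API schema.

def build_request(endpoint: str, **fields) -> dict:
    """Assemble a request description for one of the three endpoint types."""
    supported = {"generate", "classify", "embed"}
    if endpoint not in supported:
        raise ValueError(f"unknown endpoint: {endpoint}")
    return {
        "url": f"https://api.cohere.ai/v1/{endpoint}",  # assumed base URL
        "headers": {"Authorization": "Bearer <API_KEY>"},
        "json": fields,
    }

# One request per endpoint type described above.
generate_req = build_request("generate", prompt="Write a tagline for a coffee shop.")
classify_req = build_request(
    "classify",
    inputs=["I love this!"],
    examples=[
        {"text": "Great product", "label": "positive"},
        {"text": "Awful service", "label": "negative"},
    ],
)
embed_req = build_request("embed", texts=["semantic search query"])
```

The point of the sketch is the shape of the interface: one general-purpose model behind each endpoint, driven entirely by the request payload rather than by per-task model deployments.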
Enhancing Our Platform Performance
As the Cohere platform grows, we are continually looking for ways to improve the experience of interacting with our API. One of our key focus areas is the platform’s inference latency and throughput. From a user’s point of view, this translates into the time it takes to receive a response after making a request.
This is especially important in time-sensitive applications, such as customer support chatbots. For example, when conversing with a virtual agent, an end-user expects a swift response to their queries. Frequent delays may result in the user leaving the conversation in frustration and not getting the support they were looking for.
As we explored ways to get the best possible model performance, we realized the importance of having the flexibility to choose how our backend is implemented. We wanted to maximize the potential of the NVIDIA GPUs, and we could only achieve that if we had more implementation options.
For example, our previous inference setup relied primarily on pipeline parallelism, which prevented us from taking full advantage of multi-GPU inference. Switching to tensor parallelism would have delivered better model latency, but with our previous backend framework, making that work would have required an enormous amount of effort and customization.
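The distinction can be made concrete with a toy NumPy sketch (illustrative only: real multi-GPU inference moves tensors over NVLink/NCCL, not host arrays). Pipeline parallelism assigns whole layers to different devices, so a single request still traverses them one after another; tensor parallelism shards each weight matrix so every device works on the same layer at once.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 8))   # a single request
w1 = rng.standard_normal((8, 8))  # layer 1 weights
w2 = rng.standard_normal((8, 8))  # layer 2 weights

# Reference: the full model on a single device.
ref = (x @ w1) @ w2

# Pipeline parallelism: layer 1 lives on "GPU 0", layer 2 on "GPU 1".
# For one request the stages run sequentially, so per-request latency
# is not reduced; only throughput across many requests improves.
h = x @ w1             # stage on GPU 0
pipeline_out = h @ w2  # stage on GPU 1, which was idle until now

# Tensor parallelism: each weight matrix is sharded column-wise across
# the GPUs, so both compute part of every layer simultaneously and the
# partial results are concatenated (an all-gather in a real system).
def sharded_matmul(a, w, shards=2):
    parts = np.split(w, shards, axis=1)  # one column shard per GPU
    return np.concatenate([a @ p for p in parts], axis=1)

tensor_out = sharded_matmul(sharded_matmul(x, w1), w2)
```

Both schemes reproduce the single-device result exactly; the difference is that under tensor parallelism every GPU contributes to every layer, which is what reduces the latency of an individual request.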
Leveraging the NVIDIA Triton Inference Server
For these reasons, we opted for the NVIDIA Triton Inference Server as our inference server framework. The key factor in our decision was that Triton provides the flexibility of choosing different backend frameworks out of the box, allowing us to evaluate a selection of backends and identify the one best suited to our platform needs.
One of the backend frameworks that Triton supports is FasterTransformer, an open-source library that implements a highly optimized transformer layer for both the encoder and decoder to speed up inference. So, along with migrating to Triton as our inference server, we also migrated to FasterTransformer as our backend framework.
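In Triton, the backend is a per-model configuration choice. The fragment below sketches what a FasterTransformer model's config.pbtxt might look like; the parameter names follow the examples in NVIDIA's fastertransformer_backend repository, but exact fields vary by version, so treat this as an assumption rather than a drop-in configuration.

```
name: "gpt_fastertransformer"
backend: "fastertransformer"
max_batch_size: 1024

parameters {
  key: "tensor_para_size"
  value: { string_value: "4" }   # shard each layer across 4 GPUs
}
parameters {
  key: "pipeline_para_size"
  value: { string_value: "1" }
}
parameters {
  key: "model_type"
  value: { string_value: "GPT" }
}
```

Note that the degree of tensor and pipeline parallelism is expressed declaratively here, rather than being baked into custom serving code.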
Overall, the outcome has been impressive—after we migrated to Triton and started using the FasterTransformer backend, we’ve observed an increase of up to 4x in inference speed.
This was possible because FasterTransformer supports multi-GPU inference with tensor sharding; in other words, we were able to add tensor parallelism to our inference setup. This alone was a major win, because it meant we could maximize the potential of the multi-GPU system by significantly increasing its overall efficiency and utilization.
There were a few other factors in our decision as well. For example, the fast inter-GPU communication and ready-to-use fused operation kernels provided by the FasterTransformer backend make it possible to improve speed and performance even further.
Look at any industry vertical or individual organization, and you will find piles of unstructured text data. And this volume will only continue to grow as more and more of the world's population interacts online at an unprecedented rate. Imagine if there were a much easier way to process this data and make it useful and actionable.
Cohere is dedicated to making NLP technology—previously only available to the big players—accessible to developers and teams of any size. By continuing to enhance the experience of interacting with our API, we can unlock even more use cases and serve even more developers and organizations. Migrating to NVIDIA Triton Inference Server helps us take significant strides toward this goal, and we are excited about what we can achieve with it going forward.