Say Hello to Precision: How Rerankers and Embeddings Boost Search

Explore how the latest generation of language models is powering search and retrieval for AI applications.

Perhaps you're using AI to transform your customer support capabilities, enabling users to find answers on their own. Or perhaps you're building a knowledge assistant designed to answer questions about your company's confidential documents. However you're using AI, access to accurate, relevant information and control over costs are likely your top priorities. Central to initiatives like these is the power of search and retrieval: the ability to efficiently surface the right information and deliver answers quickly and affordably.

Many of the world's business applications rely on powerful search capabilities. Now, the latest generation of language models is bringing new life to legacy search systems and delivering unparalleled performance, even for complex domains and multilingual deployments.

In this article, we reveal the ins and outs of how models like Rerank and Embed can improve accuracy, relevance, and speed for legacy search tools. We also explore how these models are becoming the secret sauce that’s powering enterprise AI applications with retrieval-augmented generation (RAG).  

Let’s start by looking at the various elements that make up a search and retrieval system that’s boosted with AI. 

The Elements of Great Search 

Enterprise search has traditionally relied on keyword-based search methods (e.g., BM25). Many companies have also adopted semantic search, which is a powerful way to search large databases using the semantic meaning of a query. It differs from traditional keyword search in that it focuses on understanding the meaning behind the words used in a search query, rather than just matching keywords. Using embeddings for semantic search can enhance the relevancy and precision of results.

Comparison of common search methods

Today, it's never been easier to boost a legacy search system by combining it with, or replacing it altogether with, semantic search powered by new state-of-the-art language models. This article focuses on two approaches that many of our customers take:

  1. Reranking: Most customers start by adding a reranker to the last stage of their search system. A reranker is a language model that computes relevance scores for retrieved documents. It is a simple plug-in model that can dramatically improve the accuracy of legacy search systems and downstream AI applications, with minimal intervention or cost.
  2. Dense retrieval: Some customers choose to evolve their search systems to incorporate dense retrieval, which requires computing and storing embeddings for all of the documents in their corpus. This is a larger lift than solely implementing a reranker, but it can lead to better upstream retrieval.

These two methods, separately or combined, are driving vast improvements in search for all sorts of applications, including powering the search and retrieval steps for RAG applications for enterprise. Let’s dive into each one separately.

Workflow of semantic search with embeddings followed by a reranker

Boosting Search with Rerank

One of the fastest and easiest ways to boost search is by using a reranker. A reranker model is a type of language model that computes a relevance score between a document and a search query. Rerankers can be applied to keyword, vector, or hybrid search systems. In all cases, adding a reranker tends to lead to improved performance. Rerankers can also be very quick to implement, with minimal interventions and costs. For example, Cohere Rerank can be added to a legacy search system with just a couple lines of code, improving results by as much as 50% based on academic benchmarks.

A reranker works as follows: for each query-response pairing, the model computes a relevance score, and the pairs are then ordered by score in descending order. As the name hints, relevance scores are high for pairs in which the response is relevant to the query, and low otherwise. Rerankers can be used in a variety of architectures and setups, with or without a vector database.
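
As a rough illustration, here is a minimal sketch of this flow using Cohere's Python SDK. The client class, model name, and response fields are assumptions based on current documentation, so check them against the SDK version you use.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

query = "What is the capital of Canada?"
documents = [
    "Toronto is the largest city in Canada.",
    "Ottawa is the capital city of Canada.",
    "Canada shares a long border with the United States.",
]

# The reranker scores each query-document pair and returns the pairs
# sorted by relevance score in descending order.
response = co.rerank(
    model="rerank-english-v3.0",  # assumed model name
    query=query,
    documents=documents,
    top_n=2,
)

for result in response.results:
    print(f"{result.relevance_score:.3f}  {documents[result.index]}")
```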

Example of how Cohere Rerank works

Most applications with a search component will likely see performance improvements with a reranker. For example, a SaaS business that delivers workforce collaboration and productivity tools came to us because, like many search applications, their solution was frustrating customers by giving poor results and taking too much time to find the right answers. By implementing Rerank, they saw immediate improvements to search results which led to higher customer satisfaction. 

Beyond improving legacy search, rerankers are also the fastest and easiest way to make a RAG pipeline better. RAG is a method of augmenting a generative model's natural language capabilities with specific and current information by connecting it to a knowledge base or proprietary datastore. Implementing RAG, though, can be complicated. For optimal performance, RAG requires powerful search capabilities that can handle analyzing large volumes of data across different sources quickly, efficiently, and reliably. 
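
For context, a RAG call can be as simple as passing retrieved documents alongside the user's question to a generative model like Command R. The exact parameter names and response fields below are assumptions based on current Cohere documentation.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

# Documents retrieved by your search system (e.g., dense retrieval + rerank).
retrieved_docs = [
    {"title": "Refund policy", "snippet": "Returns are accepted within 30 days."},
    {"title": "Shipping FAQ", "snippet": "Standard shipping takes 3-5 business days."},
]

# The generative model grounds its answer in the supplied documents.
response = co.chat(
    model="command-r",            # assumed model name
    message="How long do I have to return an item?",
    documents=retrieved_docs,     # assumed parameter for grounding
)

print(response.text)
```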

In most cases, adding Rerank to a RAG solution will deliver immediate improvements. We recently worked with a digital healthcare company building a health and wellness AI coach that considers user demographics, health goals, and content history to make recommendations to users. By connecting to the company's content platform and using the thousands of pieces of content they already had, the AI coach can draw from the company’s library of wellness advice as a source of truth. Adding Rerank to their RAG solution provides another layer of accuracy to the retrieval process and ultimately to the application’s performance. 

Occasionally, adding a reranker to your search solution might not be the best option. If your application has simple queries and already delivers good performance, adding another layer of complexity may not be worth it. Although rerankers are great at dynamically improving search results when used on large retrievals, they can add latency to your application. They are compute-intensive, so it’s best to use them on a subset of results as the last stage of your search system. If your search system needs to handle large, complex language queries, we recommend starting with embeddings-based search followed by a reranker. We’ll cover this topic next. 

Using Embeddings for Search

One popular type of semantic search is called dense retrieval, which is the process of retrieving documents based on their semantic similarity to a search query. There are typically four steps to dense retrieval (a code sketch follows the list):

  1. Use an embedding model like Cohere Embed to turn documents into text embeddings, which are a type of vector (lists of numbers). These embedding vectors aim to capture the meaning of the text, and they are then stored in a vector database for later retrieval. Consider working with a model that already has integrated partnerships with vector database providers to make deployment easier and more efficient.
  2. At query time, compute the text embedding for a user’s search query and use it to search the vector database containing your previously computed document embeddings.
  3. Use a vector similarity score to find the documents that are most similar to the query. This step is handled by the vector database or search algorithm you choose to deploy.
  4. Finally, return a set of relevant documents. These documents can then be passed through a reranker to perform a final reordering by relevance for best results.
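
Here is a minimal sketch of these four steps using Cohere's Python SDK, with a small in-memory NumPy index standing in for a real vector database. The model name and input_type values are assumptions based on current documentation.

```python
import cohere
import numpy as np

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available 24/7 via chat and email.",
    "Enterprise plans include a dedicated account manager.",
]

# Step 1: embed the documents. In production, store these vectors
# in a vector database instead of an in-memory array.
doc_embeddings = np.array(
    co.embed(
        texts=documents,
        model="embed-english-v3.0",    # assumed model name
        input_type="search_document",  # assumed input_type value
    ).embeddings
)

# Step 2: embed the user's query at search time.
query = "How long do I have to return an item?"
query_embedding = np.array(
    co.embed(
        texts=[query],
        model="embed-english-v3.0",
        input_type="search_query",
    ).embeddings[0]
)

# Step 3: score documents by cosine similarity to the query.
scores = doc_embeddings @ query_embedding / (
    np.linalg.norm(doc_embeddings, axis=1) * np.linalg.norm(query_embedding)
)

# Step 4: return the top documents; these candidates can then be
# passed to a reranker for a final ordering by relevance.
top_k = np.argsort(scores)[::-1][:2]
print([documents[i] for i in top_k])
```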

Building a dense retrieval pipeline with platforms like Elasticsearch, OpenSearch, and Pinecone can significantly enhance the quality of search results compared to traditional keyword-based search, particularly for multilingual text, complex language queries, or noisy data.

Many of our customers building AI knowledge assistants with RAG are choosing to implement semantic search with embeddings to improve their applications. For example, a financial research platform working with investment firms came to us wanting to build a knowledge assistant that could search across financial reports, analyst research, investor call transcripts, and other data to generate timely and accurate answers to typical investor questions in their own language. To boost the relevance and precision of their solution, they are implementing semantic search with embeddings using Cohere Embed, which also works across 100+ languages and can allow applications to multi-hop across datasets.   

There are several elements to consider before you start with semantic search: 

  • Picking the right embedding model 
  • Preparing data for embedding
  • Choosing a vector database

Let’s cover each of these. 

Picking the Right Embedding Model

Choosing the right embedding model for your product or service is a critical decision that can significantly impact the effectiveness of your search and retrieval systems and applications. 

To get started quickly, make sure the embedding model you are considering is compatible with your existing tech stack. For example, run a quick check to see whether the model can be deployed via your current cloud provider, such as AWS, OCI, Azure, or GCP, and whether it is compatible with your vector database platform (some of these platforms have limitations on the dimensions of embeddings). Ideally, you'll want a model provider with a strong community that can offer ongoing maintenance and support. Then, turn to the specific capabilities and performance considerations.

Capabilities

When comparing models, look for advanced capabilities that you’ll likely need to scale your application. For example, Cohere Embed offers multilingual support for over 100 languages and can be used to search within a language (e.g., search with a French query on French documents) and across languages (e.g., search with a Chinese query on Finnish documents). 

Many of our enterprise customers are faced with searching through noisy datasets with varying levels of content quality and information. With our latest embedding model, we introduced a content quality signal, in addition to topic similarity, that helps address this issue. The boost in retrieval performance from adding content quality checks also helps with multi-hop queries, where the search application needs to sort through multiple documents and combine information to find the right answer.

Because embedding models can serve multiple applications beyond search, check that the model can be optimized for specific use cases. For example, embeddings can also be used for clustering tasks or sentiment analysis, so you’ll want to be able to choose the most suitable type for your use case.
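
For instance, Cohere Embed v3 models expose an input type setting for this purpose. The parameter name and values below are assumptions based on current documentation, so verify them against the SDK version you use.

```python
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

text = ["Quarterly revenue grew 12% year over year."]

# The same model can produce embeddings tuned for different tasks
# by changing the input_type (values assumed from current docs).
for_search = co.embed(
    texts=text,
    model="embed-english-v3.0",
    input_type="search_document",  # documents to be retrieved later
)

for_clustering = co.embed(
    texts=text,
    model="embed-english-v3.0",
    input_type="clustering",       # grouping similar texts together
)
```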

Performance

After reviewing the above capabilities, consider model performance across three areas: quality, speed, and storage costs. Look for strong results on industry-trusted benchmarks like MTEB and BEIR, and be sure to conduct your own tests to compare models. It is important to test the model on actual production data, since benchmarks may not be noisy or challenging enough.

Also, consider how fast the model generates embeddings and whether it can meet your application's performance requirements, especially if real-time processing is needed (e.g., if your datasets are updated daily). 

Throughput can quickly become the most important metric as you look to scale the solution, for two reasons. First, the costs of a poor solution can quickly skyrocket. For example, semantic search at scale, with hundreds of millions to billions of embeddings, can be very expensive: that volume of embeddings requires large amounts of memory, more compute, and longer runtimes. To avoid excessive costs, explore whether the embeddings service and model support compression, a method of reducing the memory you need for the vector space, which in turn reduces costs. Second, you'll want control over the deployment, letting you avoid common bottlenecks of managed offerings. For example, we offer our customers bulk embed options and private deployment.
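
To make the storage cost concrete, here is a back-of-the-envelope calculation. The corpus size, embedding dimension, and int8 compression option are illustrative assumptions; substitute your own model's dimensions and supported formats.

```python
# Rough memory footprint of a vector index (illustrative numbers only).
num_embeddings = 250_000_000   # e.g., a large corpus after chunking
dimensions = 1024              # assumed embedding dimension

float32_bytes = num_embeddings * dimensions * 4  # 4 bytes per float32 value
int8_bytes = num_embeddings * dimensions * 1     # 1 byte per int8 value

print(f"float32 index: {float32_bytes / 1e12:.2f} TB")  # ~1.02 TB
print(f"int8 index:    {int8_bytes / 1e12:.2f} TB")     # ~0.26 TB
```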

Look for models that can scale with your data volume and user base without disproportionate increases in computational resources or delays. 

Cohere Embed is optimized for performance, RAG quality, and compression

Preparing Data for Embedding

Preparing data in advance of using an embedding model is important to ensure that the model can generate meaningful and accurate embeddings. This process involves several steps, each designed to clean, organize, and structure your data in a way that maximizes the effectiveness of the embedding model. See the table below with some key steps to take.

Data preparation steps to consider before embedding

A Deeper Look at Data Chunking 

When it comes to enterprise solutions, data structuring can take many forms. Enterprises are working with multiple, large datasets, complex domain-specific information, and an assortment of potential use cases. Data chunking, the process of segmenting text or other data into smaller pieces or chunks, aims to generate more precise embeddings: each chunk's numerical representation captures a focused piece of meaning, enabling better performance in tasks like semantic search and question answering. Think about it as if you had a four-sentence paragraph. If you had to summarize it with five words, you would lose most of the meaning. But if you could summarize each sentence with five words, you would retain more of the overall meaning.

Implementing a chunking approach effectively requires a deep understanding of your data, clear objectives, and careful consideration of the technical and ethical aspects of data processing. The approach you take will vary depending on the data you are working with and the outcomes you wish to achieve. In most cases, you’ll need to iterate and try different strategies to determine which is the best approach for your particular use case. 

The key points to remember here are:

  • Smaller chunks can lead to higher fidelity, but they also generate a larger search index. Be careful not to make chunks so small that they lose context from surrounding, related sentences.
  • While larger chunks might be less precise and lead to less accurate answers, they will generate fewer embeddings, making the search index more manageable.

A simple rule is to start by identifying the context window for your chosen embedding model. The context window is the maximum number of tokens that a model can consider at one time, and it determines the upper limit of chunk size used for input. Most embedding models have context lengths ranging from 256 to 8K tokens, but we recommend keeping chunks to around 512 tokens or fewer. From our research and experiments, dense embedding models often perform best with a few hundred tokens, even if they technically support longer contexts. Your chosen chunk size defines the unit of information that is stored and retrieved in your vector database. This affects your memory costs, as well as the sources of information you can use and the level of performance or granularity you can expect. See the table below for some pros and cons of several chunking methods.

A comparison of the different approaches to data chunking
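
As a starting point, here is a simple sketch of fixed-size chunking with overlap. It uses whitespace word counts as a rough proxy for tokens, and the word budget, overlap, and file name are illustrative assumptions; a real implementation would use your embedding model's tokenizer and the chunk size that performs best in your own tests.

```python
def chunk_text(text: str, max_words: int = 350, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of roughly max_words words.

    Word counts are a crude stand-in for tokens; ~350 words keeps most
    chunks comfortably within a few hundred tokens.
    """
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
        if start + max_words >= len(words):
            break
    return chunks


# Example usage with a hypothetical local file.
with open("annual_report.txt") as f:
    chunks = chunk_text(f.read())
print(len(chunks), "chunks ready for embedding")
```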

Choosing a Vector Database

To store, index, and manage your embeddings, you will need a vector database. Vector databases are optimized to store high-dimensional vectors like embeddings, and they implement advanced indexing algorithms for search, such as approximate nearest neighbor (ANN) search.

As a critical element in your AI pipeline, choosing the right vector database can provide another layer of efficiency and performance. Cloud services now provide secure and direct access to managed vector databases and can work with your chosen LLM. These are a good option if you don’t have the time and resources to build the infrastructure and handle ongoing management of the database. Alternatively, you can build and host a vector index in your environment for added control and management. 

Choosing a vector database involves a comprehensive evaluation of factors in the context of your organization's specific needs, resources, and strategic goals. Look at your choices through various lenses, such as scalability, performance, efficiencies, integrations, costs, security, and customization. 

Here are the top ten questions you should ask before making a decision:

  1. Do you want a managed cloud solution or to self-host the database? 
  2. How well does the database scale as your data and workload grow?
  3. Does it support advanced search features beyond vector similarity search? 
  4. What is the search quality, query latency, and throughput?
  5. How quickly are newly added vectors available for querying?
  6. What types of metadata filtering does it support (numeric, geo, etc.)?
  7. Does it provide client libraries and integrations for the languages and frameworks you use?
  8. How easy is the database to deploy, monitor and operate in production (e.g., cluster management tools)?
  9. Does it provide the security and compliance capabilities you need, like encryption, access controls, and certifications?
  10. What is the total cost, including hosting fees, data transfer, and compute?

You may also consider conducting pilot projects to directly assess individual fit with your use cases and requirements. A hands-on evaluation can provide valuable insights beyond what's available in documentation and vendor claims, ensuring that you make a well-informed decision.


Leveraging advanced language models like Embed and Rerank to enhance search capabilities is a transformative step for businesses aiming to build high-performing AI solutions. These models not only revitalize legacy search systems but also enable precise, fast, and cost-efficient RAG use cases. Combined with a RAG-optimized generative model like Command R, enterprises can deliver scalable, production-grade solutions.

