
Introducing Embed v3
We're excited to introduce Embed v3, our latest and most advanced embeddings model. Embed v3 offers state-of-the-art performance on the trusted MTEB and BEIR benchmarks.
One of the key improvements in Embed v3 is its ability to evaluate both how well a query matches a document's topic and the overall quality of the content. This means it can rank the highest-quality documents at the top, which is especially helpful when dealing with noisy datasets. Additionally, we've implemented a special, compression-aware training method that substantially reduces the cost of running your vector database. This allows you to efficiently handle billions of embeddings without a significant increase in your cloud infrastructure expenses.
With Embed v3, developers can immediately:
- Improve search applications that engage with real-world, noisy data
- Improve retrieval for retrieval-augmented generation (RAG) systems
To try it for yourself, access Embed v3 now.
Overcoming Generative AI Limitations
One of the main challenges faced by today's generative models is their inability to connect with your company's data. For example, if you need a summary of discussions you've had with a particular client about pricing, standard generative models can't help because they lack knowledge about what was discussed and therefore cannot provide a summary.
A promising approach to overcoming this limitation is RAG. In our example, let’s say you have data about your conversations with your clients. That data can be transformed by an embedding model and stored in a vector database. If you now want a summary of a previous pricing discussion with a specific client, that embedding model can search for and retrieve the most relevant conversations, which can then be used to augment a generative model with relevant information.
This enables the generative model to provide a comprehensive summary, allowing you to ask detailed follow-up questions, such as “What objections did the client bring up?” and “How did we respond when other clients brought up similar objections?”
Good retrieval quality is essential to make this work.
Embed v3 is Cohere’s Newest Embedding Model
We are releasing new English and multilingual Embed versions with either 1024 or 384 dimensions. All models can be accessed via our APIs. As of October 2023, these models achieve state-of-the-art performance among 90+ models on the Massive Text Embedding Benchmark (MTEB)* and state-of-the-art performance for zero-shot dense retrieval on BEIR**.
* MTEB: Broad benchmark covering retrieval, classification, and clustering tasks (56 datasets)
** BEIR: Benchmark focused on out-of-domain retrieval (14 datasets)
All models return normalized embeddings and can use dot product, cosine similarity, and Euclidean distance as the similarity metric. All metrics return identical rankings.
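The reason the rankings agree is that, for unit-length vectors, cosine similarity equals the dot product and squared Euclidean distance equals 2 minus twice the dot product, so all three metrics induce the same ordering. A quick numpy check (illustrative, with random unit vectors standing in for real embeddings):

import numpy as np

rng = np.random.default_rng(0)
docs = rng.normal(size=(5, 1024))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)   # unit-length vectors, like Embed v3 outputs
query = rng.normal(size=1024)
query /= np.linalg.norm(query)

dot = docs @ query                                    # equals cosine similarity for unit vectors
euclid = np.linalg.norm(docs - query, axis=1)         # ||d - q||^2 = 2 - 2 * (d . q)

# Descending dot product and ascending Euclidean distance give the same ranking
assert (np.argsort(-dot) == np.argsort(euclid)).all()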
The multilingual models support 100+ languages and can be used to search within a language (e.g., search with a French query on French documents) and across languages (e.g., search with a Chinese query on Finnish documents).
To get started, install the Cohere Python SDK:
pip install -U cohere
The following code snippet shows an example of how to use the models for semantic search:
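Below is a minimal sketch (it assumes the cohere Python client and the embed-english-v3.0 model name; use embed-multilingual-v3.0 for multilingual data):

import cohere
import numpy as np

co = cohere.Client("YOUR_API_KEY")

# Documents are embedded for storage with input_type="search_document"
docs = [
    "Embed v3 ranks the highest-quality documents at the top.",
    "The multilingual models support 100+ languages.",
    "Compression-aware training reduces vector database costs.",
]
doc_emb = np.asarray(
    co.embed(texts=docs, model="embed-english-v3.0", input_type="search_document").embeddings
)

# Queries are embedded with input_type="search_query"
query_emb = np.asarray(
    co.embed(texts=["Which languages are supported?"], model="embed-english-v3.0", input_type="search_query").embeddings
)

# Embeddings are normalized, so the dot product equals cosine similarity
scores = (query_emb @ doc_emb.T)[0]
for idx in np.argsort(-scores):
    print(f"{scores[idx]:.3f}  {docs[idx]}")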
For more information on the code snippet above, see our GitHub repo.
New Mandatory Parameter: Input Type
The new models require a new input parameter, input_type, which must be set for every API call to one of the following four values:
- input_type="search_document": Use this for texts (documents) you want to store in your vector database
- input_type="search_query": Use this for search queries to find the most relevant documents in your vector database
- input_type="classification": Use this if you use the embeddings as an input for a classification system
- input_type="clustering": Use this if you use the embeddings for text clustering
Using these input types ensures the highest possible quality for the respective tasks. If you want to use the embeddings for multiple use cases, we recommend using input_type="search_document".
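The call pattern is the same for the other two input types; only the input_type value changes. A quick illustration (same client and model-name assumptions as in the snippet above):

import cohere

co = cohere.Client("YOUR_API_KEY")
feedback = ["Great product, works as advertised.", "Stopped working after a week."]

# Features for a downstream classifier
clf_emb = co.embed(texts=feedback, model="embed-english-v3.0", input_type="classification").embeddings

# Vectors for text clustering
cluster_emb = co.embed(texts=feedback, model="embed-english-v3.0", input_type="clustering").embeddings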
Why Is the Input Type Needed?
Embeddings can serve multiple applications. For semantic search, for example, you don't want to embed the sentiment of a text in the vector space: when searching for iPhone reviews, you want to find both positive and negative reviews. For clustering tasks, however, sentiment often plays a critical role, and you typically want to separate positive customer feedback from negative feedback. Previous models, which lacked this distinction, often yielded suboptimal performance.
Furthermore, when you embed documents with input_type="search_document", the model can take content quality into account so that the highest-quality documents surface for your search queries.
Accuracy for Real-World Data
Previous models typically measure only the topic similarity between the query and the document. This is usually fine if your dataset contains one matching document per topic.
But in many real-world applications, you have redundant information with varying content quality. Some documents provide little insight into a topic, while others are very detailed. Sadly, models that measure only topic similarity tend to retrieve the least informative content, leading to a bad user experience.
We can observe this with the ada-002 embedding model from OpenAI. Assume we have the following document collection:
docs = [
"COVID-19 has many symptoms.",
"COVID-19 symptoms are bad.",
"COVID-19 symptoms are not nice",
"COVID-19 is a disease caused by a virus. The most common symptoms are fever, chills, and sore throat, but there are a range of others.",
"COVID-19 symptoms can include: a high temperature or shivering (chills); a new, continuous cough; a loss or change to your sense of smell or taste; and many more",
"Dementia has the following symptoms: Experiencing memory loss, poor judgment, and confusion."
]
When searching for "COVID-19 symptoms", there is a large difference in search result quality between a model that matches topics only (ada-002) compared to a model that matches both topic and content quality (Embed v3).
We observe that the OpenAI ada-002 embedding model retrieves content matching the topic (COVID-19 symptoms), but that content doesn't provide useful information for users or RAG applications. In contrast, Cohere’s Embed v3 model correctly identifies and ranks the most informative documents at the top.
We achieve this capability by measuring the topic match and content quality in the vector space. At query time, we can look for content that matches the topic (COVID-19 symptoms) and provides the most information. This significantly improves the user experience on noisy datasets with varying content quality.
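To reproduce the Embed v3 side of this comparison, you can rank the docs collection above against the query (same client and model-name assumptions as in the earlier snippet); the detailed symptom descriptions should surface first:

import cohere
import numpy as np

co = cohere.Client("YOUR_API_KEY")
doc_emb = np.asarray(
    co.embed(texts=docs, model="embed-english-v3.0", input_type="search_document").embeddings
)
query_emb = np.asarray(
    co.embed(texts=["COVID-19 symptoms"], model="embed-english-v3.0", input_type="search_query").embeddings
)

for i in np.argsort(-(query_emb @ doc_emb.T)[0]):
    print(docs[i])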
Evaluating Search Accuracy for Noisy Datasets
This effect is well measured with the TREC-COVID dataset, for which the Allen Institute for AI crawled scientific papers connected to COVID-19. Due to the nature of the web crawl, it was impossible to capture every paper correctly, and hence about 25% of the collection is noisy data: entries where the crawl of the paper failed. These entries provide no useful information for users, as they only contain a paper title.
The next graph shows that models that don't measure content quality often retrieve this noise, leading to a poor user and RAG experience. We measure nDCG@10, a metric that scores the quality of the top-10 results, discounting lower-ranked results logarithmically. The annotation for 50 queries (e.g., "What are the initial symptoms of COVID-19?") was performed by members of NIST, who annotated nearly 70,000 scientific papers in multiple rounds for their relevance to the given query using a graded scale: not relevant, partially relevant, and highly relevant.
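For reference, a common formulation of nDCG@10 divides the discounted gain of the returned ranking by that of the ideal ranking; here is a sketch using the graded labels above (0 = not relevant, 1 = partially relevant, 2 = highly relevant):

import math

def dcg_at_k(relevances, k=10):
    # relevances: graded labels in the order the search system returned the documents
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(returned_relevances, all_relevances, k=10):
    ideal = sorted(all_relevances, reverse=True)                  # best possible ordering
    return dcg_at_k(returned_relevances, k) / dcg_at_k(ideal, k)

# A ranking that puts a noisy, title-only entry (label 0) at position 1 scores
# lower than one that ranks the highly relevant papers first.
print(ndcg_at_k([0, 2, 2, 1], [2, 2, 1, 0]))   # ~0.72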
Better Retrieval for Multi-Hop Queries and RAG Systems
RAG is especially promising for multi-hop queries: queries where the answer cannot be found in a single document (one that could be returned as the top-ranked search hit) but instead requires combining information from different documents.
Here we see an example from HotpotQA, a dataset for multi-hop questions developed by Carnegie Mellon University, Stanford University, and Université de Montréal. For the depicted question, paragraphs from the Wikipedia articles Return to Olympus and Mother Love Bone must be retrieved and provided to the generative model as context to infer the correct answer.
Modeling this as a multi-step iterative process would be optimal, but it is challenging to set up and run in practice. How can we know the number of steps needed to find the final answer? How can we spot missing information and retrieve it? Can we keep the latency acceptable?
Hence, in practice, nearly all RAG systems use single-hop retrieval, and we rely on having all relevant information as part of the top-10 list we provide to generative models.
The HotpotQA dataset from BEIR is a great benchmark for this. It measures whether or not we can retrieve all relevant paragraphs to answer a query. As mentioned, each question requires retrieving multiple paragraphs from different documents. The following graph compares nDCG@10 on this dataset.
By boosting retrieval performance, RAG systems can provide more complete information, even for the most challenging queries that require information from multiple sources.
Training for Quality and Scalability
Stage 1: Web Crawl for Topic Similarity
Our embedding models have been trained in multiple stages. First, they were trained on questions and answers from a large web crawl. When we presented our multilingual-v2.0 model last year, we had a collection of over 1.4 billion question-and-answer pairs in 100+ languages, covering basically every topic on the internet. This first stage ensures the model learns topic similarity between questions and documents (i.e., it will find documents on the same topic as the query).
Stage 2: Search Queries for Content Quality
As shown before, learning topic similarity isn't sufficient for many real-world datasets, where you can have redundant information with varying quality levels. Hence, the second stage involved measuring content quality. We used over 3 million search queries from search engines and retrieved the top-10 most similar documents for each query. A large model was then used to rank these documents by their content quality for the given query: which document provides the most relevant information, and which the least?
This signal was fed back to the embedding model to teach it to differentiate between high-quality and low-quality content for a given query. Trained on millions of queries, the model learns this distinction across a broad spectrum of topics and domains.
Stage 3: Embeddings Optimized for Compression
The final stage involves special, compression-aware training. Running semantic search at scale (with hundreds of millions to billions of embeddings) causes high infrastructure costs for the underlying vector database, orders of magnitude higher than the cost of computing the embeddings. This final stage ensures that the models work well with vector compression methods, reducing your vector database costs several-fold while keeping up to 99.99% of the search quality. We will soon provide more information on accessing the compressed vectors and saving on your vector database costs.
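This post doesn't detail the compression scheme itself, but as a generic illustration of why compression matters at this scale, here is simple int8 scalar quantization (our example, not necessarily what Embed v3 uses): it stores one byte per dimension instead of four, and search scores queries against a close approximation of the original vectors.

import numpy as np

def quantize_int8(embeddings):
    # embeddings: float32 array of shape (n_docs, dim)
    lo = embeddings.min(axis=0)
    scale = (embeddings.max(axis=0) - lo) / 255.0
    scale[scale == 0] = 1.0                          # guard against constant dimensions
    codes = np.round((embeddings - lo) / scale).astype(np.uint8)
    return codes, lo, scale                          # 4x smaller than float32

def approximate_scores(query, codes, lo, scale):
    approx = codes.astype(np.float32) * scale + lo   # dequantize on the fly
    return approx @ query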
Model Evaluation with MTEB, BEIR, and MIRACL
In the previous sections, we provided model performance on some selected datasets. We also benchmarked our models extensively on various well-known benchmarks.
MTEB: Massive Text Embedding Benchmark
MTEB is a large text embedding benchmark that measures embedding models across seven tasks: classification, clustering, pair classification, re-ranking, retrieval, STS (semantic textual similarity), and summarization. It includes 56 datasets from various domains and with various text lengths.
Our new Embed English v3 model is ranked first among 90 text embedding models, and the Embed Multilingual v3 model is ranked first among multilingual models. All evaluation results can be found in the embed v3.0 evaluation spreadsheet.
Results on MTEB show the broad capability of the model for various tasks and domains, making it a great default choice.
BEIR: Out-of-Domain Information Retrieval
BEIR is a benchmark focused on out-of-domain information retrieval. Originally it consisted of 18 datasets, but now just 14 are publicly available (and, due to license changes for the Twitter API, one dataset can no longer be accessed). We benchmarked all 18 datasets, but focus on the 14 publicly available ones to allow easier reproduction.
The BEIR paper shows that out-of-domain information retrieval is especially challenging for text embedding models, which perform well on the datasets they were trained on but struggle when applied to other datasets and domains. As most users don't have training data for their own data, out-of-domain performance is the most critical indicator for embedding models.
Unfortunately, many recently published embedding models train on these datasets, and some have even started to train on the respective test sets (i.e., telling the model the correct answers for the test set). For our training, we excluded any potential overlap with the test sets. All results can be viewed in our BEIR eval spreadsheet.
MIRACL: Semantic Search Across 100+ Languages
Our multilingual version of Embed v3 performs strongly across 100+ languages, including Chinese, French, Japanese, Korean, Spanish, and more. This versatility makes it a valuable resource for customers building apps that encompass data from multiple languages, such as semantic search, customer sentiment analysis, and content moderation. We used the MIRACL benchmark to evaluate how well Embed v3 performs across multiple languages. As with BEIR, we avoided overlaps between training and test sets and present zero-shot performance. Full results for the MIRACL dev set can be found in our spreadsheet.
Get Started with Embed v3
As you can tell, we’re excited about Embed v3 and the leap forward in performance, allowing developers to improve search and recommendations for their applications.
You can access Embed now with the API key provided with your Cohere account. Customers using Embed on other AI cloud platforms will gain access to the new Embed version soon. For more information, see our developer documentation.
Interested in learning more? Join us for a webinar on November 20th at 11:00 am EST where Nils Reimers (Creator of SBERT and Cohere’s Director of Embeddings) will provide an in-depth walkthrough of the benefits of using Embed v3.