Cohere int8 & binary Embeddings - Scale Your Vector Database to Large Datasets

Cohere Embed now natively supports int8 and binary embeddings to reduce memory cost.

Semantic search over large datasets can require a lot of memory, which is expensive to host in a vector database. For example, searching across all of Wikipedia requires storing 250 million embeddings. With 1024 dimensions per embedding and each dimension stored as a 4-byte float32, you will need close to 1 TB of memory on a server.

We are excited to announce that Cohere Embed is the first embedding model that natively supports int8 and binary embeddings. When evaluated on MIRACL, a semantic search benchmark developed by the University of Waterloo spanning 18 languages, our Embed v3 - int8 and Embed v3 - binary significantly outperform other embedding models, such as OpenAI text-embedding-3-large, while reducing your memory cost 100x, from an estimated $130k to $1,300 annually.

Dimensionality reduction vs. compression

Most vector databases store embeddings and vector indices in memory. Each embedding dimension is typically stored as float32, so an embedding with 1024 dimensions requires 1024 x 4 bytes = 4096 bytes. For 250M embeddings, this results in 954 GB of memory without the ANN vector index.

The most common approach to reducing this huge memory requirement is dimensionality reduction, but it performs poorly: in our research, dimensionality reduction produced the worst search quality of the methods we compared.

Instead of reducing the number of dimensions, a better method is to train the model specifically to use fewer bytes per dimension. By using 1 byte per dimension, we reduce the memory 4x (954 GB → 238 GB) while keeping 99.99% of the original search quality. We can go even further, and use just 1 bit per dimension, which reduces the needed memory 32x (954 GB → 30 GB) while keeping 90-98% of the original search quality.
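The memory arithmetic behind these numbers is straightforward (using GiB, i.e. 1024³ bytes):

```python
n_embeddings = 250_000_000  # e.g. all of Wikipedia
dim = 1024

# bytes per dimension: float32 = 4, int8 = 1, binary = 1/8
float32_gb = n_embeddings * dim * 4 / 1024**3
int8_gb = n_embeddings * dim * 1 / 1024**3
binary_gb = n_embeddings * dim / 8 / 1024**3

print(round(float32_gb), round(int8_gb), round(binary_gb))  # 954 238 30
```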

How to get int8 & binary embeddings from our API:

With the parameter embedding_types you can control which type of embeddings should be returned:

```python
import cohere

co = cohere.Client("YOUR_API_KEY")

doc_embeddings = co.embed(
    texts=my_documents,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["int8"],
).embeddings

# Access the int8 embeddings via
doc_embeddings_int8 = doc_embeddings.int8
```

You can also pass in several values to get the embeddings in different types:

```python
doc_embeddings = co.embed(
    texts=my_documents,
    model="embed-english-v3.0",
    input_type="search_document",
    embedding_types=["int8", "float"],
).embeddings

# Access the embeddings via
doc_embeddings_int8 = doc_embeddings.int8
doc_embeddings_float = doc_embeddings.float
```

The following values are available for embedding_types: float, int8, uint8, binary, and ubinary.

Using int8-embeddings

Using int8 embeddings gives you a 4x memory saving and about a 30% speed-up in search, while keeping 99.99% of the search quality. In our opinion, it is a great solution, and we recommend it for most deployments.

Whether you can use int8 embeddings depends on your vector database; sadly, not all support them yet. The following table gives an overview of the vector databases that, to our knowledge, currently support int8 embeddings:

| Vector Database | int8 Support |
| --- | --- |
| OpenSearch | Yes |
| Elasticsearch | Yes |
| Vespa.ai | Yes |
| Azure AI Search | Yes |
| Milvus | Indirectly, via IVF_SQ8 |
| Qdrant | Indirectly, via Scalar Quantization |
| Faiss | Indirectly, via IndexHNSWSQ |

Using int8-embeddings with OpenSearch

int8 / byte embeddings are supported starting with OpenSearch v2.9 and Elasticsearch 8.12. The following example shows how to create the respective index, how to compute and index document embeddings, and how to search your index.

Cohere int8 / byte embeddings are also now available through Elastic’s Inference API after the Elasticsearch 8.13 release.

Using Binary Embeddings

Binary embeddings convert the 1024 float32 values to 1024 bit values, giving you a 32x reduction in memory. Because transferring and storing 1024 bits would be inefficient, these 1024 bit values are packed to 128 bytes, which you can either get as a signed int8 or as an unsigned uint8 value.
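To illustrate the packing, here is a small NumPy sketch (the API returns these packed values for you as the ubinary and binary embedding types, so you never need to do this yourself):

```python
import numpy as np

# 1024 bits, e.g. one bit per float dimension of an embedding
bits = (np.random.default_rng(0).standard_normal(1024) > 0).astype(np.uint8)

ubinary = np.packbits(bits)     # 128 unsigned uint8 values
binary = ubinary.view(np.int8)  # the same 128 bytes, read as signed int8

# The packing is lossless: unpacking recovers the original 1024 bits
assert np.array_equal(np.unpackbits(ubinary), bits)
```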

Besides the 32x reduction in memory, binary embeddings can also be searched 40x faster. You can search 1M binary embeddings within 20ms. This makes binary embeddings especially interesting for large, multi-tenancy setups. You can skip the build of many small ANN indices, and instead search directly on the necessary indices, which can even be off-loaded to disk.

The following table gives an overview of vector databases that support binary embeddings:

| Vector Database | Binary Embedding Support |
| --- | --- |
| faiss | Yes |
| Vespa.ai | Yes |
| Milvus | Yes |
| Qdrant | Via Binary Quantization |
| Weaviate | Via Binary Quantization |

Using Binary Embeddings with faiss

The following example shows how to use binary embeddings with faiss:

We use IndexBinaryFlat as the index, a brute-force index that scans over all documents. Because searching binary embeddings is up to 40x faster than searching float embeddings, this index works even for larger document collections: scanning 1M document embeddings can be done within 20 ms.

For larger collections (tens of millions of embeddings), faiss offers IndexBinaryHNSW, which can search over hundreds of millions of embeddings within milliseconds.

Better Search Quality with <float, binary> rescoring

The above script first uses the binary query embeddings to search across all document embeddings. This can be done extremely fast, as comparing two binary embeddings only uses 2 CPU cycles, allowing us to scan millions of document embeddings within milliseconds.

To improve the search quality, the float query embedding can then be compared with the binary document embeddings using the dot product. So we first retrieve 10*top_k results with the binary query embedding, and then rescore these candidates with the float query embedding. This pushes the search quality from 90% to 95% of the float baseline.
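A NumPy sketch of this two-phase scheme (random vectors stand in for real embeddings; in practice, phase 1 would run inside your binary index):

```python
import numpy as np

rng = np.random.default_rng(0)
n_docs, dim, top_k = 50_000, 1024, 10

doc_float = rng.standard_normal((n_docs, dim)).astype(np.float32)
doc_binary = doc_float > 0  # 1 bit per dimension

query_float = rng.standard_normal(dim).astype(np.float32)
query_binary = query_float > 0

# Phase 1: Hamming-distance search with the binary query -> 10*top_k candidates
hamming = np.count_nonzero(doc_binary != query_binary, axis=1)
candidates = np.argsort(hamming)[: 10 * top_k]

# Phase 2: rescore candidates by <float query, binary doc> dot product,
# mapping document bits {0, 1} to {-1, +1}
scores = (doc_binary[candidates].astype(np.float32) * 2 - 1) @ query_float
top_ids = candidates[np.argsort(-scores)[:top_k]]
```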

Compression-Friendly Embedding Model Training

Most embedding models are trained to just produce float32 embeddings. These models then often perform rather poorly when combined with vector space compression. You get a massive drop in search quality once you scale to hundreds of millions of embeddings.

As some of our customers use Embed v3 with tens of billions of embeddings, great search quality even for the largest vector spaces was a core design principle when we trained the model.

To enable this, we specifically trained our embeddings to be compression-friendly: they have to not only operate in a float32 space, but perform equally well with int8, binary, and product quantization (more on this coming soon), giving you superior search quality while needing less memory to host the vector space.

The most basic approach to training embedding models is contrastive training. Here, you embed a question, a relevant answer, and (several) non-relevant documents:

During model training, you move the question and the relevant answer close together in the vector space, and push the question and irrelevant documents far apart. Typically this happens using float32 embeddings.

The issue here? You have no control over whether the model will still work when used with vector compression. So instead of training only with float32 embeddings, we also made sure to train for int8 and binary precision, as well as in a product-quantized space. This ensures the best search quality when using any of these compression techniques.
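For intuition, the basic contrastive objective (often called InfoNCE) over float embeddings can be sketched as follows. This is a simplified NumPy illustration, not our actual training code, which additionally optimizes the int8, binary, and product-quantized spaces:

```python
import numpy as np

def info_nce_loss(query, positive, negatives, temperature=0.05):
    """Contrastive loss for one query, its relevant answer, and irrelevant docs.

    Minimizing this pulls the positive toward the query and pushes the
    negatives away (cosine similarity, scaled by a temperature).
    """
    candidates = np.vstack([positive, negatives])
    q = query / np.linalg.norm(query)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    logits = c @ q / temperature
    # cross-entropy with the positive at index 0 (numerically stable logsumexp)
    m = logits.max()
    return float(m + np.log(np.exp(logits - m).sum()) - logits[0])
```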

Full Evaluation

We evaluated all our models across many different datasets. Below you can find the results on the Massive Text Embedding Benchmark (MTEB).

Developers can get started with Embed on our playground. Learn more about Embed v3 and how to get started with compressed embeddings in our developer documentation. 

You can access the Wikipedia data embedded for you to use here.
