Cohere Embeddings Now Available Through Elastic’s Inference API

With the release of Elasticsearch 8.13, developers can now access Cohere’s Embed v3 models in Elastic’s Inference API. This enables businesses to easily create embeddings for their data, index those embeddings in Elastic, and perform vector and hybrid search across their documents.

Developers can use Elastic’s ingest pipelines to add Cohere embeddings to their indices for vector search with a single API call, and they can take advantage of Cohere’s native embedding compression to reduce storage costs by 75%.

Simplified Vector Search with Elastic’s Inference API

Elastic’s Inference API makes it easy for developers to access and perform inference on AI models in their Elastic environment. The API makes it simple to create and query your vector indices in Elastic by eliminating the need to self-host a model or make separate inference calls through an external API. And notably, rather than having to manually iterate through every document in an existing lexical search index to add vector embeddings, you can create an inference ingestion pipeline and reindex your data with a single API call.

Cohere’s entire Embed v3 model series — including our state-of-the-art English (embed-english-v3.0) and multilingual embedding models (embed-multilingual-v3.0) — is now available through Elastic’s Inference API in both Elastic Cloud and Elastic self-managed environments.

We are also excited to announce that both float and int8 (or byte) embeddings are supported natively through the Inference API. Int8 compression lets users take advantage of Elastic’s support for byte vectors and reduce the size of their embeddings by 4x with minimal impact on search quality. Because vector database costs are driven in part by the size of the vectors being stored, this enables developers to cut storage costs without compromising on accuracy. Used with int8 (byte) compression, Cohere’s embedding models offer competitive performance against OpenAI’s embedding model at a fraction of the storage cost. The chart below compares the MTEB accuracy of various embedding models against the cost of storing a dataset of ~250M embeddings.
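As a rough back-of-the-envelope illustration (ignoring index overhead): a 1024-dimensional float32 embedding occupies about 4 KB, while the same vector stored as int8 occupies about 1 KB, so raw vector storage for ~250M embeddings drops from roughly 1 TB to roughly 256 GB.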

We are excited to make our embeddings more easily accessible to developers on Elastic’s industry-leading platform.

How to Use Cohere with the Elastic Inference API

Implementing Cohere’s embeddings with the Inference API requires only a few API calls.

Start by creating an inference model, specifying one of Cohere’s embedding models. In this case, we will use Cohere’s baseline English model `embed-english-v3.0` with int8 compression. To use int8 compression in Elastic, specify an `embedding_type` of `byte`.

PUT _inference/text_embedding/cohere_embeddings 
{
    "service": "cohere",
    "service_settings": {
        "api_key": "<cohere_api_key>", 
        "model_id": "embed-english-v3.0", 
        "embedding_type": "byte"
    }
}
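If you’d like to sanity-check the endpoint before wiring it into a pipeline, you can call it directly; the input text below is just an illustrative placeholder, and the response should contain a byte-valued embedding.

POST _inference/text_embedding/cohere_embeddings
{
  "input": "Cohere embeddings in Elasticsearch"
}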

Next, create an index mapping for the new index that will contain your embeddings. Here you specify parameters determined by your choice of model and compression technique: `embed-english-v3.0` produces 1024-dimensional vectors, and the `byte` element type matches the `embedding_type` chosen above.

PUT cohere-embeddings
{
  "mappings": {
    "properties": {
      "content_embedding": { 
        "type": "dense_vector", 
        "dims": 1024, 
        "element_type": "byte"
      },
      "content": { 
        "type": "text" 
      }
    }
  }
}

Next, create an ingest pipeline with an inference processor to automate the computation of embeddings when ingesting content into your index. 

PUT _ingest/pipeline/cohere_embeddings
{
  "processors": [
    {
      "inference": {
        "model_id": "cohere_embeddings", 
        "input_output": { 
          "input_field": "content",
          "output_field": "content_embedding"
        }
      }
    }
  ]
}
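Optionally, you can verify the pipeline before reindexing with the simulate API; the sample document below is purely illustrative.

POST _ingest/pipeline/cohere_embeddings/_simulate
{
  "docs": [
    {
      "_source": {
        "content": "Cohere embeddings in Elasticsearch"
      }
    }
  ]
}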

To complete the setup of your new index, reindex the data from an existing source using the ingest pipeline you just created. Your new index will contain embeddings for all of the text in the input field specified in the pipeline, ready to use in semantic search.

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "test-data",
    "size": 50 
  },
  "dest": {
    "index": "cohere-embeddings",
    "pipeline": "cohere_embeddings"
  }
}
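Because `wait_for_completion=false` runs the reindex asynchronously, the call returns a task ID. You can check on its progress with the task management API (the task ID below is a placeholder).

GET _tasks/<task_id>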

Now that your index is populated, you can query it easily using kNN vector search with Cohere’s embeddings.

GET cohere-embeddings/_search
{
  "knn": {
    "field": "content_embedding",
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "cohere_embeddings",
        "model_text": "Elasticsearch and Cohere"
      }
    },
    "k": 10,
    "num_candidates": 100
  },
  "_source": [
    "id",
    "content"
  ]
}
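The kNN query above covers pure vector search. For hybrid search, one minimal sketch is to pair a lexical `match` query with the `knn` clause in the same request; Elasticsearch combines the scores of the two components.

GET cohere-embeddings/_search
{
  "query": {
    "match": {
      "content": "Elasticsearch and Cohere"
    }
  },
  "knn": {
    "field": "content_embedding",
    "query_vector_builder": {
      "text_embedding": {
        "model_id": "cohere_embeddings",
        "model_text": "Elasticsearch and Cohere"
      }
    },
    "k": 10,
    "num_candidates": 100
  },
  "_source": [
    "id",
    "content"
  ]
}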

And there you have it! Semantic search across your Elastic index with Cohere embeddings. To get started, create a trial API key for Cohere and try out Elastic’s Inference API with Cohere’s embeddings.

If you’re interested in implementing semantic search or retrieval-augmented generation pipelines at enterprise scale, and would like to speak with an expert on our team, please get in touch.
