Context by Cohere
Scaling Semantic Search Systems with Pinecone and Cohere

Semantic search is a critical component of Retrieval-Augmented Generation (RAG) workflows. Although it is easier than ever to build a small-scale semantic search index for a proof-of-concept project, it can be difficult to scale semantic search for enterprise RAG use cases such as customer support chatbots and internal knowledge assistants. For example, it's challenging to maintain the low latencies required for production workflows, even with purpose-built vector databases. And regardless of the underlying choice of database, developers have to provision infrastructure, estimate their usage, and constantly resize clusters, all while being mindful of cost.

Forget About Managing Clusters: Pinecone’s Serverless Solution

Pinecone recently introduced a new vector database architecture called Pinecone serverless that addresses these key issues. With Pinecone serverless, developers can focus on building applications at any scale (millions to billions of vectors) without having to provision and manage clusters. This significantly reduces the overhead of managing the computational resources required for low-latency embedding storage and retrieval. The separation of reads, writes, and storage also significantly reduces costs for workloads of all types and sizes. The result: RAG workflows run faster and cost less to operate.

Faster Embedding Jobs with Cohere and Pinecone

Cohere Embed is our leading text representation language model. It’s particularly performant in real-world scenarios with noisy data and RAG use cases. Embed works with Pinecone serverless just like it would with any other Pinecone index for fast and scalable vector search. After embeddings are generated through the Cohere API, they can be upserted into Pinecone serverless, where they can be indexed and searched at low latencies.
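Connecting the two services is mostly data plumbing: Cohere returns one embedding per input text, and Pinecone's index.upsert() accepts a list of (id, values, metadata) tuples. A minimal sketch of that shaping step, using fake placeholder embeddings in place of a live co.embed() call (the helper name to_upsert_tuples is illustrative, not part of either SDK):

```python
# Sketch: shape Cohere embeddings into the (id, values, metadata)
# tuples that Pinecone's index.upsert() accepts. The embeddings here
# are fake placeholders; in practice they come from
# co.embed(...).embeddings.
def to_upsert_tuples(texts, embeddings):
    """Pair each text with its embedding, keyed by a string id."""
    return [(str(i), emb, {"text": text})
            for i, (text, emb) in enumerate(zip(texts, embeddings))]

texts = ["The Eiffel Tower is in Paris.", "Pinecone stores vectors."]
fake_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]  # stand-ins

vectors = to_upsert_tuples(texts, fake_embeddings)
# With a live serverless index: index.upsert(vectors=vectors)
print(vectors[0][0], vectors[0][2])
```

Keeping the source text in the metadata, as shown, makes it easy to return the matched passages to a RAG pipeline at query time.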

When building semantic search systems at scale, the combination of Cohere’s new Embed Jobs endpoint and the Pinecone serverless vector database creates a powerful toolkit. The Embed Jobs endpoint enables asynchronous embedding generation, eliminating the need to configure optimal batches and manage a large-scale synchronous embedding process. Embed Jobs also stages and validates your data, so once a job is launched, there is no need to manage partial completions from user errors or API downtime. Using Pinecone serverless in combination with Embed Jobs simplifies the process of working with embeddings, making it easier to deploy and scale enterprise semantic search and RAG applications.

Scaling Semantic Search: Cohere and Pinecone in Action

For a practical illustration, consider building a RAG-powered chatbot or semantic search system that seamlessly integrates Cohere Embed and Pinecone serverless. The accompanying notebook provides an example workflow, allowing developers to explore and implement this combination firsthand.

Encoding Large Corpora with Embed Jobs

Certain customers need to scale to hundreds of millions or even tens of billions of embeddings. Getting these embeddings via a synchronous API can be painful and slow: it results in millions of HTTP requests sent between your system and our servers.
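Some quick arithmetic makes the point. Assuming a cap of 96 texts per synchronous embed call (the batch limit here is an illustrative assumption), embedding 100 million documents takes over a million round trips:

```python
# Back-of-the-envelope: request count for synchronous embedding.
# The 96-texts-per-call batch limit is an illustrative assumption.
docs = 100_000_000
batch_size = 96
requests = -(-docs // batch_size)  # ceiling division
print(requests)  # → 1041667, i.e. over a million HTTP requests
```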

To make this process easier, we introduced Embed Jobs. You bulk upload your dataset to our servers, and the embeddings are computed there. Once the job is complete, Embed Jobs lets you download the embeddings. We have used this method to efficiently encode billions of embeddings per day for individual customers.

Let’s take a look at the steps to bulk-uploading a dataset using Embed Jobs.

Step 1: Upload a Dataset

To get started, you first need to create a new dataset on our servers. In this example, we’ll use embed_jobs_sample_data.jsonl, which contains about 1700 paragraphs from Wikipedia.
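The dataset is a JSON Lines file: one JSON object per line. A minimal sketch of what a single line might look like (the `text` field name is an assumption for illustration; match whatever your file actually uses):

```python
# Parse one line of a JSONL dataset file. The "text" field name is
# an illustrative assumption.
import json

line = '{"text": "The Eiffel Tower is a wrought-iron lattice tower in Paris."}'
record = json.loads(line)
print(record["text"][:15])  # → The Eiffel Towe
```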

With the following code, we create the dataset and upload this file to the Cohere servers.

  import cohere
  co = cohere.Client('COHERE_API_KEY')

  # Upload a dataset for Embed Jobs
  dataset_file_path = 'embed_jobs_sample_data.jsonl'
  ds = co.create_dataset(
      name='embed-jobs-sample',
      data=open(dataset_file_path, 'rb'),
      dataset_type='embed-input',
  )
  ds.await_validation()  # block until server-side validation completes

Step 2: Create Embeddings via Embed Jobs

Once your dataset is uploaded, you can start the embedding job. You'll need to pass in the dataset ID from the previous step and specify the model you want to use, along with any needed parameters. Here, we pick the embed-english-v3.0 model and set the input_type to search_document (since we want to upsert the embeddings into our Pinecone database).

  job = co.create_embed_job(dataset_id=ds.id, model='embed-english-v3.0',
                            input_type='search_document')
  job.wait()  # poll the server until the job is completed

Step 3: Download the Dataset

While the job is running, you can poll our servers for progress. Once all documents are embedded, you can download your embeddings and use them in your script like this:

  # Fetch the job's output dataset and load the embeddings into an array
  output_dataset = co.get_dataset(job.output.id)
  data_array = []
  for record in output_dataset:
      data_array.append(record['embeddings'])
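With the embeddings downloaded (or, more typically, upserted into Pinecone), querying is a nearest-neighbor search: embed the query text with input_type set to search_query, then call index.query(vector=..., top_k=...) on the Pinecone index. As a local, self-contained illustration of the ranking itself, here is cosine similarity over a few stand-in vectors:

```python
# Local illustration of semantic search ranking: score stand-in
# document embeddings against a query embedding by cosine similarity.
# A Pinecone serverless index does this server-side at scale.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

doc_embeddings = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]  # stand-ins
query_embedding = [1.0, 0.0]

ranked = sorted(range(len(doc_embeddings)),
                key=lambda i: cosine(query_embedding, doc_embeddings[i]),
                reverse=True)
print(ranked)  # → [0, 1, 2], best match first
```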

To learn more, see Cohere's Embed Jobs documentation and the Pinecone documentation.
