Semantic Search with OpenSearch and Cohere: A Comprehensive Demo

TL;DR

This blog explains how to perform semantic search with Cohere and OpenSearch. It provides step-by-step instructions via Python on how to spin up a local OpenSearch instance, store Cohere embeddings in an OpenSearch index, and retrieve and use these embeddings.

Introduction

OpenSearch is an open-source, distributed search and analytics engine that allows users to search, analyze, and visualize large volumes of data in real time. When it comes to text search, OpenSearch is well known for powering keyword-based (also called lexical) search methods.

But along with the rapid progress in machine learning, semantic search is increasingly becoming the preferred option for text search due to its ability to capture contextual information beyond just keywords. So if your system has an existing OpenSearch implementation, how can you leverage this new capability?

The good news is that OpenSearch also provides support for vector search, allowing you to add this technology without having to go through complex migrations. Using Cohere embeddings, you can seamlessly add semantic search capabilities to your existing system.

Overview of the Demo

In this article, we’ll walk through the steps to build a demo project using Python that implements semantic search in OpenSearch, powered by Cohere’s text embeddings. We’ll go through the following steps:

  • Step 1: Spin up an instance of OpenSearch
  • Step 2: Embed your documents
  • Step 3: Create an index for your documents
  • Step 4: Query your index for similar documents using Cohere embeddings

At the end of this article, we’ll perform a comparison between the search results generated from lexical, fuzzy, and semantic approaches.

Keep in mind that the instructions provided in this blog are based on OpenSearch version 2.7.0. So, let's dive in and get started!

For this demo, we are going to be using the arXiv dataset to populate an index in OpenSearch using Cohere embeddings. The arXiv dataset contains scholarly articles, from many subdisciplines, that can be hard to query if you're looking for something specific. In this demo, we will perform semantic search given a query and find similar documents within the arXiv dataset corpus. Note: only 5k documents from the arXiv corpus have been used for demo purposes.

You can find the source code for this demo in this GitHub repository.

Step-by-Step Walkthrough

The following steps outline how to perform semantic search with Cohere and OpenSearch.

Step 1: Spin Up an Instance of OpenSearch

To get started, we will spin up a local OpenSearch cluster using Docker and Docker Compose. If you're using a Linux system, follow the installation instructions for both Docker and Docker Compose.

Next, save the docker-compose.yml file provided below. It starts a single-node OpenSearch cluster, version 2.7.0. After that, run docker-compose up, and your OpenSearch instance will be up and running at http://localhost:9200.

version: "3"
services:
  opensearch-node:
    image: opensearchproject/opensearch:2.7.0
    container_name: opensearch-node
    environment:
      - discovery.type=single-node
      - "DISABLE_INSTALL_DEMO_CONFIG=true"
      - "DISABLE_SECURITY_PLUGIN=true"
      - plugins.ml_commons.only_run_on_ml_node=false
    ulimits:
      memlock:
        soft: -1
        hard: -1
    ports:
      - 9200:9200

To ensure that your server is up and running, you can run the command curl localhost:9200. If everything is working correctly, you should see output similar to the following:

{
  "name" : "00a9dbfa7905",
  "cluster_name" : "docker-cluster",
  "cluster_uuid" : "WC7Xz3BpT5ueMD9TYyQ2DQ",
  "version" : {
    "distribution" : "opensearch",
    "number" : "2.7.0",
    "build_type" : "tar",
    "build_hash" : "b7a6e09e492b1e965d827525f7863b366ef0e304",
    "build_date" : "2023-04-27T21:43:09.523336706Z",
    "build_snapshot" : false,
    "lucene_version" : "9.5.0",
    "minimum_wire_compatibility_version" : "7.10.0",
    "minimum_index_compatibility_version" : "7.0.0"
  },
  "tagline" : "The OpenSearch Project: https://opensearch.org/"
}
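
The same check can be scripted, which is handy when later steps should wait for the container to finish booting. The following is a minimal sketch and not part of the demo repository: parse_cluster_info, fetch_root, and wait_for_cluster are hypothetical helper names, and the retry count and delay are arbitrary assumptions.

```python
import json
import time
import urllib.request
from typing import Callable, Tuple


def parse_cluster_info(payload: str) -> Tuple[str, str]:
    """Return (distribution, version number) from the root endpoint's JSON reply."""
    info = json.loads(payload)
    version = info["version"]
    return version["distribution"], version["number"]


def fetch_root(url: str) -> str:
    """Fetch the body of the cluster's root endpoint as text."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")


def wait_for_cluster(
    url: str = "http://localhost:9200",
    fetch: Callable[[str], str] = fetch_root,
    retries: int = 30,
    delay: float = 2.0,
) -> Tuple[str, str]:
    """Poll the root endpoint until it answers, then return its version info."""
    last_error = None
    for _ in range(retries):
        try:
            return parse_cluster_info(fetch(url))
        except (OSError, ValueError, KeyError) as err:
            # Connection refused, or the reply was not the expected JSON yet.
            last_error = err
            time.sleep(delay)
    raise RuntimeError(f"OpenSearch did not respond at {url}: {last_error}")
```

Calling wait_for_cluster() against the container from this step should return the distribution name and the version number reported by the cluster.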

Step 2: Embed Your Documents

Awesome! Your OpenSearch instance is up and running. Now it's time to move on to Step 2.

We're going to use Cohere's Embed endpoint to embed documents into vectors. OpenSearch's k-NN search can support vectors up to 10,000 in dimensionality when using the nmslib or faiss engines. We'll use the nmslib engine in OpenSearch and, to keep things simple, Cohere's embed-english-light-v2.0 model, which returns vectors of dimensionality 1024.

To get started, you'll need to download the arXiv dataset. We've already subset it to 5,000 rows for demonstration purposes and saved it in the /data folder of the GitHub repository.

Before we proceed, you'll need to create a Cohere account and grab your free trial API_KEY. Once you have your API_KEY, you're ready to move on to the next step.

To create a client and interact with the Embed endpoint, we'll be using the Cohere Python package, which you can install via pip install "cohere>=3.8.0" (the quotes keep the shell from treating >= as a redirect). Once it's installed, you can instantiate a client as shown below.

import cohere
co = cohere.Client('YOUR_API_KEY')

Next, you'll need to read in your local data file and create a list of texts to send to the Embed endpoint. For this demo, we'll be using the abstract column to create embeddings. Here's an example of how to do it:

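import pandas as pd

# Replace the placeholder with the path to your copy of the dataset.
df = pd.read_csv("<PATH_TO_DATASET>").fillna("").reset_index(drop=True)

# Each abstract is already a string, so collect the whole column as-is.
# (Indexing with text[0] would keep only the first character of each abstract.)
texts = df["abstract"].tolist()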

Once you have a list of texts to embed, hit the embed endpoint using the following helper function to get back a list of embedding vectors. Each item in embed_list should have 1024 float numbers.

from typing import List, Union


def get_cohere_embedding(
    text: Union[str, List[str]], model_name: str = "embed-english-light-v2.0"
) -> Union[List[float], List[List[float]]]:
    """
    Embed text with the Cohere client.

    A single string returns one vector (a list of floats);
    a list of strings returns one vector per string.
    """
    if isinstance(text, str):
        # The endpoint expects a list, so wrap the string and unwrap the result.
        return co.embed([text], model=model_name).embeddings[0]
    return co.embed(text, model=model_name).embeddings


embed_list = get_cohere_embedding(texts)
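
Depending on the SDK version, a single call with all 5,000 abstracts may or may not be split into smaller requests for you. If you prefer to control this yourself, a sketch like the following batches the texts explicitly. Here batched and embed_in_batches are hypothetical helpers, and the default batch size of 96 is an assumption that should be checked against the current Embed API limits.

```python
from typing import Callable, Iterator, List, Sequence


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_in_batches(
    texts: Sequence[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 96,  # assumed limit, verify against the API documentation
) -> List[List[float]]:
    """Embed `texts` batch by batch and return the vectors in input order."""
    vectors: List[List[float]] = []
    for batch in batched(texts, batch_size):
        batch_vectors = embed_fn(batch)
        if len(batch_vectors) != len(batch):
            raise ValueError("embedding function returned an unexpected number of vectors")
        vectors.extend(batch_vectors)
    return vectors
```

With the helper above in scope, embed_list = embed_in_batches(texts, get_cohere_embedding) would produce the same list of vectors, one request per batch.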

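To save time and cost, dump the embedding vectors to a file to be used in the next step. Note that although the file is named cache.jsonl, the code below writes a single JSON object that maps each text to its vector.

import json

cache = dict(zip(texts, embed_list))

with open("cache.jsonl", "w") as fp:
    json.dump(cache, fp)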
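
For larger corpora, a true JSON Lines layout (one record per line) can be streamed and appended to, unlike a single JSON object. The sketch below shows one way to write and read such a file. The function names and the text and embedding keys are assumptions of this example, and a cache written this way must be read back with read_jsonl_cache rather than json.load.

```python
import json
from typing import Dict, List


def write_jsonl_cache(path: str, cache: Dict[str, List[float]]) -> None:
    """Write one JSON record per line: {"text": ..., "embedding": [...]}."""
    with open(path, "w", encoding="utf-8") as fp:
        for text, embedding in cache.items():
            # json.dumps escapes newlines inside the text, so each record stays on one line.
            fp.write(json.dumps({"text": text, "embedding": embedding}) + "\n")


def read_jsonl_cache(path: str) -> Dict[str, List[float]]:
    """Rebuild the text -> embedding mapping from a JSON Lines file."""
    cache: Dict[str, List[float]] = {}
    with open(path, "r", encoding="utf-8") as fp:
        for line in fp:
            if line.strip():  # skip blank lines
                record = json.loads(line)
                cache[record["text"]] = record["embedding"]
    return cache
```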

Congratulations! Your corpus is now embedded and ready to be used for semantic search.

The full script can be run with python cache_vectors.py.

Step 3: Create an Index for Your Documents

Great! Now that our embeddings are cached in a file, we can move on to Step 3 and index them in OpenSearch.

In order to perform semantic search with OpenSearch, we need to create an index to store our documents and their dense vectors. This will allow OpenSearch to efficiently retrieve the nearest neighbors to a given query vector.

The k-NN plugin in OpenSearch has three methods that we can use to search our corpus using vectors. These methods are:

  1. Approximate k-NN
  2. Script Score
  3. Painless extensions

For this demo, we will use the Approximate k-NN method. Instead of comparing the query vector against every document, it builds an approximate nearest neighbor structure (here, an HNSW graph) when documents are indexed and searches that structure at query time, trading a small amount of accuracy for lower latency. Similarity is computed as the distance between the query vector and potential hits. This approach is particularly useful for large collections of high-dimensional vectors, so it's a good option for our use case.
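
To make the distance computation concrete, here is a brute-force (exact) version of what the approximate search estimates: score every document against the query by cosine similarity and keep the best k. This is only an illustration in plain Python with made-up toy vectors. It is not how the k-NN plugin works internally, since the plugin avoids a full scan.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(
    query: Sequence[float], docs: Sequence[Sequence[float]], k: int
) -> List[Tuple[int, float]]:
    """Return (document position, similarity) for the k most similar documents."""
    scored = [(i, cosine_similarity(query, doc)) for i, doc in enumerate(docs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional "embeddings"; real Cohere vectors here have 1024 dimensions.
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(exact_top_k([1.0, 0.1], docs, k=2))
```

The full scan costs one similarity computation per document for every query, which is the work an approximate index is designed to avoid.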

To get started with Approximate k-NN, we need to create an index in OpenSearch with the index.knn parameter set to true.

Additionally, we need to set configurations for the k-NN search. See the documentation for guidance on setting the right parameters. These parameters include:

  • dimension = dimensionality of the embedding vector. For us, since we are using the Cohere embed-english-light-v2.0 model, it is 1024.
  • method.name = supported algorithm to perform the k-NN search. hnsw is currently supported with an engine type of nmslib.
  • method.space_type = corresponds to the function used to measure the distance between two vectors. In this example, we set space_type='cosinesimil' to denote the cosine similarity distance. There are a variety of other space_types that you may want to select depending on your use case. You can find these in the docs.
  • method.engine = the library to use for indexing/search. When using a CPU, nmslib is the recommended engine option.

When selecting hnsw as the method.name, we have additional parameters for the hnsw algorithm such as ef_construction and m. See the docs for guidance on setting the right parameters.

We are using the Python opensearch-py package to communicate with the OpenSearch cluster in the backend. For demo purposes, we are turning off SSL verification since it is a simple, local cluster.

Create a client like so:

from opensearchpy import OpenSearch

def get_opensearch_client(host="localhost", port=9200) -> OpenSearch:
    # Create the client with SSL/TLS and hostname verification disabled.
    client = OpenSearch(
        hosts=[{"host": host, "port": port}],
        http_compress=True,
        use_ssl=False,
        verify_certs=False,
        ssl_assert_hostname=False,
        ssl_show_warn=False,
    )
    return client

client = get_opensearch_client()

Create an index by first creating the body payload and then submitting it to the index endpoint. The body payload specifies the name of your vector field, its dimensionality, and the Approximate k-NN parameters discussed above. In this example, our document index will be called arxiv-cosine.

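INDEX_NAME = "arxiv-cosine"
# Name of the field that stores the vectors. This value is a placeholder
# chosen for this walkthrough; use whatever field name suits your index.
VECTOR_NAME = "embedding"
# Dimensionality of embed-english-light-v2.0 vectors.
VECTOR_SIZE = 1024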

body = {
    "settings": {"index": {"knn": "true", "knn.algo_param.ef_search": 100}},
    "mappings": {
        "properties": {
            VECTOR_NAME: {
                "type": "knn_vector",
                "dimension": VECTOR_SIZE,
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"ef_construction": 128, "m": 24},
                },
            },
        }
    },
}
response = client.indices.create(INDEX_NAME, body=body)
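
If you expect to try several space types or HNSW settings, it can help to build the payload from a small function. The helper below is a hypothetical convenience wrapper that produces the same structure as the payload above. Its name and its defaults (taken from the values used in this demo) are assumptions of this sketch.

```python
from typing import Any, Dict


def build_knn_index_body(
    vector_name: str,
    dimension: int,
    space_type: str = "cosinesimil",
    engine: str = "nmslib",
    ef_construction: int = 128,
    m: int = 24,
    ef_search: int = 100,
) -> Dict[str, Any]:
    """Build the create-index payload for an Approximate k-NN (hnsw) vector field."""
    if dimension < 1:
        raise ValueError("dimension must be a positive integer")
    return {
        "settings": {"index": {"knn": True, "knn.algo_param.ef_search": ef_search}},
        "mappings": {
            "properties": {
                vector_name: {
                    "type": "knn_vector",
                    "dimension": dimension,
                    "method": {
                        "name": "hnsw",
                        "space_type": space_type,
                        "engine": engine,
                        "parameters": {"ef_construction": ef_construction, "m": m},
                    },
                }
            }
        },
    }
```

The result can be passed as the body argument when creating the index, for example with a different space type for a second index to compare against arxiv-cosine.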

As a sanity check, you can confirm that your index has been created by running the following line, which lists all indices currently in your OpenSearch backend.

print(client.indices.get_alias("*").keys())

Now that the index is created, we need to populate it with data. We are going to use the cache.jsonl we created in the previous step and the dataset dataframe df to populate the index with documents and their corresponding embedding vectors.

import json

from tqdm import tqdm

with open("cache.jsonl", "r") as fp:
    cache = json.load(fp)

# insert each row one-at-a-time to the document index
for i, row in tqdm(df.iterrows(), total=len(df)):
    text = row.abstract
    try:
        embed = cache[text]
        body = {
            VECTOR_NAME: embed,
            "text": text,
            "title": row.title,
            "arxiv_id": row.id,
            "doi": row.doi,
        }
        response = client.index(index=INDEX_NAME, id=i, body=body)
    except Exception as e:
        print(f"[ERROR]: {e}")
        continue
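
Indexing one document per request is easy to follow but can be slow for larger corpora. opensearch-py also provides a helpers.bulk function that sends many documents per request. The sketch below shows one way to use it. iter_bulk_actions and bulk_index are hypothetical helpers, and the vector_name argument stands for whatever field name you used in the mapping.

```python
from typing import Any, Dict, Iterable, Iterator, Mapping, Sequence


def iter_bulk_actions(
    rows: Iterable[Mapping[str, Any]],
    cache: Mapping[str, Sequence[float]],
    index_name: str,
    vector_name: str,
) -> Iterator[Dict[str, Any]]:
    """Yield one bulk action per row that has a cached embedding."""
    for i, row in enumerate(rows):
        text = row["abstract"]
        if text not in cache:
            continue  # no cached embedding for this abstract, skip it
        yield {
            "_index": index_name,
            "_id": i,
            "_source": {
                vector_name: list(cache[text]),
                "text": text,
                "title": row["title"],
                "arxiv_id": row["id"],
                "doi": row["doi"],
            },
        }


def bulk_index(client: Any, actions: Iterable[Dict[str, Any]]) -> int:
    """Send the actions with opensearch-py's bulk helper; return the success count."""
    # Imported lazily so the generator above can be used without the package installed.
    from opensearchpy import helpers

    success, _errors = helpers.bulk(client, actions)
    return success
```

With the dataframe from Step 2, the rows could be supplied as df.to_dict("records"), so that the position of each row matches the id used in the loop above.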

We have included the abstract, title, arxiv_id, and doi fields in the document index from the data file. Feel free to include more or fewer fields depending on what your application needs.

Using the opensearch-py-ml package, we are able to query our OpenSearch cluster with a pandas-like interface. You can run the following to ensure the documents inserted at your index_name are what you expect in terms of size.

Note: when using the opensearch-py-ml package, check its documentation, as not every pandas operation is supported.

import opensearch_py_ml as oml

oml_df = oml.DataFrame(client, INDEX_NAME)
print(oml_df.shape)

🎉 Your index has been created!

The full script can be run with python create_cosine_index.py.

Step 4: Query Your Index for Similar Documents Using Cohere Embeddings

Great news, now that your index is all set up, you're ready to start querying it!

Because k-NN search operates on vectors, each query first needs to be translated into a vector using Cohere. Once we have the query vector ready, we can submit it to the query endpoint within OpenSearch to find similar vectors using Approximate k-NN.

Not to worry though, we've got some handy helper functions that can make this process a breeze.

import json
from typing import Dict, List, Union

import requests


def get_cohere_embedding(
    text: Union[str, List[str]], model_name: str = "embed-english-light-v2.0"
) -> Union[List[float], List[List[float]]]:
    """
    Embed text with the Cohere client.

    A single string returns one vector; a list of strings returns a list of vectors.
    """
    if isinstance(text, str):
        return co.embed([text], model=model_name).embeddings[0]
    return co.embed(text, model=model_name).embeddings


def find_similar_docs(query: str, k: int, num_results: int, index_name: str) -> Dict:
    """
    Main semantic search capability using knn on input query strings.
    Args:
        query: text to search for
        k: number of nearest neighbors the k-NN search retrieves
        num_results: maximum number of hits returned in the response
        index_name: index name in OpenSearch
    """
    embed_vector = get_cohere_embedding(query)

    body = {
        "size": num_results,
        "query": {"knn": {VECTOR_NAME: {"vector": embed_vector, "k": k}}},
    }

    url = f"http://localhost:9200/{index_name}/_search"
    response = requests.get(
        url, json=body, headers={"Content-Type": "application/json"}
    )
    return json.loads(response.content)

The above functions search our index using two important parameters:

  • k = the number of neighbors that the hnsw search will return per query. The maximum k supported is 10,000.
  • size = how many results will be returned from the query.

For example, using one of the queries from the demo:

query = "what is cancer"
search_output = find_similar_docs(query=query, k=2, num_results=3, index_name=INDEX_NAME)
print(search_output)
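
The raw response is a nested dictionary, so a small helper can make the results easier to read. The sketch below assumes the standard search response layout (a hits object containing a hits list, each entry with _id, _score, and _source). summarize_hits is a hypothetical name, and the title field is the one indexed in Step 3.

```python
from typing import Any, Dict, List


def summarize_hits(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Reduce a search response to id, score, and title for each hit."""
    hits = response.get("hits", {}).get("hits", [])
    return [
        {
            "id": hit.get("_id"),
            "score": hit.get("_score"),
            "title": hit.get("_source", {}).get("title"),
        }
        for hit in hits
    ]
```

For example, summarize_hits(search_output) returns one small dictionary per hit, in the order OpenSearch ranked them. A response without hits, such as an error payload, yields an empty list.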

Great job. Now, you’re able to search your index semantically, and you can find and retrieve results based on their meaning and context, which is a powerful tool in information retrieval.

A full demo of the semantic search functionality versus the lexical search built into OpenSearch can be viewed in our notebook. If you would like to serve the demo app, you can spin up a Streamlit app with the following command:

streamlit run demoapp.py

The following video snippet from the demo illustrates how semantic search improves on traditional search approaches. There are three examples shown:

  1. Question search (Phrase: what is cancer): Semantic search does better than the other two methods.
  2. Phrase search (Phrase: cancer research): All three methods (semantic, fuzzy, and lexical) return relevant results.
  3. Similar-phrase search (Phrase: cancerous lesions): Only semantic search returns relevant results.
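
For readers who want to reproduce this comparison, the three approaches differ mainly in the query body sent to the _search endpoint. The exact lexical and fuzzy queries used by the demo app are not shown in this post, so the sketch below is an assumption: it uses a plain match query, a match query with fuzziness set to AUTO, and the knn query from Step 4, all against the text field indexed in Step 3. The function names are hypothetical.

```python
from typing import Any, Dict, List


def lexical_query(text: str, size: int = 3) -> Dict[str, Any]:
    """Keyword search: match analyzed terms of `text` against the text field."""
    return {"size": size, "query": {"match": {"text": text}}}


def fuzzy_query(text: str, size: int = 3) -> Dict[str, Any]:
    """Keyword search that also tolerates small spelling differences."""
    return {
        "size": size,
        "query": {"match": {"text": {"query": text, "fuzziness": "AUTO"}}},
    }


def semantic_query(
    vector: List[float], vector_name: str, k: int = 3, size: int = 3
) -> Dict[str, Any]:
    """Approximate k-NN search against the vector field named `vector_name`."""
    return {"size": size, "query": {"knn": {vector_name: {"vector": vector, "k": k}}}}
```

Each body can be sent to the same index, and the returned titles compared side by side for queries such as "cancerous lesions".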

Final Thoughts

In conclusion, this demo showcases how OpenSearch and Cohere can be used together for efficient semantic search. By following the simple steps outlined in this tutorial, users can easily spin up an OpenSearch instance, store Cohere embeddings, and perform retrieval using these embeddings.

This integration offers a powerful solution for anyone looking to perform semantic search on large datasets. With OpenSearch's support for nmslib and faiss engines, and Cohere's high-quality embeddings, the possibilities are endless. We hope this demo has been helpful and encourages further exploration of these tools.
