Note: This post comes with a corresponding Colab notebook, and we encourage you to follow it along as you read it.

TL;DR

This blog post illustrates the differences between keyword and semantic search of a Wikipedia dataset using Cohere and Weaviate. It also shows the improvements obtained using Cohere’s Rerank endpoint. Finally, it shows how to combine search with the Cohere Chat endpoint, in order to answer questions in sentence format.

Introduction

In a previous blog post, you learned the difference between keyword search and dense retrieval (one of the main semantic search methods), and how dense retrieval shows a huge improvement over keyword search as it captures the semantics of the text.

Cohere Rerank uses a mechanism that assigns a relevance score to each query-response (or query-document) pair. Those that have high scores are very likely to contain a question and its corresponding answer. When we combine either keyword search or dense retrieval with Rerank, the results significantly improve.

Generative models are able to respond to questions, but they also have a problem with hallucination. More specifically, a generative model may answer a question with a statement that, while sounding true, is not the correct answer. A way to reduce hallucinations in a generative model is to combine it with a search system. First, the search system searches for text or documents that are very likely to contain the answer. This text is called the context. The query and the context are given to a generative model, and the model is prompted to answer the question using the context only. This results in models that output much more accurate responses.

In this blog post, you’ll see all these concepts in action. We’ll walk through an example query use case using a Wikipedia dataset. This is a Weaviate demo dataset containing 10 million Wikipedia vectors. Note that in this demo, the embeddings are precomputed. If they weren’t, you could calculate them using the co.embed endpoint.

💡

If you enjoy this content, be sure to check out more in LLM University!

Using a Vector Database

In order to store the Wikipedia dataset query, we’ll use the Weaviate vector database, which will give us a range of benefits. In simple terms, a vector database is a place where one can store data objects and vector embeddings, and be able to access them and perform operations easily. For example, finding the nearest neighbors of a vector in a dataset is a lengthy process, which is sped up significantly by using a vector database. This is done with the following code.

import weaviate
import cohere

# Add your Cohere API key here
# You can obtain a key by signing up in https://dashboard.cohere.com/ or https://docs.cohere.com/reference/key
cohere_api_key = ''

co = cohere.Client(cohere_api_key)

# Connect to the Weaviate demo databse containing 10M wikipedia vectors
# This uses a public READ-ONLY Weaviate API key
auth_config = weaviate.auth.AuthApiKey(api_key="76320a90-53d8-42bc-b41d-678647c6672e")
client = weaviate.Client(
    url="https://cohere-demo.weaviate.network/",
    auth_client_secret=auth_config,
    additional_headers={
        "X-Cohere-Api-Key": cohere_api_key,
    }
)

Querying the Wikipedia Dataset Using Keyword Matching

To use keyword matching, we’ll first define the following function for keyword search. In this function, we’ll tell the vector database what properties we want from each retrieved document. We’ll also filter them to the English language (using results_lang), but feel free to explore searching in other languages as well!

def keyword_search(query, results_lang='en', num_results=10):
properties = ["text", "title", "url", "views", "lang", "_additional {distance}"]


where_filter = {
"path": ["lang"],
"operator": "Equal",
"valueString": results_lang
}


response = (
client.query.get("Articles", properties)
.with_bm25(
query=query
)
.with_where(where_filter)
.with_limit(num_results)
.do()
)
result = response['data']['Get']['Articles']
return result

We’ll use two search queries, of varying difficulty.

Simple query: “Who discovered penicillin?”
Hard query: “Who was the first person to win two Nobel prizes?”

The responses for these queries are “Alexander Fleming”, and “Marie Curie”. Now let’s see how keyword search does. Here are the top three results for each query (some results are repeated, so let’s look at the three top distinct ones).

Query 1: “Who discovered penicillin?”

Responses:

Penicillin
Antibiotic
Alexander Fleming

As you can see, keyword search did quite well. All three articles contain the answer, and in particular, the third one is the correct response: Alexander Fleming.

Now let’s see how it did with the more complicated query.

Query 2: “Who was the first person to win two Nobel prizes?”

Responses:

Neutrino
Western culture
Reality television

This time, keyword search was very far from finding the answer. If you explore the articles, you may notice that they contain several mentions of words such as “first”, “person”, “Nobel”, “prizes”, but none of them have any information on the first person to win two Nobel prizes. In fact, the neutrino article mentions a scientist who won two Nobel prizes, but this wasn’t the first person to achieve this feat.

As you can see, keyword search can be good for queries, like “Who discovered penicillin?”, in which you’d expect the answers to have a lot of words in common with the query. More specifically, if an article contains the words “discovered,” and “penicillin”, it’s also likely to contain the fact that Alexander Fleming discovered it.

With harder queries like “Who was the first person to win two Nobel prizes?”, keyword search doesn’t do well. The reason is that the words in the query would appear in many instances without necessarily talking about something as specific as the first person who won two Nobel prizes. By matching words, we haven’t yet exploited the semantics of the sentence. A model that understands what we mean by “the first person to win two Nobel prizes” would be able to find the answer, which is exactly what dense retrieval does (see the next section).

Querying the Dataset Using Dense Retrieval

Dense retrieval uses a text embedding in order to search for documents that are similar to a query. If you’d like to learn more about embeddings, please take a look at this blog post. Embeddings assign a vector (long list of numbers) to each piece of text. One of the main properties in an embedding is that similar pieces of text go to similar vectors.

In short, dense retrieval consists of the following:

Finding the embedding vector corresponding to the query
Finding the embedding vectors corresponding to each of the responses (in this case, Wikipedia articles)
Retrieving the response vectors that are closest to the query vector in the embedding

Dense retrieval finds the closest documents to the query in the embedding

To use dense retrieval, we’ll first define the following function. Just like with keyword search, we’ll tell the vector database what properties we want from each retrieved document, and filter them to the English language (using results_lang).

def dense_retrieval(query, results_lang='en', num_results=10):

    nearText = {"concepts": [query]}
    properties = ["text", "title", "url", "views", "lang", "_additional
    {distance}"]

    # To filter by language
    where_filter = {
        "path": ["lang"],
        "operator": "Equal",
        "valueString": results_lang
        }
    response = (
        Client.query
        .get("Articles", properties)
        .with_near_text(nearText)
        .with_where(where_filter)
        .with_limit(num_results)
        .do()
    )

    result = response['data']['Get']['Articles']
    return result

Chunking the Articles

This process of finding the closest documents to a query in an embedding will yield good results. However, articles may be very long and this could make things complicated. In order to have more granularity, we’ll split the articles by paragraph. This means that we’ll find the embedding vector corresponding to each paragraph of each article in the Wikipedia dataset. In that way, when the model retrieves the answer, it will actually output the paragraph that it found the most similar to the query, as well as the article in which this paragraph belongs.

Wikipedia articles get chunked by paragraph, and each chunk gets assigned an embedding vector

Back to Querying the Dataset

Let’s review the two queries we used.

Simple query: “Who discovered penicillin?”
Hard query: “What was the first person to win two Nobel prizes?”

Now, let’s look at the three top results for each query using dense retrieval. Recall that the responses here are at the paragraph level, so the model may sometimes retrieve the same article several times by outputting different paragraphs from the same article.

Query 1: “Who discovered penicillin?”

Responses:

Alexander Fleming: “Sir Alexander Fleming (6 August 1881 - 11 March 1995) was a Scottish physician and microbiologist …”
Penicillin: “Penicillin was discovered in 1928 by Scottish scientist Alexander Fleming …”
Penicillin: “The term “penicillin” is defined as the natural product of “Penicillium” mould with antimicrobial activity. It was coined by Alexander Fleming ...”

As you can see, dense retrieval did quite well by finding paragraphs that contain the exact answer. Now, let’s see how it did with the more complicated query.

Query 2: “Who was the first person to win two Nobel prizes?”

Responses:

As you can see, dense retrieval did much better than keyword search here. The second, third, and fourth results are in the correct documents (Nobel prize and Marie Curie), and in fact, the third and fourth results are in a paragraph which explicitly contains the answer. The reason for this is that the embedding captures the semantics of the text, and is able to see if two pieces of text have a similar meaning, even if they don’t necessarily share many words in common.

For both keyword search and dense retrieval, and in fact, for any other search mechanism we use, Cohere’s Rerank provides a very powerful method to enhance it. The Rerank endpoint assigns a relevance score to each query-response pair. As the name hints, relevance scores are high for pairs in which the response is relevant to the query, and low otherwise.

Let’s look at how we can use Rerank to improve our Wikipedia search results from the previous sections.

Using Rerank to Improve Keyword Search

Rerank is a very powerful method which can significantly boost any existing search system. In short, rerank takes a query and a response, and outputs a relevance score between them. In that way, one can use any search system to surface a number of documents that can potentially contain the answer to a query, and then sort them using Rerank.

The results from any search system get reranked based on their relevance to the query

Remember that the results we obtained for the query “Who was the first person to win two Nobel prizes” using the keyword_search function were the following (for the full text, please check out the Colab notebook):

Query: “Who was the first person to win two Nobel prizes?”

Responses:

These could contain the answer somewhere in the document, but they are certainly not the best documents for this query. Let’s dig in a bit more, and find the first 100 results. To save space, I’ll only note the top 20 titles.

Neutrino
Western culture
Reality television
Peter Mullan
Indiana Pacers
William Regal
Nobel Prize
Nobel Prize
Nobel Prize
Noble gas
Nobel Prize in Literature
D.C. United
Nobel Prize in Literature
2021-2022 Manchester United F.C. season
Nobel Prize
Nobel Prize
Zach LaVine
2011 Formula One World Championship
2021-2022 Manchester United F.C. season
Christians

Ok, there’s a high chance that the answer is there. Let’s see if Rerank can help us find it. The following function calls the Rerank endpoint. Its inputs are the query, the responses, and the number of responses we’d like to retrieve.

def rerank_responses(query, responses, num_responses=10):
    reranked_responses = co.rerank(
        model = 'rerank-english-v2.0',
        query = query,
        documents = responses,
        top_n = num_responses,
    )
    return reranked_responses

Rerank will output the result, as well as the relevance score. Let’s look at the top 3 results.

Query: “Who was the first person to win two Nobel prizes?”

Responses:

Nobel Prize: “Five people have received two Nobel Prizes. Marie Curie received the …”
Relevance score: 0.98109454
Neutrino: “In the 1960s, the now-famous Homestake experiment …”
Relevance score: 0.9334308
Alfred Nobel: “Nobel was elected a member of the Royal Swedish Academy of Sciences …”
Relevance score: 0.82046944

Well, that certainly improved the keyword search results! Even though the second and third results don’t work, the first one retrieved the correct article, which is the one that contains the answer. Notice that the relevance score is close to 1.

Generating Answers

Earlier in this post, you’ve learned how to search and retrieve information from large databases in very effective ways. In this chapter, you’ll learn how to combine this with a generative model in order to get an answer in sentence format instead of a list of search results.

Large language models, as you know, are very good at answering questions, but they are prone to some limitations, such as incorrect information or even hallucinations. A good way to fix this is to enhance an LLM with a search mechanism.

In short, this combination is done in the following way:

Given a query, the search mechanism retrieves one or more documents containing the answer.
These documents are given to the large language model and it is instructed to generate an answer based on that information.

I like to imagine this the following way. If I have a question about thermodynamics, I can pick a random friend of mine and ask them that question. They may or may not get the answer wrong. But, if I go and search a few chapters in books about thermodynamics, give them to my friend, and then ask them to answer the question based on those chapters, they are much more likely to answer the question correctly.

A generative model receives a query and outputs a response. This response may be inaccurate

The query is first given to a search system, which retrieves documents which are likely to contain the answer. Then the query and the documents (context) are fed to the generative model, for a more accurate response.

Generating Answers (Without Search)

Let’s first use a generative model to answer a slightly harder question — without search. We are trying to find out how many people won more than one Nobel prize. So, we ask the model the following query.

Query: “How many people have won more than one Nobel prize?”

The answer to this question is five: Marie Curie, Linus Pauling, John Bardeen, Frederick Sanger, and Karl Barry Sharpless.

The way to ask this to the model is with the following line of code, which calls the co.chat endpoint.

prediction_without_search = [
    co.chat(
        message=query,
        max_tokens=50,
        preamble="")
    for _ in range(5)]

We call the endpoint five times to get five responses. The max_tokens parameter determines the length of the answer (which is why some answers appear truncated). We also set the preamble parameter to an empty string, to ensure that the responses are relatively brief.

Responses:

There have been 24 Nobel Prize recipients who have won the prize more than once. Among them, John Bardeen has won the most Nobel Prizes, with four Nobel Prizes in Physics. Marie Curie has won the most Nobel Prizes among women, with the
There have been 25 Nobel Prize recipients who have won more than one Nobel Prize. Of these, 14 are men and 11 are women. The most Nobel Prizes won by a single person is four, which was achieved by Marie Curie, who won the
There are currently 18 individuals who have received more than one Nobel Prize. Of those, seven are still living.
There have been 24 Nobel Laureates who have won the Nobel Prize more than once.
There are currently 13 people who have won more than one Nobel Prize. The most Nobel Prizes won by a single person is four, which is the case for Marie Curie. She won the Nobel Prize in Physics in 1903, in 1911, and in

These answers sound like they could be correct, but they’re actually all wrong. One reason for this is that transformers are good at talking and understanding sentiment and nuisances of the language, etc., but not so good at storing information. As a matter of fact, storing information inside the nodes of the neural network is not something that we can (or should!) fully trust.

Instead, let’s first search for the answer using what we’ve learned in the previous sections of this post.

Searching Answers

In order to find the answer to this question in the Wikipedia dataset (the one we’ve been working with throughout this post), we can use the same dense_retrieval function that we used before. For simplicity, we will only use dense retrieval without Rerank, but we invite you to add it to the lab and see how the results improve!

responses = dense_retrieval(query, num_results=20)

This retrieves the top 20 articles, with their corresponding paragraphs. Here are the top three (remember that the search is done by finding the most similar paragraphs to the query, so some articles may appear several times with different paragraphs).

Responses:

Nobel Peace Prize: “, the Peace prize has been awarded to 110 individuals and 27 organizations …”
Nobel Prize: “The strict rule against awarding a prize to more than three people is also controversial …”
Nobel Prize: “The prize ceremonies take place annually …”

Next, we’ll feed these 20 paragraphs to a generative model, and instruct it to answer the question in sentence format.

Generating an Answer from the Search Results

In order to get the generative model to answer a question based on a certain context, we need to create a prompt. And in this prompt, we need to give it a command and a context. The context will be the concatenation of all the paragraphs retrieved in the search step, which we can obtain using this line of code:

context = [r['text'] for r in responses]

The array context contains a lot of text, and, given the good results we’ve been obtaining with search mechanisms, we are fairly confident that somewhere in this text lies the answer to our original question. Now, we invoke the Chat endpoint. The prompt we’ll use is the following.

prompt = f"""
Use the information provided below to answer the questions at the end. If the answer to the question is not contained in the provided information, say "The answer is not in the context".
---
Context information:
{context}
---
Question: How many people have won more than one Nobel prize?
"""

In other words, we’ve prompted the model to answer the question, but only from information coming from the context array. And if the information is not there, we are prompting the model to state that the answer is not in the context. The following line of code will run the prompt.

prediction_with_search = [
    co.chat(
        message=prompt,
        max_tokens=50,
        preamble="")
    for _ in range(5)]

The five responses we get are the following (just like before, some of them are truncated):

The answer is Five people have received two Nobel Prizes.
The answer is Five people have received two Nobel Prizes. Marie Curie received the Physics Prize in 1903 for her work on radioactivity and the Chemistry Prize in 1911 for the isolation of pure radium, making her the only person to be awarded a Nobel
The answer is Five people have received two Nobel Prizes.
The answer is The Curie family has received the most prizes, with four prizes awarded to five individual laureates. Marie Curie received the prizes in Physics (in 1903) and Chemistry (in 1911). Her husband, Pierre Curie, shared the
The answer is Five people have received two Nobel Prizes. Marie Curie received the Physics Prize in 1903 for her work on radioactivity and the Chemistry Prize in 1911 for the isolation of pure radium, making her the only person to be awarded a Nobel

As you can see, this improved the quality of the answers. All of them get the right number of people who received more than one Nobel prize, which is 5.

Final Thoughts

Dense retrieval offers a considerable improvement in quality of results over keyword search as it searches using the semantics of the text. In addition to this, Cohere’s Rerank endpoint offers a considerable improvement of these results. These, when used with the Cohere Chat endpoint, result in a model that answers questions more accurately in sentence format.

Get started with building search applications with Cohere’s models.