In this chapter, you’ll learn how to build a RAG-powered chatbot that leverages text embeddings using the Chat, Embed, and Rerank endpoints.

We’ll use Cohere’s Python SDK for the code examples. Follow along in this notebook.

Contents

Step-by-Step Guide
- What We'll Build
Setup
Create the Vectorstore Component
Create the Chatbot Component
- Get the User Message
- Generate the Queries
- Retrieve Relevant Chunks and Generate the Response
- Display the Response with Citations
Run the Chatbot
Conclusion

In the previous chapter, you learned how to get started with RAG using the Chat endpoint.

In this chapter, you’ll learn how to build RAG applications that leverage text embeddings, and how to build a chatbot using RAG in document mode with multiple Cohere endpoints.

In the previous chapter, we used a short list of simple documents. In real-world applications, however, developers typically need to work with a larger volume of documents that each vary in length.

This is where text embeddings can help. With embeddings, we can split these documents into smaller chunks and build a semantic search system that can retrieve the most relevant chunks to a user query based on contextual meaning, and not just keyword-matching.
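To make the idea concrete, here is a minimal sketch of ranking chunks by similarity to a query. The three-dimensional vectors are made-up stand-ins for real embeddings (which the Embed endpoint would produce), and the cosine_similarity helper and chunk names are hypothetical, chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up toy vectors standing in for real chunk embeddings
chunk_embs = {
    "chunk about attention": [0.9, 0.1, 0.0],
    "chunk about embeddings": [0.1, 0.9, 0.1],
    "chunk about tokenization": [0.0, 0.2, 0.9],
}

# Made-up toy vector standing in for the embedded user query
query_emb = [0.8, 0.2, 0.1]

# Rank chunk names from most to least similar to the query
ranked = sorted(
    chunk_embs,
    key=lambda name: cosine_similarity(query_emb, chunk_embs[name]),
    reverse=True,
)
print(ranked[0])  # chunk about attention
```

The query shares no keywords with any chunk here; the ranking comes purely from how close the vectors are, which is the property semantic search relies on.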

Step-by-Step Guide

There are three RAG modes available with the Cohere Chat endpoint:

Document mode: Specifying the documents for the model to use when generating a response
Connectors mode: Connecting the endpoint with an external service that handles all the logic of document retrieval
Query-generation mode: Generating one or more queries given a user message

In this chapter, you’ll learn how to use RAG in document mode, which will also involve the query-generation mode. In the next chapter, you’ll learn how to use RAG in connectors mode. Refer to the RAG documentation for more details.

What We'll Build

We’ll build a chatbot that answers users’ questions about the contents in LLM University: What are Large Language Models? Let’s examine the demo application's high-level implementation plan (see the diagram below).

The steps to building a RAG-powered chatbot are summarized below:

Setup phase:

Step 0: Ingest the documents – get documents, chunk, embed, and index

For each user-chatbot interaction:

Step 1: Get the user message
Step 2: Call the Chat endpoint in query-generation mode
If at least one query is generated:
- Step 3: Retrieve and rerank relevant documents
- Step 4: Call the Chat endpoint in document mode to generate a grounded response with citations
If no query is generated:
- Step 4: Call the Chat endpoint in normal mode to generate a response

Throughout the conversation:

Append the user-chatbot interaction to the conversation thread
Repeat with every interaction
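The per-interaction steps above can be sketched as a single function. This is only an outline of the control flow under stated assumptions: answer_turn and the three fake_* functions are hypothetical stand-ins for the Chat endpoint calls and the retrieval step, not part of the Cohere SDK.

```python
from typing import Callable, Dict, List

def answer_turn(
    message: str,
    generate_queries: Callable[[str], List[str]],
    retrieve: Callable[[str], List[Dict[str, str]]],
    respond: Callable[[str, List[Dict[str, str]]], str],
) -> str:
    """One user-chatbot interaction, following Steps 2 to 4."""
    queries = generate_queries(message)       # Step 2: query generation
    if queries:
        documents = []
        for query in queries:                 # Step 3: retrieve per query
            documents.extend(retrieve(query))
        return respond(message, documents)    # Step 4: grounded response
    return respond(message, [])               # Step 4: direct response

# Hypothetical stand-ins so the flow can run without any API calls
def fake_generate_queries(message: str) -> List[str]:
    # Pretend that only questions need retrieval
    return [message] if message.endswith("?") else []

def fake_retrieve(query: str) -> List[Dict[str, str]]:
    return [{"text": f"chunk for: {query}"}]

def fake_respond(message: str, documents: List[Dict[str, str]]) -> str:
    return f"{len(documents)} document(s) used"

print(answer_turn("Hello", fake_generate_queries, fake_retrieve, fake_respond))
print(answer_turn("What is attention?", fake_generate_queries, fake_retrieve, fake_respond))
```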

To build a RAG system that can effectively handle a complex corpus of documents, we’ll use several Cohere API endpoints: Chat, Embed, and Rerank.

For further reading, the API reference page contains a detailed description of the Chat endpoint’s input parameters and response objects.

Setup

First, let’s install the necessary libraries for this project. This includes cohere, hnswlib for the vector library, and unstructured for chunking the documents (more details on these later).

pip install cohere hnswlib unstructured

Then, import the necessary modules from these libraries in addition to other required modules. Let’s also create a Cohere client.

import cohere
import uuid
import hnswlib
from typing import List, Dict
from unstructured.partition.html import partition_html
from unstructured.chunking.title import chunk_by_title

co = cohere.Client("COHERE_API_KEY")

We’ll build two classes that form the key components of the application: Vectorstore and Chatbot.

Two components of this project: Vectorstore and Chatbot

Now, let’s start building the first component: Vectorstore.

Create the Vectorstore Component

The Vectorstore class handles the ingestion of documents into embeddings (or vectors) and the retrieval of relevant documents given a query.

The Vectorstore component handles document ingestion and retrieval

As an example, we’ll use the contents from LLM University: What are Large Language Models? which explains the architecture of large language models. It consists of four web pages, each represented by an entry in the Python list raw_documents below. Each entry is identified by its title and URL.

raw_documents = [
    {
        "title": "Text Embeddings",
        "url": "https://docs.cohere.com/docs/text-embeddings"},
    {
        "title": "Similarity Between Words and Sentences",
        "url": "https://docs.cohere.com/docs/similarity-between-words-and-sentences"},
    {
        "title": "The Attention Mechanism",
        "url": "https://docs.cohere.com/docs/the-attention-mechanism"},
    {
        "title": "Transformer Models",
        "url": "https://docs.cohere.com/docs/transformer-models"}
]

We implement this in the Vectorstore class below, which takes the raw_documents list as input.

class Vectorstore:
    def __init__(self, raw_documents: List[Dict[str, str]]):
        self.raw_documents = raw_documents
        self.docs = []
        self.docs_embs = []
        self.retrieve_top_k = 10
        self.rerank_top_k = 3
        self.load_and_chunk()
        self.embed()
        self.index()

We also initialize a few instance attributes and methods. The attributes include self.raw_documents to represent the raw documents, self.docs to represent the chunked version of the documents, self.docs_embs to represent the embeddings of the chunked documents, and a couple of top_k parameters to be used for retrieval and reranking.

Meanwhile, the methods include load_and_chunk(), embed(), and index() for ingesting raw documents. As you’ll see, we will also specify a retrieve() method to retrieve relevant document chunks given a query.

The document ingestion portion of the Vectorstore component

Load and Chunk the Documents

The load_and_chunk() method loads the raw documents from their URLs and breaks them into smaller chunks. Chunking for information retrieval is a broad topic in and of itself, with many strategies being discussed within the AI community. For our example, we’ll use the partition_html function from the unstructured library to parse each page, and its chunk_by_title function to split the parsed elements into chunks. Read its documentation for more information about its chunking approach.

Each chunk is turned into a dictionary with three fields:

title: The web page’s title
text: The textual content of the chunk
url: The web page’s URL

This information will eventually be passed to the chatbot’s prompt for generating the response, so it’s crucial to populate relevant information into this dictionary. Note that we are not limited to these three fields. At a minimum, the Chat endpoint requires the text field, but beyond that, we can add custom fields that can provide more context about the document, such as subtitles, snippets, tags, and others.

The resulting dictionaries are stored in the self.docs attribute.

class Vectorstore:
    
    ...
    ...    

    def load_and_chunk(self) -> None:
        """
        Loads the text from the sources and chunks the HTML content.
        """
        print("Loading documents...")

        for raw_document in self.raw_documents:
            elements = partition_html(url=raw_document["url"])
            chunks = chunk_by_title(elements)
            for chunk in chunks:
                self.docs.append(
                    {
                        "title": raw_document["title"],
                        "text": str(chunk),
                        "url": raw_document["url"],
                    }
                )

Embed the Document Chunks

The embed() method generates embeddings of the chunked documents. We use the Embed endpoint and Cohere's embed-english-v3.0 model. Since the endpoint has a limit of 96 documents per call, we send them in batches.

With the Embed v3 model, we need to define an input_type, of which there are four options depending on the type of task. Using these input types ensures the highest possible quality for the respective tasks. Since our document chunks will be used for retrieval, we use search_document as the input_type.

The resulting chunk embeddings are stored in the self.docs_embs attribute.

class Vectorstore:
    
    ...
    ...

    def embed(self) -> None:
        """
        Embeds the document chunks using the Cohere API.
        """
        print("Embedding document chunks...")

        batch_size = 90
        self.docs_len = len(self.docs)
        for i in range(0, self.docs_len, batch_size):
            batch = self.docs[i : min(i + batch_size, self.docs_len)]
            texts = [item["text"] for item in batch]
            docs_embs_batch = co.embed(
                texts=texts, model="embed-english-v3.0", input_type="search_document"
            ).embeddings
            self.docs_embs.extend(docs_embs_batch)
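The batching logic can be checked in isolation, without calling the API. The sketch below assumes a hypothetical embed_in_batches helper and a fake_embed function that stands in for co.embed(); it only demonstrates how 200 texts would be split into calls of at most 90.

```python
from typing import Callable, List

def embed_in_batches(
    texts: List[str],
    embed_fn: Callable[[List[str]], List[List[float]]],
    batch_size: int = 90,
) -> List[List[float]]:
    """Calls embed_fn on consecutive slices of at most batch_size texts."""
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]  # slicing past the end is safe
        embeddings.extend(embed_fn(batch))
    return embeddings

call_sizes = []

def fake_embed(batch: List[str]) -> List[List[float]]:
    # Hypothetical stand-in for the Embed endpoint: records the batch size
    # and returns a one-dimensional dummy vector per text
    call_sizes.append(len(batch))
    return [[float(len(text))] for text in batch]

texts = [f"chunk {n}" for n in range(200)]
embs = embed_in_batches(texts, fake_embed)
print(call_sizes)  # [90, 90, 20]
```

Because each batch's embeddings are appended in order, the position of an embedding in the result matches the position of its text in the input, which is what lets the index refer to chunks by position later on.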

Index Document Chunks

The index() method indexes the document chunk embeddings. We build an index to store the embeddings in a structured and organized way in order to ensure efficient similarity search during retrieval.

There are many options available for building an index. For production environments, typically a vector database (like Weaviate or MongoDB) is required to handle the continuous process of indexing documents and maintaining the index.

In our example, however, we’ll keep it simple and use a vector library instead. We can choose from many open-source projects, such as Faiss, Annoy, ScaNN, or Hnswlib, which is the one we’ll use. These libraries store embeddings in in-memory indexes and implement approximate nearest neighbor (ANN) algorithms to make similarity search efficient.

The resulting index is stored in the self.idx attribute.

class Vectorstore:
    
    ...
    ...

    def index(self) -> None:
        """
        Indexes the documents for efficient retrieval.
        """
        print("Indexing documents...")

        self.idx = hnswlib.Index(space="ip", dim=1024)
        self.idx.init_index(max_elements=self.docs_len, ef_construction=512, M=64)
        self.idx.add_items(self.docs_embs, list(range(len(self.docs_embs))))

        print(f"Indexing complete with {self.idx.get_current_count()} documents.")
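To see what the index is used for, the sketch below performs the same kind of search exhaustively in plain Python: it scores every stored vector by its inner product with the query (the similarity that the space="ip" setting above is based on) and returns the positions of the top k. An ANN library approximates this result without scanning every vector. The exact_top_k helper and the two-dimensional vectors are hypothetical and made up for illustration.

```python
from typing import List

def exact_top_k(query_emb: List[float], doc_embs: List[List[float]], k: int) -> List[int]:
    """Returns the positions of the k vectors with the largest inner product."""
    scores = [sum(q * d for q, d in zip(query_emb, emb)) for emb in doc_embs]
    ranked_ids = sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)
    return ranked_ids[:k]

# Made-up toy vectors standing in for document chunk embeddings
doc_embs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [-1.0, 0.0]]

top_ids = exact_top_k([0.9, 0.3], doc_embs, k=2)
print(top_ids)  # [0, 1]
```

The exhaustive version costs one inner product per stored vector for every query, which is why an approximate index becomes worthwhile as the number of chunks grows.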

Implement Retrieval

The retrieve() method uses semantic search to retrieve relevant document chunks given a query, and it has two steps: (1) dense retrieval, (2) reranking.

A more detailed view of document ingestion, retrieval, and reranking

Dense Retrieval

We implement a dense retrieval system that leverages embeddings to retrieve document chunks, offering significant improvements over basic keyword-matching approaches. Embeddings can capture the contextual meaning of a document, thus enabling the retrieval of highly relevant results to the given query.

We embed the query using the same embed-english-v3.0 model that we used to embed the document chunks, but this time, we set input_type="search_query".

Search is performed by the knn_query() method from the hnswlib library. Given a query, it returns the document chunks most similar to the query. We define the number of document chunks to return using the attribute self.retrieve_top_k=10.

Reranking

After dense retrieval, we implement a reranking step. While our dense retrieval component is already highly capable of retrieving relevant sources, the Rerank endpoint provides an additional boost to the quality of the search results, especially for complex and domain-specific queries. It takes the search results and sorts them according to their relevance to the query.

We call the Rerank endpoint with co.rerank() and pass the query and the list of document chunks to be reranked. We also define the number of top reranked document chunks to retrieve using the attribute self.rerank_top_k=3. The model we use is rerank-english-v3.0, which lets you rerank documents that contain multiple fields, in the form of JSON objects. In our case, we'll use the title and text fields for reranking.

This method returns the top retrieved document chunks as a Python list docs_retrieved, so that they can be passed to the chatbot, which we’ll implement next.

class Vectorstore:

    ...
    ...

    def retrieve(self, query: str) -> List[Dict[str, str]]:
        """
        Retrieves document chunks based on the given query.

        Parameters:
        query (str): The query to retrieve document chunks for.

        Returns:
        List[Dict[str, str]]: A list of dictionaries representing the retrieved document chunks, with 'title', 'text', and 'url' keys.
        """

        # Dense retrieval
        query_emb = co.embed(
            texts=[query], model="embed-english-v3.0", input_type="search_query"
        ).embeddings

        doc_ids = self.idx.knn_query(query_emb, k=self.retrieve_top_k)[0][0]

        # Reranking
        rank_fields = ["title", "text"] # We'll use the title and text fields for reranking

        docs_to_rerank = [self.docs[doc_id] for doc_id in doc_ids]

        rerank_results = co.rerank(
            query=query,
            documents=docs_to_rerank,
            top_n=self.rerank_top_k,
            model="rerank-english-v3.0",
            rank_fields=rank_fields
        )

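        # Map each reranked result back to its original document chunk ID
        doc_ids_reranked = [doc_ids[result.index] for result in rerank_results.results]

        docs_retrieved = []
        for doc_id in doc_ids_reranked: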
            docs_retrieved.append(
                {
                    "title": self.docs[doc_id]["title"],
                    "text": self.docs[doc_id]["text"],
                    "url": self.docs[doc_id]["url"],
                }
            )

        return docs_retrieved
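One detail worth spelling out is that each rerank result refers to a position in the list that was passed to the Rerank endpoint, not to a position in the index, so it has to be mapped back to the ID returned by dense retrieval. The sketch below illustrates that mapping with a hypothetical FakeRerankResult class and made-up IDs and scores in place of a real API response.

```python
from dataclasses import dataclass

@dataclass
class FakeRerankResult:
    # Hypothetical stand-in for one rerank result
    index: int              # position in the list sent for reranking
    relevance_score: float  # made-up score for illustration

# Made-up chunk IDs, in the order returned by dense retrieval
doc_ids = [42, 7, 19, 3]

# Pretend the reranker kept the 3rd and 1st candidates, in that order
fake_results = [FakeRerankResult(2, 0.98), FakeRerankResult(0, 0.75)]

# Translate list positions back into chunk IDs
doc_ids_reranked = [doc_ids[result.index] for result in fake_results]
print(doc_ids_reranked)  # [19, 42]
```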

Process the Documents

We can now process the raw documents. We do that by creating an instance of Vectorstore. In our case, we get a total of 134 document chunks from the four web pages.

vectorstore = Vectorstore(raw_documents)

Loading documents...
Embedding document chunks...
Indexing documents...
Indexing complete with 134 documents.

Test Retrieval

Before going further, let’s test the document retrieval part of the system. Using the Vectorstore instance we just created, we call the retrieve() method to retrieve the document chunks most relevant to the query multi-head attention definition.

vectorstore.retrieve("multi-head attention definition")

And here’s the response. We can see that the document chunks returned are indeed highly relevant to the query we sent.

[{'title': 'Transformer Models',
  'text': 'The attention step used in transformer models is actually much more powerful, and it’s called multi-head attention. In multi-head attention, several different embeddings are used to modify the vectors and add context to them. Multi-head attention has helped language models reach much higher levels of efficacy when processing and generating text.',
  'url': 'https://docs.cohere.com/docs/transformer-models'},
 {'title': 'The Attention Mechanism',
  'text': "What you learned in this chapter is simple self-attention. However, we can do much better than that. There is a method called multi-head attention, in which one doesn't only consider one embedding, but several different ones. These are all obtained from the original by transforming it in different ways. Multi-head attention has been very successful at the task of adding context to text. If you'd like to learn more about the self and multi-head attention, you can check out the following two",
  'url': 'https://docs.cohere.com/docs/the-attention-mechanism'},
 {'title': 'Transformer Models',
  'text': 'Attention helps give context to each word, based on the other words in the sentence (or text).',
  'url': 'https://docs.cohere.com/docs/transformer-models'}]

Create the Chatbot Component

The Chatbot class handles the interaction between the user and the chatbot. It also handles the logic of the chatbot, including generating search queries based on a user message, and retrieving documents.

The Chatbot component handles the chatbot logic, from getting the user message to generating the response

The Chatbot class takes an instance of the Vectorstore class. We initialize a self.vectorstore attribute for that instance, as well as a unique conversation ID that we’ll need for each conversation.

class Chatbot:
    def __init__(self, vectorstore: Vectorstore):
        """
        Initializes an instance of the Chatbot class.

        Parameters:
        vectorstore (Vectorstore): An instance of the Vectorstore class.

        """
        self.vectorstore = vectorstore
        self.conversation_id = str(uuid.uuid4())

Get the User Message

Next, we create a run() method that will be used to run the chatbot application. It begins with the logic for getting the user message, along with a way for the user to end the conversation.

class Chatbot:

    ...


    def run(self):
        """
        Runs the chatbot application.

        """
        while True:
            # Get the user message
            message = input("User: ")

            # Typing "quit" ends the conversation
            if message.lower() == "quit":
                print("Ending chat.")
                break
            else:
                print(f"User: {message}")

Generate the Queries

Based on the user message, the chatbot needs to decide if it needs to consult external information before responding. If so, the chatbot determines an optimal set of search queries to use for retrieval. When we call co.chat() with search_queries_only=True, the Chat endpoint handles this for us automatically.


...

while True:
...
    # Generate search queries, if any
    response = co.chat(message=message, search_queries_only=True)

The generated queries can be accessed from the search_queries field of the object that is returned. To understand how this works, let’s look at a few scenarios:

No query needed: Suppose we have a user message of “Hello, I need help with a report I'm writing”. This type of message doesn’t require any additional context from external information, so retrieval is not required. A direct chatbot response will suffice (for example: “Sure, how can I help?”). When we send this to the Chat endpoint, we get an empty search_queries result, which is what we expect.
One query generated: Take this user message: “What did the report say about the company's Q4 performance?” This does require additional context as it refers to a report, hence retrieval is required. Given this message, the Chat endpoint returns the search_queries result of Q4 company performance. Here it turns the user message into a query optimized for search. Another important scenario is generating queries in the context of the conversation. Suppose there’s an ongoing conversation where the user is learning from the chatbot about deep learning. If at some point, the user asks, “Why is it important”, then the generated search_queries will become why is deep learning important, providing the much-needed context for the retrieval process.
More than one query generated: What if the user message is a bit more complex, such as “What did the report say about the company's Q4 performance and its range of products and services?” This requires multiple pieces of information to be retrieved. Given this message, the Chat endpoint returns two search_queries results: Q4 company performance and company's range of products and services.

These scenarios highlight the adaptability of the Chat endpoint to decide on the next course of action based on a user message.

Retrieve Relevant Chunks and Generate the Response

What happens next depends on how many search queries are returned.

If search queries are returned

If the chatbot response contains at least one search query, we call the retrieve() method from the Vectorstore class instance to retrieve document chunks that are relevant to the queries.

Then, we call the Chat endpoint to generate a response, adding a documents parameter to the call to pass the relevant document chunks.

If no search queries are returned

Meanwhile, if the chatbot response doesn’t contain any search queries, then it doesn’t require information retrieval. To generate the response, we call the Chat endpoint another time, passing the user message and without needing to add any sources to the call.

...

while True:
...
    # If there are search queries, retrieve document chunks and respond
    if response.search_queries:
        print("Retrieving information...", end="")
    
        # Retrieve document chunks for each query
        documents = []
        for query in response.search_queries:
            documents.extend(self.vectorstore.retrieve(query.text))
    
        # Use document chunks to respond
        response = co.chat_stream(
            message=message,
            model="command-r",
            documents=documents,
            conversation_id=self.conversation_id,
        )
    
    # If there is no search query, directly respond
    else:
        response = co.chat_stream(
            message=message,
            model="command-r",
            conversation_id=self.conversation_id,
        )

In either case, we also pass the conversation_id parameter, which retains the interactions between the user and the chatbot in the same conversation thread. We also call co.chat_stream() instead of co.chat(), so we can stream the chatbot response to the application.
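When more than one query is generated, the same chunk can be retrieved for several of them and end up in the documents list more than once. One optional refinement, which is not part of the code above, is to drop duplicates before the grounded call. The dedupe_documents helper below is a hypothetical sketch that treats two chunks as identical when their url and text match.

```python
from typing import Dict, List

def dedupe_documents(documents: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keeps the first occurrence of each (url, text) pair, preserving order."""
    seen = set()
    unique = []
    for doc in documents:
        key = (doc["url"], doc["text"])
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# Made-up retrieval results for two queries that share one chunk
documents = [
    {"title": "A", "text": "first chunk", "url": "https://example.com/a"},
    {"title": "B", "text": "second chunk", "url": "https://example.com/b"},
    {"title": "A", "text": "first chunk", "url": "https://example.com/a"},
]

unique_documents = dedupe_documents(documents)
print(len(unique_documents))  # 2
```

Preserving the original order matters here because citations refer to documents by their position in the list that was passed in.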

Display the Response with Citations

The chatbot response includes a stream of events, such as the generated text and citations, followed by a final object that contains the sources used by the chatbot along with other details.

To display the response, we use the text-generation events from the response stream.

The citation-generation events indicate the spans of text from the retrieved document chunks on which the response is grounded. Here is one example:

start=382 end=397 text='similar vectors' document_ids=['doc_0', 'doc_2']

The format of each citation is:

start: The starting point of a span where one or more documents are referenced
end: The ending point of a span where one or more documents are referenced
text: The text representing this span
document_ids: The IDs of the document chunks being referenced (doc_0 being the ID of the first document chunk passed to the documents parameter in the endpoint call, and so on)

The final response object includes a list of the document chunks, which we access from the documents attribute.

...

while True:
...
  # Print the chatbot response, citations, and documents
  print("\nChatbot:")
  citations = []
  cited_documents = []
  
  # Display response
  for event in response:
      if event.event_type == "text-generation":
          print(event.text, end="")
      elif event.event_type == "citation-generation":
          citations.extend(event.citations)
      elif event.event_type == "search-results":
          cited_documents = event.documents
  
  # Display citations and source documents
  if citations:
    print("\n\nCITATIONS:")
    for citation in citations:
      print(citation)
  
    print("\nDOCUMENTS:")
    for document in cited_documents:
      print(document)
  
  print(f"\n{'-'*100}\n")
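Because each citation carries start and end offsets into the response text, the spans can also be rendered inline rather than listed separately. The sketch below is one possible approach, assuming a hypothetical FakeCitation class that mirrors the four citation fields described earlier and a made-up response string; it appends the document IDs right after each cited span.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FakeCitation:
    # Hypothetical stand-in mirroring the citation fields
    start: int
    end: int
    text: str
    document_ids: List[str]

def annotate(text: str, citations: List[FakeCitation]) -> str:
    """Inserts a [doc ids] marker after each cited span."""
    # Work from the last span to the first so earlier offsets stay valid
    for citation in sorted(citations, key=lambda c: c.end, reverse=True):
        marker = "[" + ", ".join(citation.document_ids) + "]"
        text = text[: citation.end] + marker + text[citation.end :]
    return text

# Made-up response text and citations for illustration
response_text = "Similar sentences are assigned similar vectors."
citations = [
    FakeCitation(0, 17, "Similar sentences", ["doc_0"]),
    FakeCitation(31, 46, "similar vectors", ["doc_0", "doc_2"]),
]

annotated = annotate(response_text, citations)
print(annotated)
# Similar sentences[doc_0] are assigned similar vectors[doc_0, doc_2].
```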

Run the Chatbot

We can now run the chatbot app. For this, we create an instance of Chatbot, passing it the vectorstore instance (chatbot = Chatbot(vectorstore)). Then, we run the chatbot by invoking the run() method (chatbot.run()).

Here’s an example of a conversation that happens over a few turns:

User: Hello, I have a question

Chatbot:
Hello! What's your question? I'm here to help you in any way I can.
----------------------------------------------------------------------------------------------------

User: What’s the difference between word and sentence embeddings
Retrieving information...
Chatbot:
Word embeddings associate words with lists of numbers. Similar words are assigned numbers that are mathematically close while dissimilar words are assigned numbers that are far apart. 

Sentence embeddings do the same thing as word embeddings, but for sentences. Each sentence is associated with a vector of numbers in a coherent way. This means that similar sentences are assigned similar vectors and dissimilar sentences are assigned different vectors.

CITATIONS:
start=0 end=15 text='Word embeddings' document_ids=['doc_0']
start=16 end=54 text='associate words with lists of numbers.' document_ids=['doc_0']
start=55 end=68 text='Similar words' document_ids=['doc_0']
start=82 end=119 text='numbers that are mathematically close' document_ids=['doc_0']
start=126 end=142 text='dissimilar words' document_ids=['doc_0']
start=156 end=183 text='numbers that are far apart.' document_ids=['doc_0']
start=186 end=205 text='Sentence embeddings' document_ids=['doc_0', 'doc_2']
start=213 end=242 text='same thing as word embeddings' document_ids=['doc_0', 'doc_2']
start=263 end=276 text='Each sentence' document_ids=['doc_0', 'doc_2']
start=298 end=315 text='vector of numbers' document_ids=['doc_0', 'doc_2']
start=321 end=329 text='coherent' document_ids=['doc_2']
start=351 end=368 text='similar sentences' document_ids=['doc_0', 'doc_2']
start=382 end=397 text='similar vectors' document_ids=['doc_0', 'doc_2']
start=402 end=422 text='dissimilar sentences' document_ids=['doc_0', 'doc_2']
start=436 end=454 text='different vectors.' document_ids=['doc_0', 'doc_2']

DOCUMENTS:
{'id': 'doc_0', 'text': 'In the previous chapters, you learned about word and sentence embeddings and similarity between words and sentences. In short, a word embedding is a way to associate words with lists of numbers (vectors) in such a way that similar words are associated with numbers that are close by, and dissimilar words with numbers that are far away from each other. A sentence embedding does the same thing, but associating a vector to every sentence. Similarity is a way to measure how similar two words (or', 'title': 'The Attention Mechanism', 'url': 'https://docs.cohere.com/docs/the-attention-mechanism'}
{'id': 'doc_1', 'text': 'Sentence embeddings\n\nSo word embeddings seem to be pretty useful, but in reality, human language is much more complicated than simply a bunch of words put together. Human language has structure, sentences, etc. How would one be able to represent, for instance, a sentence? Well, here’s an idea. How about the sums of scores of all the words? For example, say we have a word embedding that assigns the following scores to these words:\n\nNo: [1,0,0,0]\n\nI: [0,2,0,0]\n\nAm: [-1,0,1,0]\n\nGood: [0,0,1,3]', 'title': 'Text Embeddings', 'url': 'https://docs.cohere.com/docs/text-embeddings'}
{'id': 'doc_2', 'text': 'This is where sentence embeddings come into play. A sentence embedding is just like a word embedding, except it associates every sentence with a vector full of numbers, in a coherent way. By coherent, I mean that it satisfies similar properties as a word embedding. For instance, similar sentences are assigned to similar vectors, different sentences are assigned to different vectors, and most importantly, each of the coordinates of the vector identifies some (whether clear or obscure) property of', 'title': 'Text Embeddings', 'url': 'https://docs.cohere.com/docs/text-embeddings'}

----------------------------------------------------------------------------------------------------

User: And what are their similarities
Retrieving information...
Chatbot:
The similarity between word and sentence embeddings lies in the fact that they both measure similarity between items. For example, if two sentences are very similar, their corresponding vectors will also be similar. This is best illustrated with an example: 

The similarities between the following sentences can be computed using sentence embeddings:
1. Who was the 16th president of the US and fought in the American Civil War?
2. The American Civil War saw the 16th President, Abraham Lincoln, attempt to preserve the Union.
3. Lincoln was the 16th president of the United States.

The similarity between sentences 1 and 2 is 6738.2859, which is very high. On the other hand, the similarities between sentences 1 and 3, and 2 and 3, are much lower at -122.2267 and -3.4946 respectively.

CITATIONS:
start=84 end=102 text='measure similarity' document_ids=['doc_0', 'doc_2']
start=131 end=215 text='if two sentences are very similar, their corresponding vectors will also be similar.' document_ids=['doc_0']
start=589 end=638 text='similarity between sentences 1 and 2 is 6738.2859' document_ids=['doc_1']
start=683 end=734 text='similarities between sentences 1 and 3, and 2 and 3' document_ids=['doc_1']
start=754 end=775 text='-122.2267 and -3.4946' document_ids=['doc_1']

DOCUMENTS:
{'id': 'doc_0', 'text': 'Notice that these sentences are all very similar. In particular, the three highlighted sentences pretty much have the same meaning. If you look at their corresponding vectors, these are also really similar. That is exactly what an embedding should do.', 'title': 'Text Embeddings', 'url': 'https://docs.cohere.com/docs/text-embeddings'}
{'id': 'doc_1', 'text': 'And the results are:\n\nThe similarity between sentences 1 and 2: 6738.2858668486715\n\nThe similarity between sentences 1 and 3: -122.22666955510499\n\nThe similarity between sentences 2 and 3: -3.494608113647928\n\nThese results certainly confirm our predictions. The similarity between sentences 1 and 2 is 6738, which is high. The similarities between sentences 1 and 3, and 2 and 3, are -122 and -3.5 (dot products are allowed to be negative too!), which are much lower.', 'title': 'Similarity Between Words and Sentences', 'url': 'https://docs.cohere.com/docs/similarity-between-words-and-sentences'}
{'id': 'doc_2', 'text': 'The similarity between each sentence and itself is 1 (the diagonal in the plot), which is consistent with our expectations. Furthermore, a sentence and itself represent the same point in space, which gives an angle of 0 with the origin, so it makes sense that the similarity is the cosine of 0, which is 1!\n\nConclusion', 'title': 'Similarity Between Words and Sentences', 'url': 'https://docs.cohere.com/docs/similarity-between-words-and-sentences'}

----------------------------------------------------------------------------------------------------

User: What do you know about 5G networks
Retrieving information...
Chatbot:
Unfortunately, I could not find any information about 5G networks in the available documentation. However, I can tell you about the 4G networks which have preceded 5G. 4G networks enable a high-speed connection and were designed to support a wide range of functions on mobile devices, including video streaming and high-quality music streaming. They also support a wider coverage area and better spectral efficiency, allowing more devices to connect simultaneously.
----------------------------------------------------------------------------------------------------

User: quit
Ending chat.

In the conversation above, notice a few observations that reflect the different components of what we built:

Direct response: For user messages that don’t require retrieval (“Hello, I have a question”), the chatbot responds directly without requiring retrieval.
Citation generation: For responses that do require retrieval (“What’s the difference between word and sentence embeddings”), the endpoint returns the response together with the citations.
State management: The endpoint maintains the state of the conversation via the conversation_id parameter, for example, by being able to correctly respond to a vague user message such as “And what are their similarities”.
Response synthesis: The model can decide if none of the retrieved documents provide the necessary information required to answer a user message. For example, when asked the question, “What do you know about 5G networks”, the chatbot goes on and retrieves external information from the index. However, it doesn’t use any of the information in its response as none of them is relevant to the question.

Conclusion

In this chapter, you learned how to build a RAG-powered chatbot with the Chat endpoint. With access to a collection of documents, the chatbot is able to provide contextually relevant responses to user requests, along with verifiable citations.

We used the Chat endpoint in document mode. This mode highlights the modularity of the endpoint, giving developers the flexibility to customize each component of the system.

An alternative to this is connectors mode. It abstracts away some of the steps we saw in document mode, which makes it simpler to build applications. It also makes it easy to connect to enterprise data sources and do that at scale.

Continue to the next chapter to learn about connectors and how to build RAG applications using the web search connector.
