Welcome to LLM University's module on Retrieval-Augmented Generation (RAG)!

By the end of this module, you will be able to build RAG-powered applications by leveraging various Cohere endpoints—Chat, Embed, and Rerank. You will also learn how to use quickstart connectors, which are pre-built implementations that connect a RAG application to over 80 enterprise datastores.

This module consists of the following chapters:

Introduction to RAG (this chapter): Learn the basics of RAG and how to get started with RAG via the Chat endpoint.
RAG with Chat, Embed, and Rerank: Learn how to build a RAG-powered chatbot using the Chat, Embed, and Rerank endpoints.
RAG with Connectors: Learn about connectors and how to build RAG applications using the web search connector.
RAG with Quickstart Connectors: Learn how to connect RAG applications to datastores by leveraging Cohere’s pre-built quickstart connectors.
RAG over Large-Scale Data: Learn how to build RAG applications over multiple datastores and long documents.

We’ll use Cohere’s Python SDK for the code examples. Follow along in this notebook.

Contents

What Is RAG?
RAG with Cohere
Try It with Coral
Step-by-Step Guide
Setup
Define the Documents
Generate the Response with Citations
Conclusion

What Is RAG?

While LLMs are good at maintaining the context of a conversation and generating responses, they are prone to hallucination and can include factually incorrect or incomplete information in their responses.

Retrieval-augmented generation (RAG) is a technique that improves LLM responses by incorporating external data sources. By letting the model access and use supplementary information from external documents, RAG reduces hallucination and improves the accuracy of the model’s responses.

In a previous module, we discussed how to build a chatbot using Cohere’s Chat endpoint. In this module, we’ll discuss the endpoint's RAG capabilities. This means you can build chatbots that can connect to external documents, ground their responses on these documents, and produce inline citations in their responses.

The chatbot provides helpful and verifiable responses through citations

Having RAG in a chat paradigm means you can build context-aware applications that are able to both maintain the state of a conversation and generate grounded responses.

The Chat endpoint adds RAG capabilities to the chat paradigm

RAG with Cohere

The Cohere Chat endpoint comes with RAG features already integrated. This greatly simplifies the task of developing RAG-powered applications.

With Cohere Chat, you get the complete set of building blocks needed to build a high-quality RAG application in the shortest time possible. We’ll cover them in depth throughout this module, but first, let’s take a quick look at some key capabilities of Cohere’s RAG solution.

Chat interface: The RAG functionalities run on the Chat endpoint. That means everything is wrapped in a chat interface and powered by the Command model. Thus, you can build chatbots that have the full context of a conversation and are not limited to a single interaction.
Query generation: With Cohere’s RAG solution, you also get an LLM that’s trained for query generation. It takes a user message and transforms it into queries that are more relevant and optimized for the retrieval process.
Retrieval models: Cohere Embed helps you build a high-quality semantic search system that retrieves the most relevant documents using embeddings. On top of that, Cohere Rerank helps you boost the results further by reranking the search results based on relevance.
Response generation: Cohere’s RAG solution gives you an LLM that can provide the right responses to the user in different scenarios. A good RAG system should generate a grounded response based on relevant documents, but it should not do that every single time. The system also has to be able to determine whether or not any of the provided documents are relevant (and possibly decide that none are relevant), as well as decide that it can directly respond without needing any documents retrieved.
Fine-grained citation: Each grounded response includes fine-grained citations linking back to the source documents. This makes the response easily verifiable and builds trust with the user.
Connector mode: What makes RAG work is having the data in the first place. In enterprises, data is typically spread across many platforms, and integrating data sources into a RAG system can be a huge challenge. Cohere Chat comes with a “connector mode,” which makes it easy to connect to multiple datastores.
Quickstart connectors: Cohere's quickstart connectors allow you to quickly get up and running. More than 80 pre-built connectors are ready to use, including those for Google Drive, Slack, GitHub, Elastic, Pinecone, and more.
Automated document handling: One common challenge in RAG is handling long documents at scale. The Cohere API provides an option for automating document handling, from chunking documents to fitting them into a prompt.
Document mode: For developers who want greater control over each component of a RAG system, Cohere Chat in document mode provides the modularity and flexibility needed to design such systems.
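
To make the retrieval step in the list above more concrete, here is a minimal, self-contained sketch of ranking documents by vector similarity. It does not call Cohere Embed or Rerank: toy_embed is a hypothetical stand-in that uses plain word counts, and cosine and retrieve are illustrative names. In a real system you would swap in embeddings from an embedding model and typically rerank the top results.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    # Hypothetical stand-in for an embedding model: lowercase word counts.
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, docs, top_k=2):
    # Rank documents by similarity to the query and keep the top_k.
    query_vec = toy_embed(query)
    ranked = sorted(docs, key=lambda d: cosine(query_vec, toy_embed(d)), reverse=True)
    return ranked[:top_k]


docs = [
    "Emperor penguins are the tallest.",
    "Emperor penguins only live in Antarctica.",
    "Animals are different from plants.",
]
query = "What are the tallest living penguins?"
print(retrieve(query, docs, top_k=1))
```

Word counts only capture exact word overlap, which is why semantic embeddings and a reranking step matter in practice.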

Try It with Coral

To see Cohere-powered RAG in action, you can try Coral, a conversational AI toolkit that enterprises can use to build RAG-enabled knowledge assistants. Coral includes some document grounding functionalities out of the box, such as web search results, specific domain grounding, and PDF document support.

Users can engage Coral by entering a prompt to find answers from across their documents. Generated responses include citations of the information sources used, which makes the responses verifiable and helps mitigate LLM hallucinations.

A screenshot of Coral, Cohere's conversational AI toolkit for enterprises

Step-by-Step Guide

Let’s start our exploration of RAG with a quick example.

We’ll walk through how to ground an LLM’s response with information from external documents and provide document citations along with it. In this example, we’ll use a static, short list of documents. Below is a diagram that provides an overview of our simple RAG system.

Setup

First, let’s install and import the cohere library, and then create a Cohere client using an API key.

pip install cohere

import cohere
co = cohere.Client("COHERE_API_KEY")
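
The string passed to the client above is a placeholder for your own key. One common way to avoid hardcoding a key is to read it from an environment variable. The sketch below assumes the key is stored in a variable named COHERE_API_KEY; both that name and the load_api_key helper are illustrative choices, not something the SDK requires.

```python
import os


def load_api_key(env_var="COHERE_API_KEY"):
    # Read the key from the environment and fail early if it is missing.
    key = os.environ.get(env_var)
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key


# Usage (assumes the variable has been set in your shell or notebook):
# co = cohere.Client(load_api_key())
```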

Define the Documents

Next, we define the documents that we want to ground an LLM’s response with, formatted as a list. In our case, each document consists of two fields: title and text.

Each document in the list has a “text” field containing the information we want the model to use. Keep the snippet in each document relatively short: the recommended length is 300 words or less. We recommend using field names similar to the ones in this example (i.e., “title” and “text”), but RAG is quite flexible about how you structure the documents. You can give the fields any names you want, and you can pass in other fields as well, such as a “date” field. All field names and field values are passed to the model.

documents = [
    {
        "title": "Tall penguins",
        "text": "Emperor penguins are the tallest."},
    {
        "title": "Penguin habitats",
        "text": "Emperor penguins only live in Antarctica."},
    {
        "title": "What are animals?",
        "text": "Animals are different from plants."}
]
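
As a small illustration of the flexibility described above, the sketch below adds a “date” field to each document and checks every snippet against the 300-word recommendation. The long_snippets helper is hypothetical and the date values are made up for the example.

```python
MAX_WORDS = 300  # recommended upper bound per snippet


def long_snippets(documents, text_field="text", max_words=MAX_WORDS):
    # Return the documents whose text field exceeds max_words words.
    return [
        doc for doc in documents
        if len(doc.get(text_field, "").split()) > max_words
    ]


# Same documents as above, with an extra "date" field (made-up values).
documents_with_dates = [
    {"title": "Tall penguins",
     "text": "Emperor penguins are the tallest.",
     "date": "2023-01-01"},
    {"title": "Penguin habitats",
     "text": "Emperor penguins only live in Antarctica.",
     "date": "2023-01-02"},
    {"title": "What are animals?",
     "text": "Animals are different from plants.",
     "date": "2023-01-03"},
]

too_long = long_snippets(documents_with_dates)
print(f"{len(too_long)} document(s) exceed {MAX_WORDS} words")
```

A document flagged by this check is a candidate for splitting into shorter snippets before it is passed to the model.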

Generate the Response with Citations

Cohere’s RAG functionalities are part of the Chat endpoint, with the Command model as the underlying LLM. This allows developers to build chatbots that have the full context of a conversation and are not limited to a single interaction.

First, we define the message coming from the user. We’ll use a simple query, “What are the tallest living penguins?”, as an example.

# Get the user message
message = "What are the tallest living penguins?"

Then, we pass this message as the message parameter in a Chat endpoint call. We also pass the list of documents as the documents parameter. With the chat_stream method, the response is generated incrementally, token by token, so we don’t have to wait for the full completion.

# Generate the response
response = co.chat_stream(message=message,
                          documents=documents)

Finally, we display the response from the model. The streamed response will return different types of objects, and for now, we are interested in the text-generation objects, which contain the generated text.

We also display the citations and source documents, which we can get from the final object returned by the streamed response.

# Display the response
citations = []
cited_documents = []

for event in response:
    if event.event_type == "text-generation":
        print(event.text, end="")
    elif event.event_type == "citation-generation":
        citations.extend(event.citations)
    elif event.event_type == "stream-end":
        cited_documents = event.response.documents

# Display the citations and source documents
if citations:
    print("\n\nCITATIONS:")
    for citation in citations:
        print(citation)

    print("\nDOCUMENTS:")
    for document in cited_documents:
        print(document)

And here’s the response generated by our RAG system.

The tallest living penguins are emperor penguins, which are found only in Antarctica.

CITATIONS:
start=32 end=48 text='emperor penguins' document_ids=['doc_0']
start=66 end=85 text='only in Antarctica.' document_ids=['doc_1']

DOCUMENTS:
{'id': 'doc_0', 'text': 'Emperor penguins are the tallest.', 'title': 'Tall penguins'}
{'id': 'doc_1', 'text': 'Emperor penguins only live in Antarctica.', 'title': 'Penguin habitats'}

First, we get the actual text response from the model (see the output above).

This is followed by a list of citations, which are references to the specific source documents on our list that provided the information in specific passages of the text response. For example, the first citation indicates that the term “emperor penguins,” which spans character offsets 32 to 48 of the response (start=32, end=48), came from the first document on the list ('doc_0').

Finally, we get the full list of the source documents used to generate the response.
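
To see how the pieces of the output fit together, the sketch below replays the same event handling offline and uses the start and end offsets to place citation markers in the text. It makes no API call: the events are hand-built stand-ins (SimpleNamespace objects) that mirror the event types and fields used in this chapter, and collect_stream and annotate are hypothetical helper names.

```python
from types import SimpleNamespace as NS


def collect_stream(events):
    # Gather the text, citations, and cited documents from streamed events.
    text_parts, citations, cited_documents = [], [], []
    for event in events:
        if event.event_type == "text-generation":
            text_parts.append(event.text)
        elif event.event_type == "citation-generation":
            citations.extend(event.citations)
        elif event.event_type == "stream-end":
            cited_documents = event.response.documents
    return "".join(text_parts), citations, cited_documents


def annotate(text, citations):
    # Insert "[doc ids]" after each cited span, working from right to left
    # so that earlier offsets remain valid.
    for c in sorted(citations, key=lambda c: c.end, reverse=True):
        marker = " [" + ", ".join(c.document_ids) + "]"
        text = text[:c.end] + marker + text[c.end:]
    return text


# Stand-in events reproducing the output shown above.
fake_events = [
    NS(event_type="text-generation",
       text="The tallest living penguins are emperor penguins, "),
    NS(event_type="text-generation",
       text="which are found only in Antarctica."),
    NS(event_type="citation-generation",
       citations=[NS(start=32, end=48, text="emperor penguins",
                     document_ids=["doc_0"])]),
    NS(event_type="citation-generation",
       citations=[NS(start=66, end=85, text="only in Antarctica.",
                     document_ids=["doc_1"])]),
    NS(event_type="stream-end",
       response=NS(documents=[{"id": "doc_0"}, {"id": "doc_1"}])),
]

text, citations, cited_documents = collect_stream(fake_events)

# Each citation's offsets slice out exactly the cited span.
for c in citations:
    assert text[c.start:c.end] == c.text

print(annotate(text, citations))
```

The start offset is inclusive and the end offset is exclusive, so text[start:end] returns the cited span directly.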

Conclusion

In this chapter, you learned about RAG and how to get started with RAG using the Cohere Chat endpoint.

Continue to the next chapter to learn how to build a RAG-powered chatbot that leverages text embeddings using the Chat, Embed, and Rerank endpoints.


Getting Started with Retrieval-Augmented Generation