How to Build RAG Applications Over Large-Scale Data

Part 5 of the LLM University module on Retrieval-Augmented Generation.

In this chapter, you’ll learn how to build RAG applications over multiple datastores and long documents.

We’ll use Cohere’s Python SDK for the code examples. Follow along in this notebook. Note: To run the notebook, you must first deploy your own Google Drive connector as a web-based REST API (we covered the steps in the previous chapter).

In the previous chapter, you learned how to build your own Google Drive connector, which is one of 80+ pre-built quickstart connectors available.

In this chapter, you’ll learn how to use connectors at scale: connecting to multiple datastores, working with large volumes of documents, and handling long documents. Enterprises need a RAG system that can efficiently handle vast amounts of data from diverse sources, and you’ll see how much of this can be automated with the Chat endpoint.

In an enterprise setting, data is distributed across multiple platforms and datastores. The real value of using connectors comes from being able to use several of them at the same time. This maximizes the RAG system’s potential as an intelligent knowledge assistant: it has access to various data sources and can synthesize information from all of them.

Step-by-Step Guide

Let’s now look at an example of using the two connectors we used in the previous two chapters: Google Drive and web search.

An overview of what we'll build

Setup

First, let’s install the cohere library, import it along with the other modules the chatbot code below relies on, and create a Cohere client using an API key.

pip install cohere

import uuid
from typing import List

import cohere
from cohere import ChatConnector

co = cohere.Client("COHERE_API_KEY")

Using Multiple Connectors

In the previous two chapters, we only examined examples where one connector was defined at a time. However, the Chat endpoint can accept multiple connectors and retrieve information from all the defined connectors.

To create a chatbot, we can reuse the exact same code we used in the previous chapter.

class Chatbot:
    def __init__(self, connectors: List[str]):
        """
        Initializes an instance of the Chatbot class.

        Parameters:
        connectors (List[str]): The list of connector IDs to retrieve documents from.
        """
        self.conversation_id = str(uuid.uuid4())
        self.connectors = [ChatConnector(id=connector) for connector in connectors]

    def run(self):
        """
        Runs the chatbot application.

        """
        while True:
            # Get the user message
            message = input("User: ")

            # Typing "quit" ends the conversation
            if message.lower() == "quit":
                print("Ending chat.")
                break
            # else:                         # Uncomment for Google Colab to avoid printing the same thing twice
            #     print(f"User: {message}") # Uncomment for Google Colab to avoid printing the same thing twice

            # Generate response
            response = co.chat_stream(
                message=message,
                model="command-r",
                conversation_id=self.conversation_id,
                connectors=self.connectors,
            )

            # Print the chatbot response, citations, and documents
            print("\nChatbot:")
            citations = []
            cited_documents = []

            # Display response
            for event in response:
                if event.event_type == "text-generation":
                    print(event.text, end="")
                elif event.event_type == "citation-generation":
                    citations.extend(event.citations)
                elif event.event_type == "search-results":
                    cited_documents = event.documents

            # Display citations and source documents
            if citations:
                print("\n\nCITATIONS:")
                for citation in citations:
                    print(citation)

                print("\nDOCUMENTS:")
                for document in cited_documents:
                    print({'id': document['id'],
                           'snippet': document['snippet'][:50] + '...',
                           'title': document['title'],
                           'url': document['url']})

            print(f"\n{'-'*100}\n")

And when running the chatbot, we define the connectors we want the endpoint to retrieve information from.

The Chatbot class has already been prepared to accept multiple connectors.

class Chatbot:
    def __init__(self, connectors: List[str]):
        ...
        self.connectors = [ChatConnector(id=connector) for connector in connectors]
    ...

What’s actually sent as the connectors parameter in the endpoint call is the following.

response = co.chat_stream(
        message=message,
        connectors=[ChatConnector(id="demo-conn-gdrive-6bfrp6"), ChatConnector(id="web-search")],
        ...
)

When creating the Chatbot instance, we define the connector IDs as a list of strings.

# Define connectors
connectors = ["demo-conn-gdrive-6bfrp6", "web-search"]

# Create an instance of the Chatbot class
chatbot = Chatbot(connectors)

# Run the chatbot
chatbot.run()

Here’s an example conversation. The chatbot uses information retrieved from both sources, as can be seen in the list of source documents.

User: What is chain of thought prompting

Chatbot:
Chain of thought prompting is a technique used with large language models (LLMs) to enhance their reasoning capabilities. The LLM is presented with a few examples demonstrating a step-by-step reasoning process leading to a correct answer. This method can be employed when dealing with complex problems that require breaking down into smaller, more manageable parts. 

For instance, if you were to ask an LLM to solve a linear equation, you would first show how to solve this type of equation by outlining the intermediate steps. The LLM would then attempt to solve the given problem using a similar step-by-step approach.

This prompting technique is particularly useful for arithmetic, commonsense, and symbolic reasoning tasks and can be combined with few-shot prompting for better results on more complex problems.

CITATIONS:
start=52 end=73 text='large language models' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11', 'demo-conn-gdrive-6bfrp6_12']
start=74 end=80 text='(LLMs)' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11', 'demo-conn-gdrive-6bfrp6_12']
start=84 end=121 text='enhance their reasoning capabilities.' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=148 end=162 text='a few examples' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'demo-conn-gdrive-6bfrp6_11']
start=179 end=209 text='step-by-step reasoning process' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=223 end=238 text='correct answer.' document_ids=['web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'demo-conn-gdrive-6bfrp6_11']
start=285 end=301 text='complex problems' document_ids=['web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7']
start=334 end=365 text='smaller, more manageable parts.' document_ids=['web-search_4', 'web-search_6']
start=411 end=434 text='solve a linear equation' document_ids=['web-search_2']
start=452 end=491 text='show how to solve this type of equation' document_ids=['web-search_2']
start=495 end=528 text='outlining the intermediate steps.' document_ids=['web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'demo-conn-gdrive-6bfrp6_11']
start=548 end=621 text='attempt to solve the given problem using a similar step-by-step approach.' document_ids=['web-search_2', 'web-search_4']
start=675 end=685 text='arithmetic' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=687 end=698 text='commonsense' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=704 end=728 text='symbolic reasoning tasks' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=740 end=772 text='combined with few-shot prompting' document_ids=['web-search_1', 'web-search_2']
start=777 end=817 text='better results on more complex problems.' document_ids=['web-search_1']

DOCUMENTS:
{'id': 'web-search_0', 'text': 'Skip to main content\n\nWe gratefully acknowledge su...'}
{'id': 'web-search_1', 'text': 'General Tips for Designing Prompts\n\nChain-of-Thoug...'}
{'id': 'web-search_2', 'text': 'BlogDocsCommunityHackAPrompt Playground\n\nLanguage ...'}
{'id': 'web-search_3', 'text': 'We now support using Microsoft Azure hosted OpenAI...'}
{'id': 'web-search_4', 'text': 'Comprehensive Guide to Chain-of-Thought Prompting\n...'}
{'id': 'web-search_5', 'text': 'ResourcesArticleChain-of-Thought Prompting: Helpin...'}
{'id': 'web-search_6', 'text': 'Let’s Think Step by Step: Advanced Reasoning in Bu...'}
{'id': 'web-search_7', 'text': 'Unraveling the Power of Chain-of-Thought Prompting...'}
{'id': 'web-search_8', 'text': 'AboutPressCopyrightContact usCreatorsAdvertiseDeve...'}
{'id': 'web-search_9', 'text': 'Skip to main content\n\nLanguage Models Perform Reas...'}
{'id': 'demo-conn-gdrive-6bfrp6_10', 'text': "\ufeffChaining Prompts\r\nIn this chapter, you'll learn a..."}
{'id': 'demo-conn-gdrive-6bfrp6_11', 'text': "\ufeffConstructing Prompts\r\nIn this chapter, you'll lea..."}
{'id': 'demo-conn-gdrive-6bfrp6_12', 'text': "\ufeffUse Case Patterns\r\nIn this chapter, you'll learn ..."}
{'id': 'demo-conn-gdrive-6bfrp6_13', 'text': "\ufeffEvaluating Outputs\r\nIn this chapter, you'll learn..."}
{'id': 'demo-conn-gdrive-6bfrp6_14', 'text': "\ufeffValidating Outputs\r\nIn this chapter, you'll learn..."}

----------------------------------------------------------------------------------------------------

User: tell me more

Chatbot:
Chain of thought prompting is a technique that guides LLMs to follow a reasoning process by providing them with a few examples that clearly outline each step of the reasoning. This method, also known as few-shot prompting, is employed for complex problems that require a series of reasoning steps to solve. 

The LLM is expected to study the example and follow a similar pattern when answering, breaking down the problem into smaller, more manageable parts. This approach not only improves the LLM's performance on complex tasks but also offers interpretability into its thought process.

Few-shot prompting is distinct from zero-shot prompting, where the LLM is only given the problem and no examples. Zero-shot chain-of-thought prompting, however, involves adding a phrase like "Let's think step by step" to the original prompt to guide the LLM's reasoning. 

Chain of thought prompting has shown remarkable effectiveness in improving LLMs' abilities in arithmetic, commonsense, and symbolic reasoning tasks. Nevertheless, it is not without its limitations. For instance, it works best with larger models, typically those with around 100 billion parameters, as smaller models often produce illogical thought chains.

CITATIONS:
start=47 end=58 text='guides LLMs' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=71 end=88 text='reasoning process' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=114 end=126 text='few examples' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=140 end=175 text='outline each step of the reasoning.' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'demo-conn-gdrive-6bfrp6_11']
start=203 end=221 text='few-shot prompting' document_ids=['web-search_2', 'web-search_3', 'web-search_4', 'demo-conn-gdrive-6bfrp6_11']
start=239 end=255 text='complex problems' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=271 end=296 text='series of reasoning steps' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=363 end=378 text='similar pattern' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=426 end=457 text='smaller, more manageable parts.' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=481 end=511 text="improves the LLM's performance" document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=545 end=561 text='interpretability' document_ids=['web-search_2', 'web-search_4', 'web-search_5', 'web-search_6']
start=625 end=644 text='zero-shot prompting' document_ids=['web-search_4', 'demo-conn-gdrive-6bfrp6_11']
start=703 end=739 text='Zero-shot chain-of-thought prompting' document_ids=['web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'demo-conn-gdrive-6bfrp6_11']
start=780 end=806 text='"Let\'s think step by step"' document_ids=['web-search_1', 'web-search_3', 'web-search_4', 'web-search_6', 'demo-conn-gdrive-6bfrp6_11']
start=833 end=859 text="guide the LLM's reasoning." document_ids=['web-search_1', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'demo-conn-gdrive-6bfrp6_11']
start=956 end=966 text='arithmetic' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=968 end=979 text='commonsense' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=985 end=1010 text='symbolic reasoning tasks.' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=1093 end=1106 text='larger models' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_6', 'web-search_7', 'web-search_9']
start=1136 end=1158 text='100 billion parameters' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_6', 'web-search_7', 'web-search_9']
start=1163 end=1217 text='smaller models often produce illogical thought chains.' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_6', 'web-search_7']

DOCUMENTS:
{'id': 'web-search_0', 'text': 'Skip to main content\n\nWe gratefully acknowledge su...'}
{'id': 'web-search_1', 'text': 'General Tips for Designing Prompts\n\nChain-of-Thoug...'}
{'id': 'web-search_2', 'text': 'BlogDocsCommunityHackAPrompt Playground\n\nLanguage ...'}
{'id': 'web-search_3', 'text': 'We now support using Microsoft Azure hosted OpenAI...'}
{'id': 'web-search_4', 'text': 'Comprehensive Guide to Chain-of-Thought Prompting\n...'}
{'id': 'web-search_5', 'text': 'ResourcesArticleChain-of-Thought Prompting: Helpin...'}
{'id': 'web-search_6', 'text': 'Let’s Think Step by Step: Advanced Reasoning in Bu...'}
{'id': 'web-search_7', 'text': 'Unraveling the Power of Chain-of-Thought Prompting...'}
{'id': 'web-search_8', 'text': 'AboutPressCopyrightContact usCreatorsAdvertiseDeve...'}
{'id': 'web-search_9', 'text': 'Skip to main content\n\nLanguage Models Perform Reas...'}
{'id': 'demo-conn-gdrive-6bfrp6_10', 'text': "\ufeffChaining Prompts\r\nIn this chapter, you'll learn a..."}
{'id': 'demo-conn-gdrive-6bfrp6_11', 'text': "\ufeffConstructing Prompts\r\nIn this chapter, you'll lea..."}
{'id': 'demo-conn-gdrive-6bfrp6_12', 'text': "\ufeffUse Case Patterns\r\nIn this chapter, you'll learn ..."}
{'id': 'demo-conn-gdrive-6bfrp6_13', 'text': "\ufeffEvaluating Outputs\r\nIn this chapter, you'll learn..."}
{'id': 'demo-conn-gdrive-6bfrp6_14', 'text': "\ufeffValidating Outputs\r\nIn this chapter, you'll learn..."}

----------------------------------------------------------------------------------------------------

User: quit
Ending chat.


Handling Long Documents and Large Volumes of Documents

With all these documents coming from various connectors, you may be asking a couple of questions:

  • How do we handle long documents? Connecting to multiple connectors means dealing with various APIs, each with its own way of providing documents. Some may return a complete document spanning tens or hundreds of pages. This creates a couple of problems. First, stuffing a long document into an LLM prompt can exceed the model’s context limit, resulting in an error. Second, even if the context limit is not reached, the response will likely suffer, because the prompt contains a lot of irrelevant information from the long document instead of the specific chunks that are most relevant.
  • How do we handle multiple documents coming from multiple connectors and queries? For a single connector, the retrieval and reranking implementation is within the developer’s control. With multiple connectors, it is not, because the documents are aggregated at the Chat endpoint. As the number of connectors increases, this becomes a bigger problem: we don’t control the relevancy of the documents sent to the LLM prompt, and the risk of exceeding the context limit remains. Furthermore, if more than one query is generated, the number of retrieved documents multiplies accordingly.

The Chat endpoint solves these problems with its automated chunking and reranking process. Let’s see how it’s done.

Note that for this to happen, the prompt_truncation parameter must be set to AUTO (the default) and not OFF.
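For example, a minimal call with the parameter shown explicitly (using the same connector IDs as above) might look like this:

response = co.chat_stream(
    message="What is chain of thought prompting?",
    model="command-r",
    connectors=[ChatConnector(id="demo-conn-gdrive-6bfrp6"), ChatConnector(id="web-search")],
    prompt_truncation="AUTO",  # the default; "OFF" disables the automated handling described below
)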

Chunking

With every document sent by the connectors, the first step is to split it into smaller chunks. Each chunk is between 100 and 400 words, and sentences are kept intact where possible.

Chunking the retrieved documents

Going back to the example responses, notice that some document IDs take a form like web-search_5:2. This contains not just the document ID (web-search_5 in this example) but also a chunk number separated by a colon (2 in this example). If we concatenate web-search_5:0, web-search_5:1, web-search_5:2, and so on, we get back the original document.
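The endpoint performs this chunking internally, and its exact logic is not public, but to make the idea concrete, here is a minimal, illustrative sketch: a naive regex sentence splitter that packs whole sentences into chunks of at most 400 words (the real implementation also enforces a minimum chunk size).

import re

def chunk_document(text: str, max_words: int = 400) -> list:
    # Split on sentence boundaries, then pack whole sentences into chunks
    # of at most max_words words, keeping sentences intact.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks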

Reranking

The Chat endpoint then uses the Rerank endpoint to take all the chunked documents from all connectors and rerank them based on contextual relevance to the query.

Reranking the chunked documents

Reranking happens independently for each query and each connector. For example, let’s say a user asks the question, “What is AI and how can enterprises use it?”, resulting in two queries generated by the endpoint: “What is AI?” and “How can enterprises use AI?” Also, let’s assume there are two connectors: “web search” and “notion.”

This means there will be four lists of chunked documents (two queries times two connectors), each reranked separately.
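In code terms, the query-connector pairs look like this (the connector names here are illustrative):

from itertools import product

queries = ["What is AI?", "How can enterprises use AI?"]
connector_ids = ["web-search", "notion"]

# Each query-connector pair yields its own list of chunks, reranked separately
for query, connector_id in product(queries, connector_ids):
    print(f"Rerank the '{connector_id}' chunks against: {query}")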

The reranking step takes the top 20 chunks from each list and drops the rest.
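This reranking happens inside the Chat endpoint, but the same Rerank endpoint is also available directly. A minimal sketch for one query-connector pair (assuming the rerank-english-v3.0 model and a list of chunk strings) could look like this:

chunks = [
    "AI, or artificial intelligence, is the simulation of human intelligence by machines...",
    "Enterprises apply AI to automate workflows, analyze data, and assist employees...",
]

results = co.rerank(
    model="rerank-english-v3.0",
    query="What is AI?",
    documents=chunks,
    top_n=20,  # keep at most the top 20 chunks, drop the rest
)

# Reorder the chunks by contextual relevance to the query
top_chunks = [chunks[r.index] for r in results.results]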

Interleaving

The reranked documents from the different lists are then interleaved into one list.

Interleaving the reranked chunks

With our example above, let’s say that these are the four lists of reranked documents:

  • Web search results (“What is AI”): web_ai_1, web_ai_2, web_ai_3
  • Notion search results (“What is AI”): notion_ai_1, notion_ai_2, notion_ai_3
  • Web search results (“How can enterprises use AI”): web_enterprise_1, web_enterprise_2, web_enterprise_3
  • Notion search results (“How can enterprises use AI”): notion_enterprise_1, notion_enterprise_2, notion_enterprise_3

The documents will be interleaved in a list in this order:

  • Documents: web_ai_1, notion_ai_1, web_enterprise_1, notion_enterprise_1, web_ai_2, notion_ai_2, web_enterprise_2, notion_enterprise_2, web_ai_3, notion_ai_3, web_enterprise_3, notion_enterprise_3

This list is what gets sent to the LLM prompt.
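The endpoint does this interleaving internally, but the round-robin order is easy to reproduce. Here is a small sketch using the lists above:

from itertools import chain, zip_longest

def interleave(*ranked_lists):
    # Round-robin merge: the top chunk of each list first, then the second of each, and so on
    return [doc for doc in chain.from_iterable(zip_longest(*ranked_lists)) if doc is not None]

web_ai = ["web_ai_1", "web_ai_2", "web_ai_3"]
notion_ai = ["notion_ai_1", "notion_ai_2", "notion_ai_3"]
web_enterprise = ["web_enterprise_1", "web_enterprise_2", "web_enterprise_3"]
notion_enterprise = ["notion_enterprise_1", "notion_enterprise_2", "notion_enterprise_3"]

print(interleave(web_ai, notion_ai, web_enterprise, notion_enterprise))
# ['web_ai_1', 'notion_ai_1', 'web_enterprise_1', 'notion_enterprise_1', 'web_ai_2', ...]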

Prompt Building

With the prompt_truncation parameter set to AUTO, some elements from chat_history and documents will be dropped in an attempt to construct a prompt that fits within the model's context length limit.

Documents and chat history are added iteratively until the prompt would exceed the model's context length. The resulting prompt is then passed to the Command model for response generation.
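The endpoint handles this internally; the following rough sketch only illustrates the idea, using word count as a crude stand-in for the model's actual tokenizer:

def word_count(items):
    # Crude proxy for a real tokenizer: count whitespace-separated words
    return sum(len(str(item).split()) for item in items)

def select_prompt_items(chat_history, documents, budget=3500):
    # Add chat history and documents one by one until the budget would be exceeded
    included = []
    for item in chat_history + documents:
        if word_count(included + [item]) > budget:
            break
        included.append(item)
    return included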

Conclusion

In this chapter, you learned how to use connectors at scale. The Chat endpoint allows you to define multiple connectors in a single call and aggregates the documents retrieved from them. You can also leverage the automated handling of long documents and large volumes of documents, where the endpoint takes care of chunking, reranking, and interleaving, as well as prompt building.


About Cohere’s LLM University

Our comprehensive curriculum aims to equip you with the skills to develop your own AI applications. We cater to learners from all backgrounds, covering everything from the basics to the most advanced topics in large language models (LLMs). Plus, you'll have the opportunity to work on hands-on exercises, allowing you to build and deploy your very own solutions. Take a course today.

This LLMU module consists of the following chapters:

  1. Introduction to RAG
  2. RAG with Chat, Embed, and Rerank
  3. RAG with Connectors
  4. RAG with Quickstart Connectors
  5. RAG over Large-Scale Data (this chapter)