How to Build RAG Applications Over Large-Scale Data

Part 5 of the LLM University module on Retrieval-Augmented Generation.

In this chapter, you’ll learn how to build RAG applications over multiple datastores and long documents.

We’ll use Cohere’s Python SDK for the code examples. Follow along in this notebook. Note: To run the notebook, you must first deploy your own Google Drive connector as a web-based REST API (we covered the steps in the previous chapter).

In the previous chapter, you learned how to build your own Google Drive connector, which is one of 80+ pre-built quickstart connectors available.

In this chapter, you’ll learn how to use connectors at scale: connecting to multiple datastores, working with large volumes of documents, and handling long documents. Enterprises need a RAG system that can efficiently handle vast amounts of data from diverse sources, and you’ll see how much of this can be automated with the Chat endpoint.

In an enterprise setting, data is distributed across multiple platforms and datastores. The real value of using connectors comes from being able to use several of them at the same time. This maximizes the RAG system’s potential as an intelligent knowledge assistant: it has access to various data sources and can synthesize information from all of them.

Step-by-Step Guide

Let’s now look at an example of using the two connectors we used in the previous two chapters: Google Drive and web search.

An overview of what we'll build

Setup

First, let’s install the cohere library, import it along with the other modules the chatbot code below relies on, and create a Cohere client using an API key.

pip install cohere

import uuid
from typing import List

import cohere
from cohere import ChatConnector

co = cohere.Client("COHERE_API_KEY")

Using Multiple Connectors

In the previous two chapters, we only examined examples where one connector was defined at a time. However, the Chat endpoint can accept multiple connectors and retrieve information from all the defined connectors.

To create a chatbot, we can reuse the exact same code we used in the previous chapter.

class Chatbot:
    def __init__(self, connectors: List[str]):
        """
        Initializes an instance of the Chatbot class.

        Parameters:
        connectors (List[str]): The list of connector IDs to retrieve documents from.
        """
        self.conversation_id = str(uuid.uuid4())
        self.connectors = [ChatConnector(id=connector) for connector in connectors]

    def run(self):
        """
        Runs the chatbot application.

        """
        while True:
            # Get the user message
            message = input("User: ")

            # Typing "quit" ends the conversation
            if message.lower() == "quit":
                print("Ending chat.")
                break
            # else:                         # Uncomment for Google Colab to avoid printing the same thing twice
            #     print(f"User: {message}") # Uncomment for Google Colab to avoid printing the same thing twice

            # Generate response
            response = co.chat_stream(
                message=message,
                model="command-r",
                conversation_id=self.conversation_id,
                connectors=self.connectors,
            )

            # Print the chatbot response, citations, and documents
            print("\nChatbot:")
            citations = []
            cited_documents = []

            # Display response
            for event in response:
                if event.event_type == "text-generation":
                    print(event.text, end="")
                elif event.event_type == "citation-generation":
                    citations.extend(event.citations)
                elif event.event_type == "search-results":
                    cited_documents = event.documents

            # Display citations and source documents
            if citations:
                print("\n\nCITATIONS:")
                for citation in citations:
                    print(citation)

                print("\nDOCUMENTS:")
                for document in cited_documents:
                    print({'id': document['id'],
                           'snippet': document['snippet'][:50] + '...',
                           'title': document['title'],
                           'url': document['url']})

            print(f"\n{'-'*100}\n")

And when running the chatbot, we define the connectors we want the endpoint to retrieve information from.

The Chatbot class has already been prepared to accept multiple connectors.

class Chatbot:
    def __init__(self, connectors: List[str]):
        ...
        self.connectors = [ChatConnector(id=connector) for connector in connectors]
    ...

What’s actually sent as the connectors parameter in the endpoint call is the following.

response = co.chat_stream(
        message=message,
        connectors=[ChatConnector(id="demo-conn-gdrive-6bfrp6"), ChatConnector(id="web-search")],
        ...
)

When creating the Chatbot instance, we define the connector IDs as a list of strings.

# Define connectors
connectors = ["demo-conn-gdrive-6bfrp6", "web-search"]

# Create an instance of the Chatbot class
chatbot = Chatbot(connectors)

# Run the chatbot
chatbot.run()

Here’s an example conversation. The chatbot uses information retrieved from both sources, as can be seen in the list of source documents.

User: What is chain of thought prompting

Chatbot:
Chain of thought prompting is a technique used with large language models (LLMs) to enhance their reasoning capabilities. The LLM is presented with a few examples demonstrating a step-by-step reasoning process leading to a correct answer. This method can be employed when dealing with complex problems that require breaking down into smaller, more manageable parts. 

For instance, if you were to ask an LLM to solve a linear equation, you would first show how to solve this type of equation by outlining the intermediate steps. The LLM would then attempt to solve the given problem using a similar step-by-step approach.

This prompting technique is particularly useful for arithmetic, commonsense, and symbolic reasoning tasks and can be combined with few-shot prompting for better results on more complex problems.

CITATIONS:
start=52 end=73 text='large language models' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11', 'demo-conn-gdrive-6bfrp6_12']
start=74 end=80 text='(LLMs)' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11', 'demo-conn-gdrive-6bfrp6_12']
start=84 end=121 text='enhance their reasoning capabilities.' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=148 end=162 text='a few examples' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'demo-conn-gdrive-6bfrp6_11']
start=179 end=209 text='step-by-step reasoning process' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=223 end=238 text='correct answer.' document_ids=['web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'demo-conn-gdrive-6bfrp6_11']
start=285 end=301 text='complex problems' document_ids=['web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7']
start=334 end=365 text='smaller, more manageable parts.' document_ids=['web-search_4', 'web-search_6']
start=411 end=434 text='solve a linear equation' document_ids=['web-search_2']
start=452 end=491 text='show how to solve this type of equation' document_ids=['web-search_2']
start=495 end=528 text='outlining the intermediate steps.' document_ids=['web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'demo-conn-gdrive-6bfrp6_11']
start=548 end=621 text='attempt to solve the given problem using a similar step-by-step approach.' document_ids=['web-search_2', 'web-search_4']
start=675 end=685 text='arithmetic' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=687 end=698 text='commonsense' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=704 end=728 text='symbolic reasoning tasks' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=740 end=772 text='combined with few-shot prompting' document_ids=['web-search_1', 'web-search_2']
start=777 end=817 text='better results on more complex problems.' document_ids=['web-search_1']

DOCUMENTS:
{'id': 'web-search_0', 'text': 'Skip to main content\n\nWe gratefully acknowledge su...'}
{'id': 'web-search_1', 'text': 'General Tips for Designing Prompts\n\nChain-of-Thoug...'}
{'id': 'web-search_2', 'text': 'BlogDocsCommunityHackAPrompt Playground\n\nLanguage ...'}
{'id': 'web-search_3', 'text': 'We now support using Microsoft Azure hosted OpenAI...'}
{'id': 'web-search_4', 'text': 'Comprehensive Guide to Chain-of-Thought Prompting\n...'}
{'id': 'web-search_5', 'text': 'ResourcesArticleChain-of-Thought Prompting: Helpin...'}
{'id': 'web-search_6', 'text': 'Let’s Think Step by Step: Advanced Reasoning in Bu...'}
{'id': 'web-search_7', 'text': 'Unraveling the Power of Chain-of-Thought Prompting...'}
{'id': 'web-search_8', 'text': 'AboutPressCopyrightContact usCreatorsAdvertiseDeve...'}
{'id': 'web-search_9', 'text': 'Skip to main content\n\nLanguage Models Perform Reas...'}
{'id': 'demo-conn-gdrive-6bfrp6_10', 'text': "\ufeffChaining Prompts\r\nIn this chapter, you'll learn a..."}
{'id': 'demo-conn-gdrive-6bfrp6_11', 'text': "\ufeffConstructing Prompts\r\nIn this chapter, you'll lea..."}
{'id': 'demo-conn-gdrive-6bfrp6_12', 'text': "\ufeffUse Case Patterns\r\nIn this chapter, you'll learn ..."}
{'id': 'demo-conn-gdrive-6bfrp6_13', 'text': "\ufeffEvaluating Outputs\r\nIn this chapter, you'll learn..."}
{'id': 'demo-conn-gdrive-6bfrp6_14', 'text': "\ufeffValidating Outputs\r\nIn this chapter, you'll learn..."}

----------------------------------------------------------------------------------------------------

User: tell me more

Chatbot:
Chain of thought prompting is a technique that guides LLMs to follow a reasoning process by providing them with a few examples that clearly outline each step of the reasoning. This method, also known as few-shot prompting, is employed for complex problems that require a series of reasoning steps to solve. 

The LLM is expected to study the example and follow a similar pattern when answering, breaking down the problem into smaller, more manageable parts. This approach not only improves the LLM's performance on complex tasks but also offers interpretability into its thought process.

Few-shot prompting is distinct from zero-shot prompting, where the LLM is only given the problem and no examples. Zero-shot chain-of-thought prompting, however, involves adding a phrase like "Let's think step by step" to the original prompt to guide the LLM's reasoning. 

Chain of thought prompting has shown remarkable effectiveness in improving LLMs' abilities in arithmetic, commonsense, and symbolic reasoning tasks. Nevertheless, it is not without its limitations. For instance, it works best with larger models, typically those with around 100 billion parameters, as smaller models often produce illogical thought chains.

CITATIONS:
start=47 end=58 text='guides LLMs' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=71 end=88 text='reasoning process' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=114 end=126 text='few examples' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=140 end=175 text='outline each step of the reasoning.' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'demo-conn-gdrive-6bfrp6_11']
start=203 end=221 text='few-shot prompting' document_ids=['web-search_2', 'web-search_3', 'web-search_4', 'demo-conn-gdrive-6bfrp6_11']
start=239 end=255 text='complex problems' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=271 end=296 text='series of reasoning steps' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=363 end=378 text='similar pattern' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=426 end=457 text='smaller, more manageable parts.' document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=481 end=511 text="improves the LLM's performance" document_ids=['web-search_0', 'web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9', 'demo-conn-gdrive-6bfrp6_11']
start=545 end=561 text='interpretability' document_ids=['web-search_2', 'web-search_4', 'web-search_5', 'web-search_6']
start=625 end=644 text='zero-shot prompting' document_ids=['web-search_4', 'demo-conn-gdrive-6bfrp6_11']
start=703 end=739 text='Zero-shot chain-of-thought prompting' document_ids=['web-search_1', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'demo-conn-gdrive-6bfrp6_11']
start=780 end=806 text='"Let\'s think step by step"' document_ids=['web-search_1', 'web-search_3', 'web-search_4', 'web-search_6', 'demo-conn-gdrive-6bfrp6_11']
start=833 end=859 text="guide the LLM's reasoning." document_ids=['web-search_1', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'demo-conn-gdrive-6bfrp6_11']
start=956 end=966 text='arithmetic' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=968 end=979 text='commonsense' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=985 end=1010 text='symbolic reasoning tasks.' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_5', 'web-search_6', 'web-search_7', 'web-search_9']
start=1093 end=1106 text='larger models' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_6', 'web-search_7', 'web-search_9']
start=1136 end=1158 text='100 billion parameters' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_6', 'web-search_7', 'web-search_9']
start=1163 end=1217 text='smaller models often produce illogical thought chains.' document_ids=['web-search_0', 'web-search_2', 'web-search_3', 'web-search_4', 'web-search_6', 'web-search_7']

DOCUMENTS:
{'id': 'web-search_0', 'text': 'Skip to main content\n\nWe gratefully acknowledge su...'}
{'id': 'web-search_1', 'text': 'General Tips for Designing Prompts\n\nChain-of-Thoug...'}
{'id': 'web-search_2', 'text': 'BlogDocsCommunityHackAPrompt Playground\n\nLanguage ...'}
{'id': 'web-search_3', 'text': 'We now support using Microsoft Azure hosted OpenAI...'}
{'id': 'web-search_4', 'text': 'Comprehensive Guide to Chain-of-Thought Prompting\n...'}
{'id': 'web-search_5', 'text': 'ResourcesArticleChain-of-Thought Prompting: Helpin...'}
{'id': 'web-search_6', 'text': 'Let’s Think Step by Step: Advanced Reasoning in Bu...'}
{'id': 'web-search_7', 'text': 'Unraveling the Power of Chain-of-Thought Prompting...'}
{'id': 'web-search_8', 'text': 'AboutPressCopyrightContact usCreatorsAdvertiseDeve...'}
{'id': 'web-search_9', 'text': 'Skip to main content\n\nLanguage Models Perform Reas...'}
{'id': 'demo-conn-gdrive-6bfrp6_10', 'text': "\ufeffChaining Prompts\r\nIn this chapter, you'll learn a..."}
{'id': 'demo-conn-gdrive-6bfrp6_11', 'text': "\ufeffConstructing Prompts\r\nIn this chapter, you'll lea..."}
{'id': 'demo-conn-gdrive-6bfrp6_12', 'text': "\ufeffUse Case Patterns\r\nIn this chapter, you'll learn ..."}
{'id': 'demo-conn-gdrive-6bfrp6_13', 'text': "\ufeffEvaluating Outputs\r\nIn this chapter, you'll learn..."}
{'id': 'demo-conn-gdrive-6bfrp6_14', 'text': "\ufeffValidating Outputs\r\nIn this chapter, you'll learn..."}

----------------------------------------------------------------------------------------------------

User: quit
Ending chat.


Handling Long Documents and Large Volumes of Documents

With all these documents coming from various connectors, you may be asking a couple of questions:

  • How do we handle long documents? Connecting to multiple connectors means dealing with various APIs, each with its own way of providing documents. Some may return a complete document spanning tens or hundreds of pages. This creates a couple of problems. First, stuffing a long document into an LLM prompt can exceed the model’s context limit, resulting in an error. Second, even if the context limit is not reached, the response will likely suffer, because the prompt contains a lot of irrelevant information from the long document instead of the specific chunks that are most relevant.
  • How do we handle multiple documents coming from multiple connectors and queries? For a single connector, the retrieval and reranking implementation is within the developer’s control. With multiple connectors, it is not, because the documents are aggregated at the Chat endpoint. As the number of connectors increases, this becomes a bigger problem: we don’t control the relevancy of the documents sent to the LLM prompt, and the risk of exceeding the context limit remains. Furthermore, if more than one query is generated, the number of retrieved documents multiplies accordingly.

The Chat endpoint solves these problems with its automated chunking and reranking process. Let’s see how it’s done.

Note that for this to happen, the prompt_truncation parameter must be set to AUTO (the default) and not OFF.
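For example, a minimal call with the parameter shown explicitly (using the same connector IDs as above) might look like this:

response = co.chat_stream(
    message="What is chain of thought prompting?",
    model="command-r",
    connectors=[ChatConnector(id="demo-conn-gdrive-6bfrp6"), ChatConnector(id="web-search")],
    prompt_truncation="AUTO",  # the default; "OFF" disables the automated handling described below
)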

Chunking

With every document sent by the connectors, the first step is to split it into smaller chunks. Each chunk is between 100 and 400 words, and sentences are kept intact where possible.

Chunking the retrieved documents

Going back to the example responses, notice that some document IDs take a form like web-search_5:2. This contains not just the document ID (web-search_5 in this example) but also a chunk number separated by a colon (2 in this example). If we concatenate web-search_5:0, web-search_5:1, web-search_5:2, and so on, we get back the original document.
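The endpoint performs this chunking internally, and its exact logic is not public, but to make the idea concrete, here is a minimal, illustrative sketch: a naive regex sentence splitter that packs whole sentences into chunks of at most 400 words (the real implementation also enforces a minimum chunk size).

import re

def chunk_document(text: str, max_words: int = 400) -> list:
    # Split on sentence boundaries, then pack whole sentences into chunks
    # of at most max_words words, keeping sentences intact.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, count = [], [], 0
    for sentence in sentences:
        words = len(sentence.split())
        if current and count + words > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += words
    if current:
        chunks.append(" ".join(current))
    return chunks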

Reranking

The Chat endpoint then uses the Rerank endpoint to take all the chunked documents from all connectors and rerank them based on contextual relevance to the query.

Reranking the chunked documents

Reranking happens independently for each query and each connector. For example, let’s say a user asks the question, “What is AI and how can enterprises use it?”, resulting in two queries generated by the endpoint: “What is AI?” and “How can enterprises use AI?” Also, let’s assume there are two connectors: “web search” and “notion.”

This means there will be four lists of chunked documents (two queries times two connectors), each reranked separately.
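In code terms, the query-connector pairs look like this (the connector names here are illustrative):

from itertools import product

queries = ["What is AI?", "How can enterprises use AI?"]
connector_ids = ["web-search", "notion"]

# Each query-connector pair yields its own list of chunks, reranked separately
for query, connector_id in product(queries, connector_ids):
    print(f"Rerank the '{connector_id}' chunks against: {query}")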

The reranking step takes the top 20 chunks from each list and drops the rest.
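This reranking happens inside the Chat endpoint, but the same Rerank endpoint is also available directly. A minimal sketch for one query-connector pair (assuming the rerank-english-v3.0 model and a list of chunk strings) could look like this:

chunks = [
    "AI, or artificial intelligence, is the simulation of human intelligence by machines...",
    "Enterprises apply AI to automate workflows, analyze data, and assist employees...",
]

results = co.rerank(
    model="rerank-english-v3.0",
    query="What is AI?",
    documents=chunks,
    top_n=20,  # keep at most the top 20 chunks, drop the rest
)

# Reorder the chunks by contextual relevance to the query
top_chunks = [chunks[r.index] for r in results.results]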

Interleaving

The reranked documents from the different lists are then interleaved into one list.

Interleaving the reranked chunks

With our example above, let’s say that these are the four lists of reranked documents:

  • Web search results (“What is AI”): web_ai_1, web_ai_2, web_ai_3
  • Notion search results (“What is AI”): notion_ai_1, notion_ai_2, notion_ai_3
  • Web search results (“How can enterprises use AI”): web_enterprise_1, web_enterprise_2, web_enterprise_3
  • Notion search results (“How can enterprises use AI”): notion_enterprise_1, notion_enterprise_2, notion_enterprise_3

The documents will be interleaved in a list in this order:

  • Documents: web_ai_1, notion_ai_1, web_enterprise_1, notion_enterprise_1, web_ai_2, notion_ai_2, web_enterprise_2, notion_enterprise_2, web_ai_3, notion_ai_3, web_enterprise_3, notion_enterprise_3

This list is what gets sent to the LLM prompt.
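The endpoint does this interleaving internally, but the round-robin order is easy to reproduce. Here is a small sketch using the lists above:

from itertools import chain, zip_longest

def interleave(*ranked_lists):
    # Round-robin merge: the top chunk of each list first, then the second of each, and so on
    return [doc for doc in chain.from_iterable(zip_longest(*ranked_lists)) if doc is not None]

web_ai = ["web_ai_1", "web_ai_2", "web_ai_3"]
notion_ai = ["notion_ai_1", "notion_ai_2", "notion_ai_3"]
web_enterprise = ["web_enterprise_1", "web_enterprise_2", "web_enterprise_3"]
notion_enterprise = ["notion_enterprise_1", "notion_enterprise_2", "notion_enterprise_3"]

print(interleave(web_ai, notion_ai, web_enterprise, notion_enterprise))
# ['web_ai_1', 'notion_ai_1', 'web_enterprise_1', 'notion_enterprise_1', 'web_ai_2', ...]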

Prompt Building

With the prompt_truncation parameter set to AUTO, some elements from chat_history and documents will be dropped in an attempt to construct a prompt that fits within the model's context length limit.

Documents and chat history are added iteratively until the prompt would exceed the model's context length. The resulting prompt is then passed to the Command model for response generation.
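The endpoint handles this internally; the following rough sketch only illustrates the idea, using word count as a crude stand-in for the model's actual tokenizer:

def word_count(items):
    # Crude proxy for a real tokenizer: count whitespace-separated words
    return sum(len(str(item).split()) for item in items)

def select_prompt_items(chat_history, documents, budget=3500):
    # Add chat history and documents one by one until the budget would be exceeded
    included = []
    for item in chat_history + documents:
        if word_count(included + [item]) > budget:
            break
        included.append(item)
    return included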

Conclusion

In this chapter, you learned how to use connectors at scale. The Chat endpoint allows you to define multiple connectors in a single call and aggregates the documents retrieved from them. You can also leverage the automated handling of long documents and large volumes of documents, where the endpoint takes care of chunking, reranking, and interleaving, as well as prompt building.


About Cohere’s LLM University

Our comprehensive curriculum aims to equip you with the skills to develop your own AI applications. We cater to learners from all backgrounds, covering everything from the basics to the most advanced topics in large language models (LLMs). Plus, you'll have the opportunity to work on hands-on exercises, allowing you to build and deploy your very own solutions. Take a course today.

This LLMU module consists of the following chapters:

  1. Introduction to RAG
  2. RAG with Chat, Embed, and Rerank
  3. RAG with Connectors
  4. RAG with Quickstart Connectors
  5. RAG over Large-Scale Data (this chapter)