How an LLM Chatbot Works: Exploring Chat with Retrieval-Augmented Generation (RAG)

Part 1 of the LLM University module on Chat with Retrieval-Augmented Generation.

This learning module is part of Cohere’s LLM University. We offer a comprehensive curriculum to give you a rock-solid foundation in large language models. To learn more, see the full course.

Chatbots brought large language models (LLMs) into the mainstream. LLMs have been around for a few years, but their adoption was largely limited to the AI community. The launch of AI-powered consumer chatbots has made LLMs accessible to the everyday user, and now they're a hot topic in tech and enterprise circles alike.

This module teaches you how to build LLM chatbots using Cohere’s Chat endpoint.

You’ll learn how to:

  • Build chatbots that are powered by the text generation capabilities of LLMs
  • Equip chatbots with conversational memory to make them context-aware
  • Connect to external data sources via the RAG approach
  • Employ RAG for timely, more accurate chatbot responses
  • Build a custom model for your chatbot, finetuned to your data

In this first chapter of the module, we’ll begin by understanding the generative component of the Chat endpoint. Then we’ll look at how retrieval-augmented generation (RAG) makes it possible to connect a chatbot to external data like the web or a company’s own information.

This LLM University module on Chat with Retrieval-Augmented Generation (RAG) consists of the following chapters:

  1. Foundations of Chat and RAG (this chapter)
  2. Using the Chat endpoint
  3. Using the Chat endpoint with RAG in document mode
  4. Using the Chat endpoint with RAG in connector mode
  5. Creating custom models for the Chat endpoint

How Does an LLM Chatbot Work?

To understand how LLM chatbots work, it’s important to understand their building blocks. This section focuses on the generative part of a chatbot: how a foundational model, with added layers of context, can generate answers in a conversational style.

The Foundation of an LLM Chatbot

The foundation of an LLM chatbot is a baseline LLM that can generate a response given a prompt or message from a user. This type of model is tuned to follow instructions and answer questions, such as “Write a headline for my homemade jewelry product” or “What is the capital of Canada?”.

A message or prompt returning a model response
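
As a concrete illustration, here is a minimal single-turn sketch using Cohere’s Python SDK, with the Generate endpoint standing in for a baseline LLM. The method and attribute names reflect the SDK at the time of writing and may differ in newer versions; the API key is a placeholder.

```python
# Minimal sketch: a baseline LLM responding to a single prompt.
# Method and attribute names may differ depending on the SDK version.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

response = co.generate(prompt="What is the capital of Canada?")
print(response.generations[0].text)
```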

However, a baseline LLM’s context is limited to the last message it receives; it has no memory of any previous messages and responses.

Chatbots, in contrast, are characterized by their ability to maintain a conversation with a user over multiple interactions.

A baseline LLM’s context is limited to only the last message it receives

A chatbot solves this problem by linking a sequence of interactions into a single, ongoing conversation. In doing so, the model’s responses can draw on a memory of all the previous interactions instead of starting from scratch every time.

How to Build a Chatbot's Memory

Working off of the baseline generation model above, we can layer together multiple interactions into a single prompt and create a memory of the entire conversation.

First, we add a system-level prompt called a preamble. A preamble contains instructions to help steer a chatbot’s response toward specific characteristics, such as a persona, style, or format. For example, if we want the chatbot to adopt a formal style, the preamble can be used to encourage the generation of more business-like and professional responses. The preamble could be something like "You are a helpful chatbot, trained to assist human users by providing responses in a formal and professional tone."

Then, we append the current user message to the preamble, which becomes the prompt for the chatbot’s response. Next, we append the chatbot response and the following user message to the prompt.

We can repeat this step for any number of interactions until we reach the model’s maximum context length. Context length is the total number of tokens taken up by the prompt and response, and each model has a maximum context length that it can support.

Building a conversation by stitching multiple prompt-response pairs together
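
To make this concrete, here is an illustrative sketch of how a preamble and prior prompt-response pairs can be layered into a single prompt. This is a simplified format chosen for illustration, not the exact formatting used internally by any particular model.

```python
# Illustrative sketch: stitching a preamble and prior turns into one prompt.
# The "User:"/"Chatbot:" format is a simplification for illustration only.
preamble = (
    "You are a helpful chatbot, trained to assist human users by providing "
    "responses in a formal and professional tone."
)

# Conversation memory as (user message, chatbot response) pairs
history = [
    ("Can you help me write a product headline?",
     "Certainly. Could you tell me more about the product?"),
]

def build_prompt(preamble, history, new_user_message):
    """Layer the preamble, previous turns, and the new message into one prompt."""
    lines = [preamble]
    for user_msg, bot_msg in history:
        lines.append(f"User: {user_msg}")
        lines.append(f"Chatbot: {bot_msg}")
    lines.append(f"User: {new_user_message}")
    lines.append("Chatbot:")
    return "\n".join(lines)

prompt = build_prompt(preamble, history, "It's a handmade silver necklace.")
print(prompt)
```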

This multi-turn framework is what gives chatbots the ability to hold the full context of a conversation from start to finish.

Multi-turn conversations can happen when the full context is available

However, building on top of a baseline LLM alone is not sufficient.

Chatbots need to perform well in a wide range of scenarios. To create a robust chatbot that consistently generates high-quality and reliable output, the baseline LLM needs to be adapted specifically to conversations. This means taking the baseline model and finetuning it further with a large volume of conversational data.

This is what forms the foundation of Cohere’s Chat endpoint — let’s take a closer look.

Cohere's Chat Endpoint

Improving LLM chatbot performance starts with how the baseline LLM is trained. The model powering the Chat endpoint is Cohere’s Command model, trained with a large volume of multi-turn conversations. This ensures that the model will excel at the various nuances associated with conversational language and perform well across different use cases.

Beyond training, adapting a baseline LLM for conversations requires a standardized interface on top of the prompt formatting system. The Chat endpoint provides a consistent, simplified, and structured way of handling prompt formatting, defining how the prompt inputs should be organized and making it easier for developers to build chatbot applications. This added layer includes a fixed abstraction and schema, providing more stability for scaling and building applications on top of the foundation model.

The Chat endpoint includes all the elements required for an LLM chatbot (as discussed in the previous sections), exposing a simple interface for developers. It consists of the following key components:

  • Preamble management: Developers can opt to use the endpoint’s default preamble or override it with their own preambles.
  • Multi-turn conversations: The Chat endpoint builds upon the Command model by enabling multi-turn conversations.
  • State management: State management preserves the conversation memory. Developers can either leverage the endpoint’s conversation history persistence feature or manage the conversation history themselves.
  • Fully-managed conversation: The abstraction layer of the Chat endpoint means there’s only one item to send to the API: the user message. Everything else is managed automatically. At the same time, developers who want greater control over a chatbot’s configuration can still take it.
The Chat endpoint takes care of the underlying logic, exposing a simple interface for developers
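
Here is a sketch of what a multi-turn call to the Chat endpoint can look like with a custom preamble and developer-managed conversation history. Exact parameter names (for example, preamble vs. preamble_override, and the role labels in the history) vary across SDK versions, so treat this as illustrative rather than definitive.

```python
# Sketch: a multi-turn Chat endpoint call with a custom preamble and
# developer-managed history. Parameter names may vary by SDK version.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

chat_history = [
    {"role": "USER", "message": "Hello, can you help me draft an email?"},
    {"role": "CHATBOT", "message": "Of course. Who is the email addressed to?"},
]

response = co.chat(
    message="It's for my team, announcing a new project kickoff.",
    chat_history=chat_history,  # developer-managed conversation memory
    preamble="You are a helpful chatbot with a formal and professional tone.",
)
print(response.text)
```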

Adding RAG to Your Chatbot

Now that we understand how an LLM chatbot works, how can we make it more resilient and useful? Retrieval-augmented generation (RAG) provides a whole new capability for enterprise-level chatbots. With secure deployment options, chatbots can now be used with a company’s own information and knowledge.

Deliver fresh, comprehensive answers 

So far, we’ve explored LLM chatbots that only have access to the data they were trained on, referred to as internal knowledge. In many applications, particularly enterprise use cases, a chatbot also needs access to external knowledge to be useful.

Connecting the Chat endpoint with external knowledge

Suppose a company wants to deploy a chatbot as an intelligent knowledge assistant. For the chatbot to be useful, it will need to be connected to the company’s knowledge base. This allows the chatbot to have the correct context when responding to requests, such as summarizing a meeting transcript, extracting information from the company wiki, and assisting a customer support agent in responding to a customer inquiry. Without access to the company’s knowledge base, the chatbot will not be able to perform these types of tasks successfully.

RAG solves the lack of specific knowledge problem

The company will also likely need the chatbot to respond to time-sensitive prompts and provide up-to-date answers. For example, suppose an employee asks the chatbot about a recent public event. A baseline LLM is trained with data that is current up to a certain cut-off time. Without accessing external data, the model relies on the most recent information it has been trained on (assuming the specific information is available in the training data). In this situation, the lack of recency in the training data would produce an inadequate answer.

RAG solves the lack of recency problem

Cohere’s Chat endpoint comes with a RAG feature that makes it possible to connect to external knowledge bases and deliver more accurate responses. RAG consists of two parts: a retrieval system and an augmented generation system. Let’s take a look at how they work.

Retrieve External Knowledge

The first part of RAG is to retrieve the right information needed to respond to a user query. Given a user message (1), the Chat endpoint sends the relevant queries to an external knowledge base (2) and retrieves the query results (3).

The retrieval part of RAG: Given a user message, the endpoint retrieves information from an external knowledge base

One example use case is performing a web search. Suppose we input this query in the chatbot: “Who was the keynote speaker at the AI conference last week?”.

Given that the response requires a fact from a recent event, the chatbot triggers a retrieval of this information using a web search API. It sends the query to the web search API and gets back the information it requires, for example, a few website snippets containing the details about the conference.

An example of retrieving information via web search

Another widely implemented use case is semantic search. Given a user query, the system’s task is to find the top documents, or portions of documents, that are the most similar to the query.

Before retrieval can happen, these documents first need to be ingested. Typically, this involves splitting large documents into smaller chunks, turning these chunks into text embeddings, and storing these embeddings in a database.

An example of retrieving information via semantic search
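
As a simplified sketch of these ingestion and retrieval steps, the snippet below embeds a few document chunks with Cohere’s Embed endpoint and ranks them against a query by cosine similarity. In practice the embeddings would be stored in a vector database; the example chunks are invented, and depending on the SDK and model version you may also need to pass a model name and input type.

```python
# Simplified sketch of semantic search: embed chunks, embed the query,
# then rank chunks by cosine similarity. Example chunks are invented.
import cohere
import numpy as np

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

# 1. Ingest: split documents into chunks and embed them
chunks = [
    "The quarterly all-hands meeting is scheduled for March 12.",
    "Expense reports must be submitted through the finance portal.",
    "Dr. Jane Doe delivered the keynote at last week's AI conference.",  # hypothetical
]
chunk_embeddings = np.array(co.embed(texts=chunks).embeddings)

# 2. Retrieve: embed the query and find the most similar chunk
query = "Who spoke at the AI conference?"
query_embedding = np.array(co.embed(texts=[query]).embeddings[0])

similarities = chunk_embeddings @ query_embedding / (
    np.linalg.norm(chunk_embeddings, axis=1) * np.linalg.norm(query_embedding)
)
print(chunks[int(np.argmax(similarities))])
```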

More generally, retrieval applies to any system that can fetch relevant documents based on the user message. Cohere’s Chat endpoint comes with a connector mode, where developers can leverage pre-built connectors to various data sources and systems or even build their own. Currently, a native web search connector is available for use, with more connectors to come.

Connector mode enables the Chat endpoint to connect with data sources (currently a native web search connector available for use)
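
Here is a sketch of connector mode in use, assuming the web-search connector id and response fields available at the time of writing; check the current documentation for exact names.

```python
# Sketch: connector mode with the built-in web search connector.
# The connector id and response fields reflect the API at the time of writing.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

response = co.chat(
    message="Who was the keynote speaker at the AI conference last week?",
    connectors=[{"id": "web-search"}],  # retrieval handled by the connector
)
print(response.text)
```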

Create Better Responses with Augmented Generation 

The second part of RAG is augmented generation. Here, the prompt is augmented with the information retrieved in the previous step. The prompt is now grounded in the best available information, enabling the chatbot to provide the user with an accurate and helpful response.

The chatbot responds to the user query, now having the augmented prompt as its context (4).

The augmented generation part of RAG: The Chat endpoint uses the retrieved information to provide a grounded response
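
Below is a sketch of this augmented generation step in document mode, where retrieved snippets are passed to the Chat endpoint through a documents parameter. The field names and document contents shown here are illustrative.

```python
# Sketch: document mode, where retrieved snippets ground the response.
# Document fields are free-form strings; the contents below are invented.
import cohere

co = cohere.Client("YOUR_API_KEY")  # placeholder API key

documents = [
    {
        "title": "Conference recap",  # hypothetical document
        "snippet": "Dr. Jane Doe delivered the keynote at last week's AI conference.",
    },
]

response = co.chat(
    message="Who was the keynote speaker at the AI conference last week?",
    documents=documents,  # retrieved information augments the prompt
)
print(response.text)
```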

Cohere’s Chat endpoint also provides citations to indicate the parts of the retrieved documents on which the response was grounded. Citations provide a critical benefit by delivering the generated content with verifiable references, enhancing the credibility and trustworthiness of the presented information, and allowing users to further explore responses for a deeper understanding.

Citations provide verifiable references to the generated content
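
Continuing the document-mode sketch above, the citations returned alongside a grounded response can be inspected directly. Each citation typically marks a span of the generated text and the documents that support it; the exact fields vary by SDK version, so this simply prints each citation as returned.

```python
# Sketch, continuing from the document-mode example above:
# inspect the citations attached to the grounded response.
for citation in response.citations:
    print(citation)  # spans of the response text and their supporting documents
```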

That's a Wrap

This article provides an introduction to LLM chatbots and Cohere's Chat endpoint.

First, we looked at the generative component of an LLM chatbot using the Chat endpoint and how it evolved from a baseline text generation model to incorporate conversational memory. Then we looked at enhancing LLM chatbots with RAG, a key component of the Chat endpoint that makes it possible to connect the API to external data for augmented generation.

In the coming articles, we’ll see these concepts in action through hands-on examples using Cohere’s API.

Sign up for the Cohere platform and try out the Chat endpoint.

About LLM University

Our comprehensive NLP curriculum aims to equip you with the skills to develop your own AI-powered applications. We cater to learners from all backgrounds, covering everything from the basics to the most advanced topics in large language models (LLMs). Plus, you'll have the opportunity to work on hands-on exercises, allowing you to build and deploy your very own solutions. Take a course today. 
