
How to Build a Chatbot with the Chat Endpoint
Part 2 of the LLM University module on Chat with Retrieval-Augmented Generation.
This learning module is part of Cohere’s LLM University. We offer a comprehensive curriculum to give you a rock-solid foundation in large language models. To learn more, see the full course.
Understanding how LLM chatbots work (Chapter 1) is the first step to creating your own chatbot, but the real learning takes place when you start building one from scratch. In this second chapter of the Chat with Retrieval-Augmented Generation (RAG) module, you’ll learn how to build a chatbot using Cohere’s Chat endpoint.
The Chat endpoint consists of two main parts: the Chat and RAG components. In this chapter, we’ll focus on the Chat component. We'll then cover RAG in the coming chapters.
By the end of this chapter, you’ll be able to build a simple chatbot that can respond to user messages and maintain the context of the conversation.
We’ll use Cohere’s Python SDK for the code examples. This chapter comes with a Google Colaboratory notebook. Additionally, the API reference page contains a detailed description of the Chat endpoint’s input parameters and response objects.
Contents
- Quickstart
- Defining a Preamble
- Streaming the Chatbot Response
- Building the Chat History
- Other Parameters
- That's a Wrap
Quickstart
To set up, we first import the Cohere module and create a client.
# pip install cohere
import cohere
co = cohere.Client("COHERE_API_KEY") # Replace with your Cohere API key
At its most basic, we only need to pass the user message to the Chat endpoint using the message parameter – the only required parameter for the endpoint.
Here’s an example. We call the endpoint with "Hello" as the user message. The call returns a cohere.Chat object. For now, we’re interested in the main content of the response, which is stored in the text attribute.
response = co.chat(message="Hello")
print(response.text)
# RESPONSE
# Hello! How can I assist you today?
Defining a Preamble
A conversation starts with a system message, or a preamble, to help steer a chatbot’s response toward certain characteristics. For example, if we want the chatbot to adopt a formal style, the preamble can be used to encourage the generation of more business-like and professional responses.
In the quickstart example, we didn’t have to define a preamble because a default one was used. We can, however, define our own preamble using the preamble_override parameter.
Here’s an example. We added a preamble telling the chatbot to assume the persona of an expert public speaking coach. As a result, we get a response that adopts that persona.
response = co.chat(message="Hello",
                   preamble_override="You are an expert public speaking coach")
print(response.text)
# RESPONSE
Hello, I'm here to help you with your public speaking skills. What are you looking to work on today?
Some potential topics we could cover include:
- how to develop effective public speaking skills
- how to overcome common public speaking challenges, such as anxiety and fear
- how to use body language and eye contact to enhance your public speaking performance
- how to prepare and deliver powerful presentations
- how to use storytelling techniques to make your messages more engaging
- how to deal with difficult audience members or questions
- how to recover from public speaking mistakes
Let me know if any of these topics interest you, or if there is something else you would like to work on. I'm here to help you become a more confident and effective public speaker.
Streaming the Chatbot Response
Our examples so far generate responses in a non-streamed manner. This means that the endpoint returns a response object only after the model has generated the text in full. The longer the text, the longer it takes to get the response back. If you are building an application, this directly impacts the user’s perception of the application’s latency.
The Chat endpoint solves this problem by supporting streamed responses. In a streamed response, the endpoint returns a response object for each token as it is generated. This means you can display the text incrementally without having to wait for the full completion.
To activate it, set the stream parameter to True.
In streaming mode, the endpoint generates a series of cohere.StreamTextGeneration objects, as follows:
- The first object contains a field called event_type with the value stream-start.
- This is followed by a sequence of objects where the event_type is text-generation.
- Finally, an object where the event_type is stream-end.
Thus, to get the actual text contents, we take the objects whose event_type is text-generation.
Here’s an example with streamed responses activated.
response = co.chat(message="Hello",
                   preamble_override="You are an expert public speaking coach",
                   stream=True)
for event in response:
    if event.event_type == "text-generation":
        print(event.text, end='')
# RESPONSE (Streamed)
Hello, I'm here to help you with your public speaking skills. What are you looking to work on today?
Some potential topics we could cover include:
- How to build confidence when speaking in front of large groups
- Techniques for effective audience engagement
- Tips for structuring your speeches and presentations
- Strategies for managing public speaking nerves
- Best ways to use visual aids and props
- Guidance on how to adapt to different types of public speaking scenarios (e.g., formal presentations, impromptu speeches, etc.)
- Advice on how to handle difficult questions or audience members
- Tips for practicing and preparing for public speaking engagements
- Strategies for improving your public speaking skills in a sustainable manner
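If you also need the complete reply after streaming – for logging, or for building the chat history as we’ll do below – you can accumulate the text-generation chunks yourself. Here’s a minimal sketch:
# Accumulate the streamed chunks into a single string while displaying them
full_text = ""
response = co.chat(message="Hello", stream=True)
for event in response:
    if event.event_type == "text-generation":
        print(event.text, end='')
        full_text += event.text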
Building the Chat History
At the core of a conversation is a multi-turn dialog between the user and the chatbot. This requires the chatbot to have a “memory” of all the previous turns to maintain the state of the conversation.
Option 1: Using the conversation history persistence feature
The Chat endpoint supports state management by persisting the conversation history. As a conversation progresses, the endpoint continuously updates the conversation history. This means developers don’t have to deal with the complexity and inconvenience of managing conversation history in their application.

To use this feature, use the conversation_id parameter, which is a unique string you assign to a conversation.
Putting everything together, let’s now build a simple chat interface that takes in a user message, generates the chatbot response, automatically updates the conversation history, and repeats these steps until the user quits the conversation.
Here, we use the uuid library to generate a unique conversation_id for each conversation. Additionally, we include the parameter return_chat_history and set it to True. This will allow us to view the conversation history stored by the endpoint.
# Create a conversation ID
import uuid
conversation_id = str(uuid.uuid4())

# Define the preamble
preamble_override = "You are an expert public speaking coach"

print('Starting the chat. Type "quit" to end.\n')

while True:
    # User message
    message = input("User: ")

    # Typing "quit" ends the conversation
    if message.lower() == 'quit':
        print("Ending chat.")
        break

    # Chatbot response
    response = co.chat(message=message,
                       preamble_override=preamble_override,
                       stream=True,
                       conversation_id=conversation_id,
                       return_chat_history=True)

    print("Chatbot: ", end='')
    for event in response:
        if event.event_type == "text-generation":
            print(event.text, end='')
    print("\n")
# RESPONSE (Streamed)
User: Hello
Chatbot: Hello, I'm here to help you with your public speaking skills. What are you looking to work on?
User: I'd like to learn about techniques for effective audience engagement
Chatbot: Audience engagement is key to successful public speaking. It involves connecting with your audience and keeping their attention throughout your speech. Here are some tips to help you improve your audience engagement:
1. Know your audience: Understanding your audience's interests, expectations, and background can help you tailor your speech to their needs and increase their engagement.
2. Set the tone: How you start your speech can set the tone for the rest of your talk. Use a strong opener that captures the audience's attention and makes them want to listen.
3. Use visual aids:
...
User: quit
Ending chat.
The conversation history is stored in the chat_history field of the response object.
import json
data = response.chat_history
print(json.dumps(data, indent=4))
# CHAT HISTORY
[
    {
        "user_name": "User",
        "text": "Hello",
        "message": "Hello",
        "response_id": "b3c9cb27-7bc2-40db-987e-f71c8f8b3c8e",
        "generation_id": "709b4c81-c134-45ed-acf2-81411489ff59",
        "position": 2,
        "active": true,
        "role": "User"
    },
    {
        "user_name": "Chatbot",
        "text": "Hello, I'm here to help you with your public speaking skills. What are you looking to work on?",
        "message": "Hello, I'm here to help you with your public speaking skills. What are you looking to work on?",
        "response_id": "b3c9cb27-7bc2-40db-987e-f71c8f8b3c8e",
        "generation_id": "7ff3dabd-78ea-454e-b1bd-579697a4f2c3",
        "position": 3,
        "active": true,
        "role": "Chatbot"
    },
    {
        "user_name": "User",
        "text": "I'd like to learn about techniques for effective audience engagement",
        "message": "I'd like to learn about techniques for effective audience engagement",
        "response_id": "f0dc8482-d2ac-4ab6-9828-162951aed60d",
        "generation_id": "bc7443cd-58ad-4115-9e51-e3cd66bbdcae",
        "position": 4,
        "active": true,
        "role": "User"
    },
    ...
]
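Because the stored history is keyed to the conversation_id, you can pick the conversation back up later – even after restarting your application – by reusing the same ID. A minimal sketch (the recap itself will vary):
# Resume the conversation later by reusing the same conversation_id
response = co.chat(message="Can you recap what we have discussed so far?",
                   conversation_id=conversation_id)
print(response.text)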
Option 2: Managing the conversation history yourself
If you opt not to use the endpoint’s conversation history persistence feature, you can use the chat_history parameter to manage the conversation history yourself.
The chat history is a list of multiple turns of messages from the user and the chatbot. Each item contains the role, which can be either USER or CHATBOT, and a message field containing the message string. The following is an example of a chat history.
chat_history = [{"role": "USER", "message": "What is 2 + 2"},
                {"role": "CHATBOT", "message": "The answer is 4"},
                {"role": "USER", "message": "Add 5 to that number"},
                {"role": "CHATBOT", "message": "Sure. The answer is 9"}]
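With a history like this in hand, you pass it alongside the new user message, and the model treats the listed turns as prior context. A quick sketch:
# Pass the saved turns as context for the next message
response = co.chat(message="What was the first number I asked about?",
                   chat_history=chat_history)
print(response.text)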
The following modifies the previous implementation by using chat_history instead of conversation_id for managing the conversation history.
# Initialize the chat history
chat_history = []

# Define the preamble
preamble_override = "You are an expert public speaking coach"

print('Starting the chat. Type "quit" to end.\n')

while True:
    # User message
    message = input("User: ")

    # Typing "quit" ends the conversation
    if message.lower() == 'quit':
        print("Ending chat.")
        break

    # Chatbot response
    response = co.chat(message=message,
                       preamble_override=preamble_override,
                       stream=True,
                       chat_history=chat_history)

    chatbot_response = ""
    print("Chatbot: ", end='')
    for event in response:
        if event.event_type == "text-generation":
            print(event.text, end='')
            chatbot_response += event.text
    print("\n")

    # Add to chat history
    chat_history.extend(
        [{"role": "USER", "message": message},
         {"role": "CHATBOT", "message": chatbot_response}]
    )
And with that, we have built a simple chatbot that can respond to user messages and maintain the context of the conversation.
Other Parameters
There are a few more parameters that we have not yet mentioned. These include:
- model: This can be one of the existing Cohere models or the full ID of a fine-tuned custom model. Compatible Cohere models are command and command-light, as well as the experimental command-nightly and command-light-nightly variants. The default is command.
- temperature: This controls the randomness of the output. Lower values make the output more predictable, while higher values make it more random. The default is 0.3, and the range is between 0 and 5.
- max_tokens: This sets the limit on the maximum number of tokens to generate. By default, the value is the maximum number of tokens possible until the maximum context length is reached. The model will automatically stop generating once it reaches a natural stopping point, so in most cases you don’t have to define this parameter unless you want to cap each generation at a specific number of tokens.
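As an illustration, here is a sketch that sets all three parameters in a single call (the values are arbitrary, chosen only to show the syntax):
# Illustrative values only – adjust them to your use case
response = co.chat(message="Give me three tips for opening a speech",
                   model="command",  # or command-light, command-nightly, command-light-nightly
                   temperature=0.3,  # between 0 and 5; lower means more predictable output
                   max_tokens=300)   # optional cap on the number of generated tokens
print(response.text)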
That’s a Wrap
This chapter showed how to build a simple chatbot using the Chat endpoint and how to configure it: overriding the preamble, streaming the response, building the chat history, and adjusting other parameters.
But that’s only one part of the Chat endpoint. If you are looking to build chatbots that can connect to external data and ground their responses with this data, you’ll need the RAG component of the Chat endpoint. We’ll cover this in the next chapter.
Get started by creating a Cohere account now.
About LLM University
Our comprehensive NLP curriculum aims to equip you with the skills to develop your own AI-powered applications. We cater to learners from all backgrounds, covering everything from the basics to the most advanced topics in large language models (LLMs). Plus, you'll have the opportunity to work on hands-on exercises, allowing you to build and deploy your very own solutions. Take a course today.
This LLM University module on Chat with Retrieval-Augmented Generation (RAG) consists of the following chapters:
- Foundations of Chat and RAG
- Using the Chat endpoint (this chapter)
- Using the Chat endpoint with RAG in document mode
- Using the Chat endpoint with RAG in connector mode
- Creating custom models