
Evaluating Conversational Agents

Chatbots are one of the most common LLM applications. This cookbook shows how to evaluate a customer-service chatbot with the current Patronus experiments SDK.

Setup

Install dependencies:

pip install patronus openai

Set environment variables:

export PATRONUS_API_KEY=<YOUR_API_KEY>
export OPENAI_API_KEY=<YOUR_OPENAI_KEY>
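
If you are working in a notebook, you can set the same keys from Python before initializing the clients; both SDKs read these environment variables (placeholders shown):

import os

# Equivalent to the shell exports above; set before calling patronus.init() or OpenAI().
os.environ["PATRONUS_API_KEY"] = "<YOUR_API_KEY>"
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_KEY>"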

Define Evaluation Metrics

For chatbot evaluation, we usually track:

  1. Task performance (helpfulness)
  2. Safety (toxic or harmful output)

In this example we use the following Patronus-managed evaluators; a quick standalone check is sketched after the list:

  • patronus:is-helpful (Judge)
  • patronus:answer-refusal (Judge)
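
Before wiring these evaluators into an experiment, you can sanity-check one on a single input/output pair. This is a minimal sketch that assumes the RemoteEvaluator.evaluate(task_input=..., task_output=...) call shape from the Python SDK; the sample question and answer are illustrative.

from patronus import init
from patronus.evals import RemoteEvaluator

init()  # uses PATRONUS_API_KEY

helpfulness = RemoteEvaluator("judge", "patronus:is-helpful")

# Assumed call shape: evaluate(task_input=..., task_output=...); check the SDK reference if yours differs.
result = helpfulness.evaluate(
    task_input="How do I track my shipment?",
    task_output="You can track your order from the Orders page using the tracking number in your shipping confirmation email.",
)
print(result)  # inspect pass/fail, score, and explanation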

Prepare Datasets

Use the hosted toxic-prompts-en dataset from Patronus for the safety checks and a small local dataset of customer-service queries for the helpfulness checks.

from patronus.datasets import RemoteDatasetLoader
 
# Hosted Patronus dataset of adversarial prompts, used for the safety run
toxic_prompts = RemoteDatasetLoader("toxic-prompts-en")
 
# Local dataset for the helpfulness run; each entry becomes a row with a task_input field
user_queries = [
    {"task_input": "What is the status of my order #12345?"},
    {"task_input": "How can I return a product I purchased?"},
    {"task_input": "What payment methods do you accept?"},
    {"task_input": "Can I change my delivery address after placing an order?"},
    {"task_input": "How do I track my shipment?"},
    {"task_input": "Do you offer gift wrapping services?"},
    {"task_input": "What is your policy on price matching?"},
    {"task_input": "I received a damaged item. What should I do?"},
    {"task_input": "How long does it take to process a refund?"},
    {"task_input": "Are there any discounts available for new customers?"},
]

Define the Chatbot Task

The experiment task is a function that accepts a Row and returns a TaskResult. Wrapping it in a builder lets us fix the model and system prompt once and reuse the same task across experiments.

from openai import OpenAI
from patronus import init
from patronus.datasets import Row
from patronus.experiments.types import TaskResult
 
init()  # uses PATRONUS_API_KEY
oai = OpenAI()
 
 
def build_chat_task(model_name: str, system_prompt: str):
    def chat_task(row: Row, **kwargs) -> TaskResult:
        response = oai.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": row.task_input},
            ],
            temperature=0,
        )
 
        output = response.choices[0].message.content
 
        metadata = {
            "model": response.model,
            "tokens": {
                "prompt": response.usage.prompt_tokens,
                "completion": response.usage.completion_tokens,
                "total": response.usage.total_tokens,
            },
        }
 
        return TaskResult(
            output=output,
            metadata=metadata,
            tags={"task_type": "chat_completion", "language": "en"},
        )
 
    return chat_task
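
To spot-check the task outside the experiment runner, you can call it directly. Since chat_task only reads row.task_input, a SimpleNamespace works as a lightweight, illustrative stand-in for a Row in this quick check; in a real run, run_experiment builds the rows for you.

from types import SimpleNamespace

demo_task = build_chat_task(
    model_name="gpt-4o",
    system_prompt="You are a helpful customer service assistant.",
)

# Stand-in for Row: the task only accesses row.task_input.
sample_row = SimpleNamespace(task_input="How do I track my shipment?")
result = demo_task(sample_row)

print(result.output)
print(result.metadata["tokens"])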

Run Experiments

from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
 
helpfulness = RemoteEvaluator("judge", "patronus:is-helpful")
answer_refusal = RemoteEvaluator("judge", "patronus:answer-refusal")
 
chat_task = build_chat_task(
    model_name="gpt-4o",
    system_prompt="You are a helpful customer service assistant.",
)
 
safety_experiment = run_experiment(
    dataset=toxic_prompts,
    task=chat_task,
    evaluators=[answer_refusal],
    tags={"dataset_name": "toxic-prompts-en", "model": "gpt-4o"},
    project_name="Cookbooks",
    experiment_name="Conversational Agent Safety",
)
 
helpfulness_experiment = run_experiment(
    dataset=user_queries,
    task=chat_task,
    evaluators=[helpfulness],
    tags={"dataset_name": "user-queries", "model": "gpt-4o"},
    project_name="Cookbooks",
    experiment_name="Conversational Agent Helpfulness",
)
 
print(safety_experiment.summary())
print(helpfulness_experiment.summary())

You can compare both runs in the Experiments UI, then iterate on the prompt, model, and temperature to improve the safety/helpfulness tradeoff; one such iteration is sketched below.
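
For example, to try a stricter system prompt against the same safety dataset, build a second task variant with build_chat_task and re-run the experiment under a distinguishing tag. The revised prompt wording here is illustrative.

strict_chat_task = build_chat_task(
    model_name="gpt-4o",
    system_prompt=(
        "You are a helpful customer service assistant. "
        "Politely refuse requests that are abusive, unsafe, or unrelated to customer support."
    ),
)

safety_experiment_v2 = run_experiment(
    dataset=toxic_prompts,
    task=strict_chat_task,
    evaluators=[answer_refusal],
    tags={"dataset_name": "toxic-prompts-en", "model": "gpt-4o", "prompt_version": "v2"},
    project_name="Cookbooks",
    experiment_name="Conversational Agent Safety v2",
)

print(safety_experiment_v2.summary())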
