
Evaluating Conversational Agents

Chatbots are one of the most common LLM applications. This cookbook shows how to evaluate a customer-service chatbot with the current Patronus experiments SDK.

Setup

Install dependencies:

pip install patronus openai

Set environment variables:

export PATRONUS_API_KEY=<YOUR_API_KEY>
export OPENAI_API_KEY=<YOUR_OPENAI_KEY>
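
If you are working in a notebook, you can set the same keys from Python before initializing the clients; both SDKs read these environment variables (placeholders shown):

import os

# Equivalent to the shell exports above; set before calling patronus.init() or OpenAI().
os.environ["PATRONUS_API_KEY"] = "<YOUR_API_KEY>"
os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_KEY>"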

Define Evaluation Metrics

For chatbot evaluation, we usually track:

  1. Task performance (helpfulness)
  2. Safety (toxic or harmful output)

In this example we use the following Patronus-managed evaluators; a quick standalone check is sketched after the list:

  • patronus:is-helpful (Judge)
  • patronus:answer-refusal (Judge)
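
Before wiring these evaluators into an experiment, you can sanity-check one on a single input/output pair. This is a minimal sketch that assumes the RemoteEvaluator.evaluate(task_input=..., task_output=...) call shape from the Python SDK; the sample question and answer are illustrative.

from patronus import init
from patronus.evals import RemoteEvaluator

init()  # uses PATRONUS_API_KEY

helpfulness = RemoteEvaluator("judge", "patronus:is-helpful")

# Assumed call shape: evaluate(task_input=..., task_output=...); check the SDK reference if yours differs.
result = helpfulness.evaluate(
    task_input="How do I track my shipment?",
    task_output="You can track your order from the Orders page using the tracking number in your shipping confirmation email.",
)
print(result)  # inspect pass/fail, score, and explanation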

Prepare Datasets

Use the hosted toxic-prompts-en dataset from Patronus for the safety checks and a small local dataset of customer-service queries for the helpfulness checks.

from patronus.datasets import RemoteDatasetLoader
 
# Hosted Patronus dataset of adversarial prompts, used for the safety run
toxic_prompts = RemoteDatasetLoader("toxic-prompts-en")
 
# Local dataset for the helpfulness run; each entry becomes a row with a task_input field
user_queries = [
    {"task_input": "What is the status of my order #12345?"},
    {"task_input": "How can I return a product I purchased?"},
    {"task_input": "What payment methods do you accept?"},
    {"task_input": "Can I change my delivery address after placing an order?"},
    {"task_input": "How do I track my shipment?"},
    {"task_input": "Do you offer gift wrapping services?"},
    {"task_input": "What is your policy on price matching?"},
    {"task_input": "I received a damaged item. What should I do?"},
    {"task_input": "How long does it take to process a refund?"},
    {"task_input": "Are there any discounts available for new customers?"},
]

Define the Chatbot Task

The experiment task is a function that accepts a Row and returns a TaskResult. Wrapping it in a builder lets us fix the model and system prompt once and reuse the same task across experiments.

from openai import OpenAI
from patronus import init
from patronus.datasets import Row
from patronus.experiments.types import TaskResult
 
init()  # uses PATRONUS_API_KEY
oai = OpenAI()
 
 
def build_chat_task(model_name: str, system_prompt: str):
    def chat_task(row: Row, **kwargs) -> TaskResult:
        response = oai.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": row.task_input},
            ],
            temperature=0,
        )
 
        output = response.choices[0].message.content
 
        metadata = {
            "model": response.model,
            "tokens": {
                "prompt": response.usage.prompt_tokens,
                "completion": response.usage.completion_tokens,
                "total": response.usage.total_tokens,
            },
        }
 
        return TaskResult(
            output=output,
            metadata=metadata,
            tags={"task_type": "chat_completion", "language": "en"},
        )
 
    return chat_task
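
To spot-check the task outside the experiment runner, you can call it directly. Since chat_task only reads row.task_input, a SimpleNamespace works as a lightweight, illustrative stand-in for a Row in this quick check; in a real run, run_experiment builds the rows for you.

from types import SimpleNamespace

demo_task = build_chat_task(
    model_name="gpt-4o",
    system_prompt="You are a helpful customer service assistant.",
)

# Stand-in for Row: the task only accesses row.task_input.
sample_row = SimpleNamespace(task_input="How do I track my shipment?")
result = demo_task(sample_row)

print(result.output)
print(result.metadata["tokens"])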

Run Experiments

from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
 
helpfulness = RemoteEvaluator("judge", "patronus:is-helpful")
answer_refusal = RemoteEvaluator("judge", "patronus:answer-refusal")
 
chat_task = build_chat_task(
    model_name="gpt-4o",
    system_prompt="You are a helpful customer service assistant.",
)
 
safety_experiment = run_experiment(
    dataset=toxic_prompts,
    task=chat_task,
    evaluators=[answer_refusal],
    tags={"dataset_name": "toxic-prompts-en", "model": "gpt-4o"},
    project_name="Cookbooks",
    experiment_name="Conversational Agent Safety",
)
 
helpfulness_experiment = run_experiment(
    dataset=user_queries,
    task=chat_task,
    evaluators=[helpfulness],
    tags={"dataset_name": "user-queries", "model": "gpt-4o"},
    project_name="Cookbooks",
    experiment_name="Conversational Agent Helpfulness",
)
 
print(safety_experiment.summary())
print(helpfulness_experiment.summary())

You can compare both runs in the Experiments UI, then iterate on the prompt, model, and temperature to improve the safety/helpfulness tradeoff; one such iteration is sketched below.
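
For example, to try a stricter system prompt against the same safety dataset, build a second task variant with build_chat_task and re-run the experiment under a distinguishing tag. The revised prompt wording here is illustrative.

strict_chat_task = build_chat_task(
    model_name="gpt-4o",
    system_prompt=(
        "You are a helpful customer service assistant. "
        "Politely refuse requests that are abusive, unsafe, or unrelated to customer support."
    ),
)

safety_experiment_v2 = run_experiment(
    dataset=toxic_prompts,
    task=strict_chat_task,
    evaluators=[answer_refusal],
    tags={"dataset_name": "toxic-prompts-en", "model": "gpt-4o", "prompt_version": "v2"},
    project_name="Cookbooks",
    experiment_name="Conversational Agent Safety v2",
)

print(safety_experiment_v2.summary())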
