Evaluating Conversational Agents
Chatbots are one of the most common LLM applications. This cookbook shows how to evaluate a customer-service chatbot with the current Patronus experiments SDK.
Setup
Install dependencies:
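A typical install for this cookbook (the `openai` package is only needed because the example chatbot below is backed by the OpenAI API; swap in your own model client if you use something else):

```bash
pip install patronus openai
```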
Set environment variables:
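For example (the OpenAI key is only required if you keep the OpenAI-backed chatbot used in this example):

```bash
export PATRONUS_API_KEY="your-patronus-api-key"
export OPENAI_API_KEY="your-openai-api-key"
```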
Define Evaluation Metrics
For chatbot evaluation, we usually track:
- Task performance (helpfulness)
- Safety (toxic or harmful output)
In this example we use:
- patronus:is-helpful (Judge)
- patronus:answer-refusal (Judge)
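A minimal sketch of wiring these up; it assumes the current SDK exposes RemoteEvaluator under patronus.evals and that patronus.init() reads PATRONUS_API_KEY from the environment. Adjust names to match your SDK version.

```python
import patronus
from patronus.evals import RemoteEvaluator

# Initialize the SDK (picks up PATRONUS_API_KEY from the environment)
patronus.init()

# Judge-based evaluators, referenced by evaluator family and criteria name
is_helpful = RemoteEvaluator("judge", "patronus:is-helpful")
answer_refusal = RemoteEvaluator("judge", "patronus:answer-refusal")
```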
Prepare Datasets
Use a hosted Patronus dataset for safety checks, and a local dataset for helpfulness checks.
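One way to prepare both, sketched below. The loader class and the hosted dataset name are illustrative; point the safety run at a safety dataset available in your Patronus account. The local dataset is an in-memory list of records keyed by task_input.

```python
from patronus.datasets import RemoteDatasetLoader

# Hosted Patronus dataset for safety checks (dataset name is illustrative)
safety_dataset = RemoteDatasetLoader("my-safety-prompts")

# Local dataset for helpfulness checks: each record supplies task_input
helpfulness_dataset = [
    {"task_input": "How do I reset my password?"},
    {"task_input": "My order arrived damaged. What are my options?"},
    {"task_input": "Can I change the shipping address after checkout?"},
]
```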
Define the Chatbot Task
Use a task function that accepts a Row and returns a TaskResult.
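A sketch of such a task function, assuming an OpenAI-backed chatbot and that Row and TaskResult are importable from patronus.experiments; the model name and system prompt are placeholders for your own chatbot configuration.

```python
from openai import OpenAI
from patronus.experiments import Row, TaskResult

oai_client = OpenAI()  # uses OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "You are a concise, friendly customer-service assistant."

def customer_service_task(row: Row, **kwargs) -> TaskResult:
    """Send the dataset question to the chatbot and return its reply."""
    response = oai_client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": row.task_input},
        ],
        temperature=0.2,
    )
    return TaskResult(output=response.choices[0].message.content)
```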
Run Experiments
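A sketch of the two runs, assuming run_experiment is exposed from patronus.experiments and accepts the keyword arguments shown (argument names such as experiment_name may differ slightly across SDK versions):

```python
from patronus.experiments import run_experiment

# Safety run: check whether the bot refuses harmful or out-of-scope requests
run_experiment(
    dataset=safety_dataset,
    task=customer_service_task,
    evaluators=[answer_refusal],
    experiment_name="customer-service-safety",
)

# Helpfulness run: judge answer quality on routine support questions
run_experiment(
    dataset=helpfulness_dataset,
    task=customer_service_task,
    evaluators=[is_helpful],
    experiment_name="customer-service-helpfulness",
)
```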
You can compare the two runs in the Experiments UI, then iterate on the prompt, model, and temperature to improve the safety/helpfulness trade-off.
