Experiments

Understanding experiments in Patronus AI

What are experiments?

Experiments help you systematically test and compare different configurations of your LLM application to find what works best. Instead of guessing which prompt, model, or parameter settings will perform best, experiments let you test multiple options side by side and make data-driven decisions.

An experiment runs multiple variations (called "tasks") against the same test dataset, then compares the results using evaluators you choose.
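
Conceptually, the run loop looks something like the sketch below. The SDK handles the real orchestration, concurrency, and logging; this is only an illustration of the idea, and the assumption that evaluators receive the output plus the row is part of the sketch.

```python
# Conceptual sketch of an experiment run (the SDK does the real work).
# `tasks` and `evaluators` are dicts of plain functions; `dataset` is a list of rows.
def run_experiment_sketch(dataset, tasks, evaluators):
    results = []
    for task_name, task in tasks.items():
        for row in dataset:
            output = task(row)  # one variation, one example
            scores = {
                name: evaluate(output, row)  # every evaluator scores every output
                for name, evaluate in evaluators.items()
            }
            results.append({"task": task_name, "output": output, "scores": scores})
    return results
```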

Key components

Tasks

A task is a function that defines how your system processes inputs during an experiment. It takes data from your dataset (like prompts or questions) and produces outputs by calling your LLM or other AI components.

Tasks can be as simple as a single LLM call or as complex as multi-step workflows involving retrieval, processing, and generation. You define tasks using Python functions that take a dataset row and **kwargs.
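
As a rough illustration, a task can be a plain function like the sketch below. The `call_llm` helper and the `evaluated_model_input` field name are placeholders for your own model client and dataset schema.

```python
def call_llm(prompt: str) -> str:
    # Stand-in for a real model call; swap in your own LLM client here.
    return f"(model output for: {prompt!r})"

def answer_question(row, **kwargs):
    """A task: take one dataset row, return your system's output for it."""
    # The field name below is illustrative; use whatever schema your dataset has.
    prompt = f"Answer concisely: {row['evaluated_model_input']}"
    return call_llm(prompt)

# Quick local check with a stand-in row.
print(answer_question({"evaluated_model_input": "What is retrieval-augmented generation?"}))
```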

Evaluators

Evaluators measure how well each task performs on specific criteria like accuracy, safety, or quality. You choose which evaluators to use based on what matters for your application.

Each task output is scored by all evaluators you've selected, giving you comparable metrics across all your configurations. Learn more about evaluators.
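
The exact evaluator interface depends on the SDK you use, but conceptually an evaluator is a function that scores an output against some criterion. The toy exact-match check below is only an illustration of that shape, not a built-in Patronus evaluator.

```python
def exact_match(output: str, expected: str) -> dict:
    """Toy evaluator: full score when the output matches the expected answer.

    Real evaluators (LLM judges, safety checks, etc.) follow the same shape:
    take the task output plus any reference data, return a score or verdict.
    """
    passed = output.strip().lower() == expected.strip().lower()
    return {"score": 1.0 if passed else 0.0, "pass": passed}

print(exact_match("Paris", "paris"))  # {'score': 1.0, 'pass': True}
```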

Dataset

Your dataset is the test data your experiment runs against. It contains examples with inputs (and optionally expected outputs) that represent the real-world scenarios your application needs to handle.
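
For illustration, a small in-memory dataset might look like the sketch below. The field names are placeholders; use whatever schema your dataset actually follows.

```python
# Illustrative dataset: each row has an input and an optional expected output.
dataset = [
    {
        "evaluated_model_input": "What is the capital of France?",
        "evaluated_model_gold_answer": "Paris",
    },
    {
        "evaluated_model_input": "Name one benefit of unit tests.",
        "evaluated_model_gold_answer": "They catch regressions early.",
    },
]
```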

Common experiment workflows

A/B testing

Compare two configurations to determine which performs better. This is useful when you have two candidate approaches and want to pick the winner.

Example use case: Testing a new prompt against your current production prompt to see if it reduces hallucinations.
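
One way to structure an A/B test is to derive both variants from the same task logic so only the prompt differs, as in the sketch below. `call_llm`, the prompt texts, and the field name are placeholders for your own setup.

```python
def call_llm(prompt: str) -> str:
    return f"(model output for: {prompt!r})"  # stand-in for a real model call

# Variant A: current production prompt. Variant B: candidate prompt.
PROMPT_A = "Answer the question: {question}"
PROMPT_B = "Answer the question using only verified facts; say you don't know if unsure: {question}"

def make_task(prompt_template: str):
    """Build a task from a prompt template so both variants share the same logic."""
    def task(row, **kwargs):
        return call_llm(prompt_template.format(question=row["evaluated_model_input"]))
    return task

task_a = make_task(PROMPT_A)
task_b = make_task(PROMPT_B)
# Run task_a and task_b against the same dataset with the same evaluators,
# then compare their scores to pick the winner.
```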

Multi-variant testing

Test multiple configurations simultaneously to explore a broader solution space. This helps you understand how different factors (model choice, temperature, prompt style) affect performance.

Example use case: Testing combinations of three different prompts across two different models to find the optimal pairing.
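
A common pattern is to generate one task per combination, as in the sketch below. The prompt styles, model names, and field name are placeholders for your own candidates.

```python
from itertools import product

def call_llm(prompt: str, model: str) -> str:
    return f"({model} output for: {prompt!r})"  # stand-in for a real model call

PROMPTS = {
    "terse": "Answer briefly: {question}",
    "detailed": "Answer with step-by-step reasoning: {question}",
    "grounded": "Answer using only the provided context: {question}",
}
MODELS = ["model-small", "model-large"]

def make_task(prompt_template: str, model: str):
    def task(row, **kwargs):
        question = row["evaluated_model_input"]
        return call_llm(prompt_template.format(question=question), model)
    return task

# Six variants (3 prompts x 2 models), each run against the same dataset and
# evaluators so the scores are directly comparable.
variants = {
    f"{prompt_name}--{model}": make_task(template, model)
    for (prompt_name, template), model in product(PROMPTS.items(), MODELS)
}
```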

Understanding results

Patronus provides several ways to analyze experiment results:

  • Score comparisons: See which task scored highest on each evaluator
  • Output diffs: View model outputs side-by-side to understand differences
  • Statistical analysis: Confidence intervals help you know whether differences are meaningful (a rough sketch follows at the end of this section)
  • Filtering: Drill down into specific examples or failure cases

These tools help you move beyond aggregate scores to understand why one configuration outperforms another.
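
Patronus computes these statistics for you. If you want to sanity-check a score difference yourself from exported per-row scores, a rough normal-approximation interval like the sketch below gives a similar signal (illustrative only, not the exact method the platform uses).

```python
import math
from statistics import mean, stdev

def diff_confidence_interval(scores_a, scores_b, z=1.96):
    """Rough 95% confidence interval for the difference in mean scores (A - B),
    using a normal approximation. A quick sanity check, not a formal test."""
    diff = mean(scores_a) - mean(scores_b)
    se = math.sqrt(stdev(scores_a) ** 2 / len(scores_a) + stdev(scores_b) ** 2 / len(scores_b))
    return diff - z * se, diff + z * se

# Example: per-row evaluator scores from two task variants.
scores_a = [0.9, 0.8, 1.0, 0.7, 0.95, 0.85]
scores_b = [0.7, 0.75, 0.8, 0.6, 0.85, 0.7]
low, high = diff_confidence_interval(scores_a, scores_b)
print(f"Mean score difference (A - B): 95% CI [{low:.2f}, {high:.2f}]")
# If the interval excludes 0, the difference probably isn't noise.
```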

Next steps
