Comparisons

Understanding comparisons in Patronus AI

What are comparisons?

Comparisons help you analyze performance differences between different LLM configurations, experiments, or time periods. Instead of looking at results in isolation, comparisons show you side-by-side metrics so you can make data-driven decisions about what works best.

You can compare:

  • Different models (GPT-4 vs Claude vs Llama)
  • Different prompts or system messages
  • Different experiments or configurations
  • Performance across different time periods

Key components

Performance snapshots

A snapshot is a performance report for a specific configuration at a point in time. Think of it as freezing your evaluation results so you can compare them later.

Snapshots capture:

  • Aggregate scores across evaluators
  • Pass/fail rates
  • Sample outputs and examples
  • Metadata like model, timestamp, and tags
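
As an illustration, a snapshot can be thought of as a small record that bundles these fields together. The class and field names in this sketch are hypothetical and do not mirror the Patronus SDK:

```python
from dataclasses import dataclass, field
from datetime import datetime

# Hypothetical shape of a performance snapshot; field names are
# illustrative only, not the Patronus API.
@dataclass
class Snapshot:
    name: str                           # e.g. "gpt-4-baseline"
    model: str                          # model identifier captured in metadata
    created_at: datetime                # point in time the results were frozen
    aggregate_scores: dict[str, float]  # mean score per evaluator
    pass_rate: float                    # fraction of samples that passed
    sample_outputs: list[str] = field(default_factory=list)
    tags: dict[str, str] = field(default_factory=dict)

snapshot = Snapshot(
    name="gpt-4-baseline",
    model="gpt-4",
    created_at=datetime(2024, 6, 1),
    aggregate_scores={"answer-relevance": 0.87, "hallucination": 0.93},
    pass_rate=0.91,
)
```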

Diff view

The diff view shows how outputs differ between two configurations when given identical inputs. This makes it easy to:

  • See exactly where outputs diverge
  • Understand how prompt changes affect responses
  • Spot regressions or improvements
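
To see the idea behind the diff view, here is a minimal sketch that diffs two outputs produced for the same input using Python's standard difflib. The diff view in the Patronus UI is richer; this only illustrates the concept, and the prompt and outputs are made up:

```python
import difflib

prompt = "Summarize the refund policy in one sentence."

# Outputs from two configurations given the identical input.
output_a = "Refunds are available within 30 days of purchase with a receipt."
output_b = "Refunds are available within 14 days of purchase, receipt required."

# Word-level unified diff highlighting where the outputs diverge.
diff = difflib.unified_diff(
    output_a.split(), output_b.split(),
    fromfile="config-a", tofile="config-b", lineterm="",
)
print("\n".join(diff))
```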

Filters

Filters let you create focused snapshots by narrowing down what data to include:

  • Time range: Compare performance across different time periods
  • Evaluation criteria: Focus on specific evaluators
  • Tags: Group by custom labels
  • Projects: Compare across different projects
  • Experiments: Analyze experiment results
  • Datasets: Filter by specific test datasets
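
Conceptually, a filter is a set of criteria applied to raw evaluation records before aggregation. The sketch below uses plain Python over an in-memory list; the record structure and field names are invented for illustration:

```python
from datetime import datetime

# Hypothetical evaluation records; field names are illustrative only.
records = [
    {"evaluator": "hallucination", "score": 0.9, "tags": {"env": "prod"},
     "created_at": datetime(2024, 6, 2), "dataset": "support-faq"},
    {"evaluator": "answer-relevance", "score": 0.7, "tags": {"env": "staging"},
     "created_at": datetime(2024, 5, 20), "dataset": "support-faq"},
]

# Narrow down to a focused snapshot: one evaluator, one tag, one time range.
start, end = datetime(2024, 6, 1), datetime(2024, 6, 30)
filtered = [
    r for r in records
    if r["evaluator"] == "hallucination"
    and r["tags"].get("env") == "prod"
    and start <= r["created_at"] <= end
]
print(len(filtered), "records match the filter")
```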

How comparisons work

Using comparisons is straightforward:

  1. Create snapshots: Define filters to create performance snapshots for each configuration
  2. Add snapshots: Compare up to 5 snapshots side-by-side
  3. Analyze: View aggregate metrics, score distributions, and pass/fail rates
  4. Drill down: Use diff view to understand specific differences between outputs
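
The comparison itself amounts to computing the same aggregate metrics for each snapshot and placing them side by side. A minimal sketch, assuming each snapshot is simply a list of per-sample results (the snapshot names and numbers are hypothetical):

```python
# Hypothetical per-sample results for each snapshot: (score, passed).
snapshots = {
    "gpt-4-baseline": [(0.92, True), (0.85, True), (0.40, False)],
    "claude-variant": [(0.88, True), (0.90, True), (0.75, True)],
}

# Up to 5 snapshots can be compared side by side; compute aggregates per snapshot.
for name, results in list(snapshots.items())[:5]:
    scores = [score for score, _ in results]
    pass_rate = sum(1 for _, passed in results if passed) / len(results)
    print(f"{name:16s} mean score: {sum(scores)/len(scores):.2f}  pass rate: {pass_rate:.0%}")
```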

Common use cases

Model selection

Compare different models on your dataset to select the best performer. This helps you balance quality, cost, and latency for your specific use case.

Prompt optimization

A/B test different prompt variations to find the most effective phrasing. Small changes in prompts can significantly impact output quality.

Regression testing

Track performance over time to catch regressions when you update models, prompts, or system configurations. This ensures new changes don't hurt existing performance.
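
In practice this often takes the form of a simple gate in CI: compare the latest snapshot's pass rate against a stored baseline and fail if it drops by more than an allowed margin. A sketch with hypothetical numbers and threshold:

```python
# Hypothetical pass rates pulled from a baseline snapshot and the latest one.
baseline_pass_rate = 0.91
current_pass_rate = 0.86
allowed_drop = 0.03  # tolerate up to 3 percentage points of regression

if baseline_pass_rate - current_pass_rate > allowed_drop:
    raise SystemExit(
        f"Regression detected: pass rate fell from {baseline_pass_rate:.0%} "
        f"to {current_pass_rate:.0%}"
    )
print("No significant regression")
```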

Cost-performance tradeoffs

Balance evaluation scores with cost and latency metrics to find the optimal configuration. Sometimes a slightly cheaper or faster model performs nearly as well as a premium option.
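
One simple way to make the tradeoff explicit is to fold score, cost, and latency into a single weighted value per configuration. The configurations, metrics, and weights below are arbitrary placeholders; choose weights that reflect your own priorities:

```python
# Hypothetical per-configuration metrics: evaluation score (0-1),
# cost per 1K requests (USD), and p95 latency (seconds).
configs = {
    "premium-model": {"score": 0.93, "cost": 12.0, "latency": 2.1},
    "budget-model":  {"score": 0.90, "cost": 3.0,  "latency": 1.2},
}

# Arbitrary weights: reward quality, penalize cost and latency.
def utility(m, w_score=1.0, w_cost=0.02, w_latency=0.05):
    return w_score * m["score"] - w_cost * m["cost"] - w_latency * m["latency"]

best = max(configs, key=lambda name: utility(configs[name]))
print("Best tradeoff:", best)
```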
