Concepts
Understanding comparisons in Patronus AI
What are comparisons?
Comparisons help you analyze performance differences across LLM configurations, experiments, or time periods. Instead of looking at results in isolation, comparisons show you side-by-side metrics so you can make data-driven decisions about what works best.
You can compare:
- Different models (GPT-4 vs Claude vs Llama)
- Different prompts or system messages
- Different experiments or configurations
- Performance over time periods
Key components
Performance snapshots
A snapshot is a performance report for a specific configuration at a point in time. Think of it as freezing your evaluation results so you can compare them later.
Snapshots capture:
- Aggregate scores across evaluators
- Pass/fail rates
- Sample outputs and examples
- Metadata like model, timestamp, and tags
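The exact contents depend on your setup, but conceptually a snapshot is a small record of aggregate results plus the metadata that identifies the configuration. A minimal sketch of that shape (the `Snapshot` class and its field names are illustrative, not the Patronus API):
```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Snapshot:
    """Illustrative snapshot record: aggregate results plus identifying metadata."""
    name: str
    model: str
    created_at: datetime
    tags: list[str] = field(default_factory=list)
    pass_rate: float = 0.0                                        # fraction of evaluations that passed
    mean_scores: dict[str, float] = field(default_factory=dict)   # per-evaluator average scores
    sample_outputs: list[str] = field(default_factory=list)       # a few representative outputs


baseline = Snapshot(
    name="gpt-4-baseline",
    model="gpt-4",
    created_at=datetime.now(timezone.utc),
    tags=["prod-prompt-v1"],
    pass_rate=0.92,
    mean_scores={"hallucination": 0.95, "relevance": 0.88},
)
```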
Diff view
The diff view shows how outputs differ between two configurations when given identical inputs. This makes it easy to:
- See exactly where outputs diverge
- Understand how prompt changes affect responses
- Spot regressions or improvements
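Conceptually, a diff of this kind pairs outputs by their shared input and highlights where they diverge. Here is a rough sketch of that idea using Python's standard `difflib`; the `diff_outputs` helper and the run dictionaries are hypothetical, not how Patronus implements its diff view:
```python
import difflib


def diff_outputs(run_a: dict[str, str], run_b: dict[str, str]) -> dict[str, str]:
    """Pair two runs by input and return a unified diff for each input whose outputs differ."""
    diffs = {}
    for prompt, output_a in run_a.items():
        output_b = run_b.get(prompt)
        if output_b is None or output_a == output_b:
            continue  # missing in run B, or identical output: nothing to show
        diffs[prompt] = "\n".join(
            difflib.unified_diff(
                output_a.splitlines(), output_b.splitlines(),
                fromfile="config_a", tofile="config_b", lineterm="",
            )
        )
    return diffs


run_a = {"Summarize the ticket": "The user cannot log in.\nReset requested."}
run_b = {"Summarize the ticket": "The user cannot log in.\nPassword reset was requested."}
print(diff_outputs(run_a, run_b)["Summarize the ticket"])
```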
Filters
Filters let you create focused snapshots by narrowing down what data to include:
- Time range: Compare performance across different time periods
- Evaluation criteria: Focus on specific evaluators
- Tags: Group by custom labels
- Projects: Compare across different projects
- Experiments: Analyze experiment results
- Datasets: Filter by specific test datasets
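A snapshot definition is essentially a bundle of these filters. A hypothetical example of what such a filter set might look like (the keys and values below are illustrative, not the Patronus API):
```python
# Hypothetical filter definition for a focused snapshot; key names and values are
# illustrative, not the Patronus API.
snapshot_filters = {
    "time_range": {"start": "2024-05-01", "end": "2024-05-31"},
    "evaluators": ["hallucination", "answer-relevance"],
    "tags": ["prompt-v2", "customer-support"],
    "project": "support-bot",
    "experiment": "prompt-v2-rollout",
    "dataset": "support-tickets-eval",
}
```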
How comparisons work
Using comparisons is straightforward:
1. Create snapshots: Define filters to create a performance snapshot for each configuration
2. Add snapshots: Add up to 5 snapshots to a comparison to view them side-by-side
3. Analyze: Review aggregate metrics, score distributions, and pass/fail rates
4. Drill down: Use the diff view to understand specific differences between outputs
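To make the workflow concrete, here is a hedged sketch of what a side-by-side comparison of snapshot aggregates boils down to. The data and the five-snapshot check mirror the steps above, but the code itself is illustrative rather than the Patronus SDK:
```python
# Hypothetical side-by-side comparison of snapshot aggregates (names and numbers are illustrative).
MAX_SNAPSHOTS = 5  # comparisons support up to 5 snapshots side-by-side

snapshots = [
    {"name": "gpt-4-baseline", "pass_rate": 0.92, "mean_score": 0.90},
    {"name": "claude-candidate", "pass_rate": 0.94, "mean_score": 0.91},
    {"name": "llama-candidate", "pass_rate": 0.88, "mean_score": 0.85},
]

assert len(snapshots) <= MAX_SNAPSHOTS, "Comparisons are limited to 5 snapshots"

# Print a simple side-by-side table of aggregate metrics, best pass rate first.
print(f"{'snapshot':<20}{'pass rate':>12}{'mean score':>12}")
for snap in sorted(snapshots, key=lambda s: s["pass_rate"], reverse=True):
    print(f"{snap['name']:<20}{snap['pass_rate']:>12.2%}{snap['mean_score']:>12.2f}")
```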
Common use cases
Model selection
Compare different models on your dataset to select the best performer. This helps you balance quality, cost, and latency for your specific use case.
Prompt optimization
A/B test different prompt variations to find the most effective phrasing. Small changes in prompts can significantly impact output quality.
Regression testing
Track performance over time to catch regressions when you update models, prompts, or system configurations. This helps ensure new changes don't degrade existing performance.
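One lightweight way to act on this is to gate changes on a comparison against a stored baseline snapshot. The sketch below is an assumption-laden example: the `check_regression` helper, the 0.02 tolerance, and the scores are all illustrative:
```python
def check_regression(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """Return the evaluators where the candidate's mean score dropped by more than `tolerance`."""
    regressions = []
    for evaluator, baseline_score in baseline.items():
        candidate_score = candidate.get(evaluator, 0.0)
        if baseline_score - candidate_score > tolerance:
            regressions.append(f"{evaluator}: {baseline_score:.2f} -> {candidate_score:.2f}")
    return regressions


baseline_scores = {"hallucination": 0.95, "relevance": 0.88}
candidate_scores = {"hallucination": 0.90, "relevance": 0.89}

failures = check_regression(baseline_scores, candidate_scores)
if failures:
    print("Regression detected:")
    for line in failures:
        print("  " + line)
else:
    print("No regressions beyond tolerance.")
```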
Cost-performance tradeoffs
Balance evaluation scores with cost and latency metrics to find the optimal configuration. Sometimes a slightly cheaper or faster model performs nearly as well as a premium option.
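A simple first pass at the tradeoff is to rank configurations by quality per unit cost. The sketch below is illustrative only; the configuration names, prices, and latencies are made up:
```python
# Hypothetical cost-performance comparison; all numbers are illustrative.
configs = [
    {"name": "premium-model", "mean_score": 0.93, "cost_per_1k": 0.060, "p50_latency_s": 2.1},
    {"name": "mid-tier-model", "mean_score": 0.91, "cost_per_1k": 0.015, "p50_latency_s": 1.2},
    {"name": "small-model", "mean_score": 0.84, "cost_per_1k": 0.004, "p50_latency_s": 0.6},
]

# Quality per dollar: a crude but useful first-pass ranking.
for cfg in configs:
    cfg["score_per_dollar"] = cfg["mean_score"] / cfg["cost_per_1k"]

for cfg in sorted(configs, key=lambda c: c["score_per_dollar"], reverse=True):
    print(f"{cfg['name']:<16} score={cfg['mean_score']:.2f} "
          f"cost/1k=${cfg['cost_per_1k']:.3f} score/$={cfg['score_per_dollar']:.0f}")
```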
