Concepts
Understanding comparisons in Patronus AI
What are comparisons?
Comparisons help you analyze performance differences across LLM configurations, experiments, or time periods. Instead of looking at results in isolation, comparisons show you side-by-side metrics so you can make data-driven decisions about what works best.
You can compare:
- Different models (GPT-4 vs Claude vs Llama)
- Different prompts or system messages
- Different experiments or configurations
- Performance over time periods
Key components
Performance snapshots
A snapshot is a performance report for a specific configuration at a point in time. Think of it as freezing your evaluation results so you can compare them later.
Snapshots capture:
- Aggregate scores across evaluators
- Pass/fail rates
- Sample outputs and examples
- Metadata like model, timestamp, and tags
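The exact contents depend on your setup, but conceptually a snapshot is a small record of aggregate results plus the metadata that identifies the configuration. A minimal sketch of that shape (the `Snapshot` class and its field names are illustrative, not the Patronus API):
```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Snapshot:
    """Illustrative snapshot record: aggregate results plus identifying metadata."""
    name: str
    model: str
    created_at: datetime
    tags: list[str] = field(default_factory=list)
    pass_rate: float = 0.0                                        # fraction of evaluations that passed
    mean_scores: dict[str, float] = field(default_factory=dict)   # per-evaluator average scores
    sample_outputs: list[str] = field(default_factory=list)       # a few representative outputs


baseline = Snapshot(
    name="gpt-4-baseline",
    model="gpt-4",
    created_at=datetime.now(timezone.utc),
    tags=["prod-prompt-v1"],
    pass_rate=0.92,
    mean_scores={"hallucination": 0.95, "relevance": 0.88},
)
```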
Diff view
The diff view shows how outputs differ between two configurations when given identical inputs. This makes it easy to:
- See exactly where outputs diverge
- Understand how prompt changes affect responses
- Spot regressions or improvements
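Conceptually, a diff of this kind pairs outputs by their shared input and highlights where they diverge. Here is a rough sketch of that idea using Python's standard `difflib`; the `diff_outputs` helper and the run dictionaries are hypothetical, not how Patronus implements its diff view:
```python
import difflib


def diff_outputs(run_a: dict[str, str], run_b: dict[str, str]) -> dict[str, str]:
    """Pair two runs by input and return a unified diff for each input whose outputs differ."""
    diffs = {}
    for prompt, output_a in run_a.items():
        output_b = run_b.get(prompt)
        if output_b is None or output_a == output_b:
            continue  # missing in run B, or identical output: nothing to show
        diffs[prompt] = "\n".join(
            difflib.unified_diff(
                output_a.splitlines(), output_b.splitlines(),
                fromfile="config_a", tofile="config_b", lineterm="",
            )
        )
    return diffs


run_a = {"Summarize the ticket": "The user cannot log in.\nReset requested."}
run_b = {"Summarize the ticket": "The user cannot log in.\nPassword reset was requested."}
print(diff_outputs(run_a, run_b)["Summarize the ticket"])
```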
Filters
Filters let you create focused snapshots by narrowing down what data to include:
- Time range: Compare performance across different time periods
- Evaluation criteria: Focus on specific evaluators
- Tags: Group by custom labels
- Projects: Compare across different projects
- Experiments: Analyze experiment results
- Datasets: Filter by specific test datasets
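A snapshot definition is essentially a bundle of these filters. A hypothetical example of what such a filter set might look like (the keys and values below are illustrative, not the Patronus API):
```python
# Hypothetical filter definition for a focused snapshot; key names and values are
# illustrative, not the Patronus API.
snapshot_filters = {
    "time_range": {"start": "2024-05-01", "end": "2024-05-31"},
    "evaluators": ["hallucination", "answer-relevance"],
    "tags": ["prompt-v2", "customer-support"],
    "project": "support-bot",
    "experiment": "prompt-v2-rollout",
    "dataset": "support-tickets-eval",
}
```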
How comparisons work
Using comparisons is straightforward:
1. Create snapshots: Define filters to create a performance snapshot for each configuration
2. Add snapshots: Add up to 5 snapshots to a comparison to view them side-by-side
3. Analyze: Review aggregate metrics, score distributions, and pass/fail rates
4. Drill down: Use the diff view to understand specific differences between outputs
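To make the workflow concrete, here is a hedged sketch of what a side-by-side comparison of snapshot aggregates boils down to. The data and the five-snapshot check mirror the steps above, but the code itself is illustrative rather than the Patronus SDK:
```python
# Hypothetical side-by-side comparison of snapshot aggregates (names and numbers are illustrative).
MAX_SNAPSHOTS = 5  # comparisons support up to 5 snapshots side-by-side

snapshots = [
    {"name": "gpt-4-baseline", "pass_rate": 0.92, "mean_score": 0.90},
    {"name": "claude-candidate", "pass_rate": 0.94, "mean_score": 0.91},
    {"name": "llama-candidate", "pass_rate": 0.88, "mean_score": 0.85},
]

assert len(snapshots) <= MAX_SNAPSHOTS, "Comparisons are limited to 5 snapshots"

# Print a simple side-by-side table of aggregate metrics, best pass rate first.
print(f"{'snapshot':<20}{'pass rate':>12}{'mean score':>12}")
for snap in sorted(snapshots, key=lambda s: s["pass_rate"], reverse=True):
    print(f"{snap['name']:<20}{snap['pass_rate']:>12.2%}{snap['mean_score']:>12.2f}")
```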
Common use cases
Model selection
Compare different models on your dataset to select the best performer. This helps you balance quality, cost, and latency for your specific use case.
Prompt optimization
A/B test different prompt variations to find the most effective phrasing. Small changes in prompts can significantly impact output quality.
Regression testing
Track performance over time to catch regressions when you update models, prompts, or system configurations. This helps ensure new changes don't degrade existing performance.
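One lightweight way to act on this is to gate changes on a comparison against a stored baseline snapshot. The sketch below is an assumption-laden example: the `check_regression` helper, the 0.02 tolerance, and the scores are all illustrative:
```python
def check_regression(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.02) -> list[str]:
    """Return the evaluators where the candidate's mean score dropped by more than `tolerance`."""
    regressions = []
    for evaluator, baseline_score in baseline.items():
        candidate_score = candidate.get(evaluator, 0.0)
        if baseline_score - candidate_score > tolerance:
            regressions.append(f"{evaluator}: {baseline_score:.2f} -> {candidate_score:.2f}")
    return regressions


baseline_scores = {"hallucination": 0.95, "relevance": 0.88}
candidate_scores = {"hallucination": 0.90, "relevance": 0.89}

failures = check_regression(baseline_scores, candidate_scores)
if failures:
    print("Regression detected:")
    for line in failures:
        print("  " + line)
else:
    print("No regressions beyond tolerance.")
```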
Cost-performance tradeoffs
Balance evaluation scores with cost and latency metrics to find the optimal configuration. Sometimes a slightly cheaper or faster model performs nearly as well as a premium option.
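A simple first pass at the tradeoff is to rank configurations by quality per unit cost. The sketch below is illustrative only; the configuration names, prices, and latencies are made up:
```python
# Hypothetical cost-performance comparison; all numbers are illustrative.
configs = [
    {"name": "premium-model", "mean_score": 0.93, "cost_per_1k": 0.060, "p50_latency_s": 2.1},
    {"name": "mid-tier-model", "mean_score": 0.91, "cost_per_1k": 0.015, "p50_latency_s": 1.2},
    {"name": "small-model", "mean_score": 0.84, "cost_per_1k": 0.004, "p50_latency_s": 0.6},
]

# Quality per dollar: a crude but useful first-pass ranking.
for cfg in configs:
    cfg["score_per_dollar"] = cfg["mean_score"] / cfg["cost_per_1k"]

for cfg in sorted(configs, key=lambda c: c["score_per_dollar"], reverse=True):
    print(f"{cfg['name']:<16} score={cfg['mean_score']:.2f} "
          f"cost/1k=${cfg['cost_per_1k']:.3f} score/$={cfg['score_per_dollar']:.0f}")
```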
