Comparisons
Our Comparisons feature lets you compare LLM performance across different configurations and experiments. A Snapshot is an evaluation summary report for a specific LLM configuration.
1. Create a Snapshot
To create a new performance snapshot, you need to have some results logged. If you don't have any yet, run experiments from our Experiments Quick Start guide first.
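The sketch below shows the general shape of logging results an experiment run produces. It assumes the `patronus` Python SDK with a `Client.experiment` call like the one in the Experiments Quick Start; the project name, task function, and evaluator alias are illustrative, and the exact call signature may differ across SDK versions.

```python
# A minimal sketch of logging results that Comparisons can snapshot.
# Assumes the `patronus` Python SDK; exact names may vary by version.
from patronus import Client

client = Client()  # typically reads PATRONUS_API_KEY from the environment

def my_task(row, **kwargs):
    # Stand-in for a real LLM call; return the model's output as a string.
    return "stub answer to: " + row.evaluated_model_input

client.experiment(
    "My Project",  # the project your snapshot filter will require
    dataset=[{"evaluated_model_input": "What is the capital of France?"}],
    task=my_task,
    evaluators=["answer-quality"],  # hypothetical evaluator alias
)
```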
Start using the Comparisons feature by navigating to the Comparisons tab.
Click on "Search" to set filters that describe what you want to compare. Note that you must select a project ID. Examples of filters include (see the tagging sketch after this list):
- time range
- evaluation criteria
- tags
- scores
- projects
- experiments
- datasets
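These filters match metadata recorded when results were logged, so consistent naming and tagging pays off. Below is a hedged sketch of attaching that metadata at experiment time, using the same assumed SDK surface as above; the `experiment_name` and `tags` parameters and their values are illustrative.

```python
from patronus import Client

client = Client()

# Metadata attached at logging time is what the filters above match on.
# Parameter names and values here are illustrative.
client.experiment(
    "My Project",                        # satisfies the required project filter
    dataset=[{"evaluated_model_input": "Ping?"}],
    task=lambda row, **kwargs: "Pong.",
    evaluators=["answer-quality"],       # hypothetical evaluator alias
    experiment_name="gpt-4o-baseline",   # matched by the experiments filter
    tags={"model": "gpt-4o", "prompt_version": "v2"},  # matched by the tags filter
)
```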
When you add multiple performance snapshots, you can view them side by side. Use the side-by-side view to determine the best LLM for your GenAI application and to track changes in LLM performance over time.
To compare with another performance snapshot, click on "Add snapshot". For example, you might compare an experiment running Claude-3.5 against one running GPT-4o.
You can see the total pass and fail percentages across evaluations, plus a breakdown of exactly which evaluators performed poorly. You can give your performance snapshots descriptive names and add up to 5 snapshots at once.
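To produce two snapshots worth comparing, you can run the same dataset through two model configurations and then add each resulting experiment as a snapshot. This is a hedged sketch under the same SDK assumptions as above; `call_model` is a hypothetical stand-in for your actual LLM provider call.

```python
from patronus import Client

client = Client()
dataset = [{"evaluated_model_input": "Summarize the plot of Hamlet in one sentence."}]

def call_model(model_name, prompt):
    # Hypothetical stand-in for your real provider call (Anthropic, OpenAI, ...).
    return f"[{model_name}] answer to: {prompt}"

def make_task(model_name):
    def task(row, **kwargs):
        return call_model(model_name, row.evaluated_model_input)
    return task

# Identical inputs and evaluators, one experiment per model configuration.
for model in ["claude-3.5-sonnet", "gpt-4o"]:
    client.experiment(
        "My Project",
        dataset=dataset,
        task=make_task(model),
        evaluators=["answer-quality"],   # hypothetical evaluator alias
        experiment_name=f"{model}-run",
        tags={"model": model},
    )
```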
2. Diff View
Patronus supports a diff view that shows the difference in outputs between two models or experiments on the same set of inputs. This can reveal how different prompts and configuration parameters change model behavior.
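The diff view itself lives in the Patronus UI, but as a rough local analogue you can diff two runs' outputs with Python's standard difflib; the output strings below are illustrative.

```python
import difflib

# Outputs from two runs over the same inputs (illustrative strings).
outputs_a = ["The capital of France is Paris.", "2 + 2 = 4"]
outputs_b = ["Paris is the capital of France.", "2 + 2 = 4"]

for a, b in zip(outputs_a, outputs_b):
    diff = difflib.unified_diff([a], [b], fromfile="claude-3.5", tofile="gpt-4o", lineterm="")
    print("\n".join(diff) or "(identical)")
```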
3. Export Results
The comparisons view shows charts of metric visualizations along with aggregate statistics and performance deltas. You can download these charts and results as a PDF containing an evaluation summary report for your performance snapshots. Share it with your team to make informed business decisions!