Benchmarking Models
Frontier labs continue to release new models with claims of improved benchmark performance. As these models emerge, application developers need reliable ways to evaluate how they perform on both standard benchmarks and their own task-specific datasets.
Patronus Experiments enable developers to compare models and prompts side by side using standard benchmarks such as SWE-bench, MMLU, and Humanity's Last Exam, or custom golden data brought into the platform.
0. Initialize Environment
We’ll use the OpenAI SDK along with Patronus. Start by importing the required packages, initializing a Patronus project, and instrumenting OpenAI so all requests are traced.
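Below is a minimal sketch of that setup, assuming the Patronus Python SDK's init hook and the OpenInference OpenAI instrumentor; the project name is illustrative, and the exact integration parameters may vary by SDK version.

```python
import patronus
from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor

# Initialize a Patronus project; the project name is illustrative.
# The SDK reads PATRONUS_API_KEY from the environment by default.
patronus.init(
    project_name="model-benchmarking",
    integrations=[OpenAIInstrumentor()],  # trace every OpenAI request
)

# Standard OpenAI client; reads OPENAI_API_KEY from the environment.
client = OpenAI()
```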
1. Define Datasets and Models
We'll use the pre-loaded FinanceBench eval set from the Patronus platform.
You can also use standard benchmarks like MMLU or define your own golden dataset.
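A sketch of the dataset and model setup, assuming the SDK's remote dataset loader; the dataset identifier is an assumption, so check the Datasets page in the platform for the exact name, or pass your own golden dataset instead.

```python
from patronus.datasets import RemoteDatasetLoader

# FinanceBench ships as a pre-loaded dataset on the Patronus platform.
# The identifier below is an assumption; verify it against your Datasets page.
financebench = RemoteDatasetLoader("financebench")

# Models to compare; swap in any models available to your OpenAI account.
MODELS = ["gpt-4.1", "gpt-5"]
```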
2. Define the Task and Run the Experiment
We’ll write a simple task that calls the OpenAI API. For each row of eval data, the task uses the model input and retrieved context to generate a model response. We'll then run this task over the full eval set once per model.
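One way to sketch this is a small task factory, assuming the row fields follow the Patronus dataset schema (task_input for the question, task_context for retrieved context); adjust the field names if your dataset differs.

```python
DEFAULT_SYSTEM_PROMPT = "Answer the question using only the provided context."


def make_task(model: str, system_prompt: str = DEFAULT_SYSTEM_PROMPT):
    """Build an experiment task that answers each eval question with the given model."""

    def task(row, **kwargs):
        # Field names (task_input, task_context) are assumptions based on the
        # Patronus dataset schema; adjust them if your golden dataset differs.
        context = row.task_context
        if isinstance(context, list):
            context = "\n".join(context)

        response = client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {row.task_input}",
                },
            ],
        )
        return response.choices[0].message.content

    return task
```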
We can now run an experiment to track how many eval questions each model gets correct, using a fuzzy match evaluator to compare outputs against gold answers.
Notice that we add tags and a specific experiment name to distinguish this run from future runs.
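A sketch of the experiment loop, assuming the SDK's run_experiment helper and a managed fuzzy-match criterion on the judge evaluator; the evaluator and criterion names are assumptions, so substitute whichever fuzzy-match evaluator is configured on your account.

```python
from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment

# Evaluator and criterion names are assumptions; use whichever fuzzy-match
# evaluator is configured for your Patronus account.
fuzzy_match = RemoteEvaluator("judge", "patronus:fuzzy-match")

# One experiment per model, each with a distinct name and tags for filtering.
for model in MODELS:
    run_experiment(
        dataset=financebench,
        task=make_task(model),
        evaluators=[fuzzy_match],
        experiment_name=f"financebench-{model}",
        tags={"model": model, "dataset": "financebench"},
    )
```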
3. View Comparison in Patronus UI
After both experiments are complete, we can compare results in the Patronus UI. By adding two snapshots and using filters to select our experiments, we see that, surprisingly, GPT-4.1 outperformed GPT-5 on this domain-specific eval.

We can also:
- Compare outputs side-by-side
- Add preferences
- View judge explanations for each decision

Wrap Up
This flow — import eval data → define a task → run experiment → change model → re-run — is the standard loop for benchmarking model performance with Patronus.
It can also be extended to measure how different prompts, temperatures, or retrieved context affect performance on real-world tasks.
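As a sketch of one such extension, the hypothetical prompt sweep below re-runs the same experiment with different system prompts, reusing the make_task factory and fuzzy-match evaluator from above and tagging each run so the variants can be compared in the UI.

```python
# Hypothetical prompt sweep: re-run the experiment with different system
# prompts and tag each run so the variants can be filtered side by side.
SYSTEM_PROMPTS = {
    "terse": "Answer with a single number or short phrase.",
    "reasoned": "Think step by step, then state the final answer on its own line.",
}

for label, system_prompt in SYSTEM_PROMPTS.items():
    run_experiment(
        dataset=financebench,
        task=make_task("gpt-4.1", system_prompt=system_prompt),
        evaluators=[fuzzy_match],
        experiment_name=f"financebench-gpt-4.1-{label}",
        tags={"model": "gpt-4.1", "prompt_variant": label},
    )
```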
