
Benchmarking Models

Frontier labs continue to release new models with claims of improved benchmark performance. As these models emerge, application developers need reliable ways to evaluate how they perform on both standard benchmarks and their own task-specific datasets.

Patronus Experiments let developers compare models and prompts side by side, using standard benchmarks such as SWEBench, MMLU, and Humanity's Last Exam, or custom golden data brought into the platform.


0. Initialize Environment

We’ll use the OpenAI SDK along with Patronus. Start by importing the required packages, initializing a Patronus project, and instrumenting OpenAI so all requests are traced.

import os

# --------------------------
# Patronus tracing
# --------------------------
from openinference.instrumentation.openai import OpenAIInstrumentor
import patronus
from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
from patronus.datasets import RemoteDatasetLoader

# --------------------------
# OpenAI client
# --------------------------
from openai import OpenAI


# Read the API key from the environment rather than hard-coding it.
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY", "")
if not OPENAI_API_KEY:
    raise RuntimeError("Please set OPENAI_API_KEY in your environment.")

client = OpenAI(api_key=OPENAI_API_KEY)

# Initialize a Patronus project and instrument OpenAI so all requests are traced.
PROJECT_NAME = "cf-benchmark-models"
patronus.init(integrations=[OpenAIInstrumentor()], project_name=PROJECT_NAME)
log = patronus.get_logger()

1. Define Datasets and Models

We'll use the pre-loaded FinanceBench eval set from the Patronus platform.

You can also use standard benchmarks like MMLU or define your own golden dataset; a sketch of the latter follows the code below.

 
datasets = [
    "financebench",
    # Optional extra datasets:
    # "halubench-pubmedqa",
    # "legal-confidentiality",
]
 
models = [
    "gpt-5",
    "gpt-4.1"
]
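
If you'd rather benchmark against your own golden dataset, you can define it in code instead of loading a remote one. The sketch below assumes run_experiment accepts an in-memory list of records with the same field names used elsewhere in this guide (task_input, task_context, gold_answer); the questions and answers are purely illustrative.

# A tiny hand-built golden dataset. Field names mirror the row attributes
# used by the task below (task_input, task_context), plus a gold answer for
# the fuzzy-match evaluator; adjust them to match your own schema.
golden_dataset = [
    {
        "task_input": "What was the company's FY2023 total revenue?",
        "task_context": "The FY2023 10-K reports total revenue of $4.2B.",
        "gold_answer": "$4.2B",
    },
    {
        "task_input": "Did operating margin improve year over year?",
        "task_context": "Operating margin rose from 12% in FY2022 to 15% in FY2023.",
        "gold_answer": "Yes, it improved from 12% to 15%.",
    },
]

# To use it, pass golden_dataset as the dataset argument to run_experiment
# below in place of RemoteDatasetLoader(dataset).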

2. Define the Task and Run the Experiments

We’ll write a simple task that calls the OpenAI API. For each row of eval data, the task uses the model input and retrieved context to generate a model response. We'll then run an experiment for each combination of eval dataset and model.

We can now run an experiment to track how many eval questions each model gets correct, using a fuzzy match evaluator to compare outputs against gold answers.

Notice that we add tags and a specific experiment name to distinguish this run from future runs.

 
for dataset in datasets:
    for model in models:
        def qa_task(row, **kwargs) -> str:
            """Query the model with the row's input and retrieved context."""
            system_prompt = "Answer the user questions as accurately as possible."
            user_prompt = f"{row.task_input} \n\n Retrieved context: {row.task_context}"
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt},
                ],
            )
            return response.choices[0].message.content.strip()

        await run_experiment(
            project_name=PROJECT_NAME,
            dataset=RemoteDatasetLoader(dataset),
            task=qa_task,
            evaluators=[
                RemoteEvaluator("judge", "patronus:fuzzy-match").load(),
            ],
            tags={"dataset-id": dataset, "model": model},
            experiment_name=f"Benchmark {model} on {dataset}",
        )

3. View Comparison in Patronus UI

After both experiments are complete, we can compare results in the Patronus UI. By adding two snapshots and using filters to select our experiments, we see that, surprisingly, GPT-4.1 outperformed GPT-5 on this domain-specific eval.

Compare Results

We can also:

  • Compare outputs side-by-side
  • Add preferences
  • View judge explanations for each decision

Side by Side Comparison of Outputs

Wrap Up

This flow — import eval data → define a task → run experiment → change model → re-run — is the standard loop for benchmarking model performance with Patronus.

It can also be extended to measure how different prompts, temperatures, or retrieved context affect performance on real-world tasks.
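
For example, a temperature sweep only requires parameterizing the task and tagging each run. Here is a minimal sketch that reuses the objects defined above; the temperature values are illustrative, and not every model exposes a temperature parameter.

temperatures = [0.0, 0.7]  # illustrative values

for dataset in datasets:
    for model in models:
        for temperature in temperatures:
            def qa_task(row, **kwargs) -> str:
                """Query the model at a specific temperature."""
                response = client.chat.completions.create(
                    model=model,
                    temperature=temperature,
                    messages=[
                        {"role": "system", "content": "Answer the user questions as accurately as possible."},
                        {"role": "user", "content": f"{row.task_input} \n\n Retrieved context: {row.task_context}"},
                    ],
                )
                return response.choices[0].message.content.strip()

            await run_experiment(
                project_name=PROJECT_NAME,
                dataset=RemoteDatasetLoader(dataset),
                task=qa_task,
                evaluators=[RemoteEvaluator("judge", "patronus:fuzzy-match").load()],
                tags={"dataset-id": dataset, "model": model, "temperature": str(temperature)},
                experiment_name=f"Benchmark {model} (temp={temperature}) on {dataset}",
            )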
