
Benchmarking Models

Frontier labs continue to release new models with claims of improved benchmark performance. As these models emerge, application developers need reliable ways to evaluate how they perform on both standard benchmarks and their own task-specific datasets.

Patronus Experiments enable developers to compare models and prompts side by side using standard benchmarks such as SWE-bench, MMLU, and Humanity's Last Exam, or with custom golden data brought into the platform.

The full script in this guide is a single file: the main code blocks in steps 0–2 concatenate into a runnable module.

Setup

Install dependencies:

pip install 'patronus[experiments]' openai openinference-instrumentation-openai

Set environment variables:

export PATRONUS_API_KEY=<YOUR_API_KEY>
export OPENAI_API_KEY=<YOUR_OPENAI_KEY>

0. Initialize Environment

Import the required packages, initialize a Patronus project, and instrument OpenAI so all requests are traced.

import asyncio
import os
 
from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor
 
import patronus
from patronus.datasets import RemoteDatasetLoader
from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
 
OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("Please set OPENAI_API_KEY in your environment.")
 
PATRONUS_PROJECT_NAME = "benchmark-models"
 
openai_client = OpenAI(api_key=OPENAI_API_KEY)
patronus.init(
    project_name=PATRONUS_PROJECT_NAME,
    integrations=[OpenAIInstrumentor()],
)

1. Define Datasets and Models

We'll use the pre-loaded FinanceBench eval set from the Patronus platform.

You can also use standard benchmarks like MMLU or define your own golden dataset.

datasets = ["financebench"]  # add "halubench-pubmedqa", "legal-confidentiality", etc.
 
# Use concrete, currently-available model IDs; replace with whatever you're benchmarking.
models = ["gpt-5", "gpt-5-mini"]

2. Define the Experiment Task and Run Experiments

We write a task that calls the OpenAI API for each eval row.

Because model is a loop variable, we wrap the task definition in a factory (make_qa_task) so each task closure captures its own model rather than the shared loop binding.

def make_qa_task(model: str):
    """Return a qa_task bound to a specific model."""
 
    def qa_task(*, row, parent, tags) -> str:
        # `parent` and `tags` are supplied by the experiment framework; unused here.
        system_prompt = "Answer the user's question as accurately as possible."
        user_prompt = (
            f"{row.task_input}\n\n"
            f"Retrieved context: {row.task_context}"
        )
        response = openai_client.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_prompt},
            ],
        )
        content = response.choices[0].message.content
        return content.strip() if content else ""  # content can be None on some responses
 
    return qa_task
 
 
async def main():
    for dataset in datasets:
        for model in models:
            await run_experiment(
                project_name=PATRONUS_PROJECT_NAME,
                dataset=RemoteDatasetLoader(dataset),
                task=make_qa_task(model),
                evaluators=[RemoteEvaluator("judge", "patronus:fuzzy-match").load()],
                tags={"dataset-id": dataset, "model": model},
                experiment_name=f"Benchmark {model} on {dataset}",
            )
 
 
if __name__ == "__main__":
    asyncio.run(main())

3. View Comparison in Patronus UI

After the experiments complete, open the Patronus UI to compare results side by side. Adding two experiment snapshots and filtering by model lets you see how each model ranks on the same dataset.

Compare Results

You can also:

  • Compare outputs side-by-side
  • Add preferences
  • View judge explanations for each decision

Side by Side Comparison of Outputs

Wrap Up

This flow — import eval data → define a task → run experiments → change model → re-run — is the standard loop for benchmarking model performance with Patronus.

It extends naturally to comparing prompts, temperatures, or retrieved-context variants on real-world tasks.
