
Benchmarking Models

Frontier labs continue to release new models with claims of improved benchmark performance. As these models emerge, application developers need reliable ways to evaluate how they perform on both standard benchmarks and their own task-specific datasets.

Patronus Experiments lets developers compare models and prompts side by side on established benchmarks such as SWE-bench, MMLU, and Humanity’s Last Exam, or on custom golden data brought into the platform.


0. Initialize Environment

We’ll use the OpenAI SDK along with Patronus. Start by importing the required packages, initializing a Patronus project, and instrumenting OpenAI so all requests are traced.

# --------------------------
# Patronus tracing
# --------------------------
from openinference.instrumentation.openai import OpenAIInstrumentor
import patronus
from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
from patronus.prompts import Prompt, push_prompt, load_prompt
from patronus.datasets import RemoteDatasetLoader
 
import os
import textwrap
 
# --------------------------
# OpenAI client
# --------------------------
from openai import OpenAI
 
 
OPENAI_API_KEY = ""
if not OPENAI_API_KEY:
    raise RuntimeError("Please set OPENAI_API_KEY in your environment.")
 
client = OpenAI(api_key=OPENAI_API_KEY)
 
PROJECT_NAME = "cf-benchmark-models"
patronus.init(integrations=[OpenAIInstrumentor()], project_name=PROJECT_NAME)
log = patronus.get_logger()

1. Load Eval Data

Next, we’ll load FinanceBench, an eval created in-house by the Patronus research team. You can also use standard benchmarks like MMLU or define your own golden dataset.

# Load a dataset from the Patronus platform using its name
fb_remote_dataset = RemoteDatasetLoader("financebench")
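
If you want to benchmark against your own golden data instead, you can build the dataset in code. The snippet below is a minimal, hypothetical sketch; it assumes run_experiment also accepts a list of dicts whose fields match the row attributes used later (task_input, task_context, gold_answer).

# Hypothetical hand-built golden dataset. Field names mirror the row
# attributes the task and evaluator read: task_input, task_context, gold_answer.
custom_golden_dataset = [
    {
        "task_input": "What was the FY2022 operating margin?",
        "task_context": "Excerpt from the FY2022 annual report: ...",
        "gold_answer": "Approximately 18%",
    },
    {
        "task_input": "How much cash was held at year end?",
        "task_context": "Excerpt from the FY2022 balance sheet: ...",
        "gold_answer": "$4.2 billion",
    },
]
# Pass it to run_experiment as dataset=custom_golden_dataset in place of the remote loader.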

2. Load Prompts and Define Experiment Task

Next, we'll define our system and user prompts, push them to the Patronus platform, and load them back for use as model inputs.

default_system = 'You are an expert at answering financial questions. Use the given context to answer.'
default_user = textwrap.dedent("Context:\n{task_context}\n\nUser question: {task_input}")
 
# Create a new prompt
system_prompt = Prompt(
    name=f"{PROJECT_NAME}/question-answering/system",
    body=default_system,
    description="System prompt for RAG QA chatbot",
)
 
user_prompt = Prompt(
    name=f"{PROJECT_NAME}/question-answering/user",
    body=default_user,
    description="System prompt for RAG QA chatbot",
)
 
# Push the prompts to Patronus
loaded_prompt_system = push_prompt(system_prompt)
loaded_prompt_user = push_prompt(user_prompt)
 
# Pull prompts to use as model inputs
system_prompt = load_prompt(name=f"{PROJECT_NAME}/question-answering/system")
user_prompt = load_prompt(name=f"{PROJECT_NAME}/question-answering/user")

Next, we’ll write a simple task that calls the OpenAI API. The task uses the model input and retrieved context for each row of eval data to generate a model response.

def qa_task(row, **kwargs) -> str:
    """Query GPT-5 with the row's retrieved context."""
    response = client.chat.completions.create(
        model="gpt-5",
        messages=[
            {"role": "system", "content": system_prompt.render()},
            {"role": "user", "content": user_prompt.render(
                task_input=row.task_input,
                task_context=row.task_context
            )}
        ],
    )
    return response.choices[0].message.content.strip()

3. Run Experiment

We can now run an experiment to track how many eval questions GPT-5 gets right, using a fuzzy match evaluator to compare outputs against gold answers.

Notice that we add tags and a specific experiment name to distinguish this run from future runs.

# Use await for Jupyter-notebook-friendly handling
await run_experiment(
    project_name=PROJECT_NAME,
    dataset=fb_remote_dataset,
    task=qa_task,
    evaluators=[
        RemoteEvaluator("judge", "patronus:fuzzy-match").load(),    
    ],
    tags={"dataset-id": "financebench", "model": "GPT-5"},
    experiment_name="Benchmark GPT-5 on FinanceBench",
)

With results for GPT-5 in hand, we can compare to GPT-4.1 by redefining the task and running another experiment with updated tags.

def qa_task(row, **kwargs) -> str:
    """query GPT-4.1 with context"""
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[
            {"role": "system", "content": system_prompt.render()},
            {"role": "user", "content": user_prompt.render(
                task_input=row.task_input,
                task_context=row.task_context
            )}
        ],
    )
    return response.choices[0].message.content.strip()
 
await run_experiment(
    project_name=PROJECT_NAME,
    dataset=fb_remote_dataset,
    task=qa_task,
    evaluators=[
        RemoteEvaluator("judge", "patronus:fuzzy-match").load(),    
    ],
    tags={"dataset-id": "financebench", "model": "GPT-4.1"},
    experiment_name="Benchmark GPT-4.1 on FinanceBench",
)

4. View Comparison in Patronus UI

After both experiments are complete, we can compare results in the Patronus UI. By adding two snapshots and using filters to select our experiments, we see that, surprisingly, GPT-4.1 outperformed GPT-5 on this domain-specific eval.

Compare Results

We can also:

  • Compare outputs side-by-side
  • Add preferences
  • View judge explanations for each decision

Side by Side Comparison of Outputs

Wrap Up

This flow — import eval data → define a task → run experiment → change model → re-run — is the standard loop for benchmarking model performance with Patronus.

It can also be extended to measure how different prompts, temperatures, or retrieved context affect performance on real-world tasks.
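
For example, here is a minimal sketch of a temperature sweep. It assumes the client, prompts, dataset loader, and evaluator defined above are still in scope, and that the chosen model accepts the temperature parameter.

def make_qa_task(model: str, temperature: float):
    """Build a task that queries `model` at a given sampling temperature."""
    def qa_task(row, **kwargs) -> str:
        response = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[
                {"role": "system", "content": system_prompt.render()},
                {"role": "user", "content": user_prompt.render(
                    task_input=row.task_input,
                    task_context=row.task_context
                )}
            ],
        )
        return response.choices[0].message.content.strip()
    return qa_task
 
# Re-run the same experiment at a higher temperature, tagged so the runs
# stay distinguishable in the Patronus UI.
await run_experiment(
    project_name=PROJECT_NAME,
    dataset=fb_remote_dataset,
    task=make_qa_task("gpt-4.1", temperature=0.7),
    evaluators=[RemoteEvaluator("judge", "patronus:fuzzy-match").load()],
    tags={"dataset-id": "financebench", "model": "GPT-4.1", "temperature": "0.7"},
    experiment_name="Benchmark GPT-4.1 on FinanceBench (temperature 0.7)",
)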
