Our Python SDK got smarter, and we developed a TypeScript SDK too. We are updating our SDK code blocks accordingly. Python SDK here. TypeScript SDK here.
Customer Flows
Benchmarking Models
Frontier labs continue to release new models with claims of improved benchmark performance. As these models emerge, application developers need reliable ways to evaluate how they perform on both standard benchmarks and their own task-specific datasets.
Patronus Experiments enables developers to compare models and prompts side by side using traditional evaluations such as SWEBench, MMLU, and Humanity’s Last Exam, or with custom golden data brought into the platform.
We’ll use the OpenAI SDK along with Patronus. Start by importing the required packages, initializing a Patronus project, and instrumenting OpenAI so all requests are traced.
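The sketch below shows one way to wire this up. It assumes the Patronus SDK's `patronus.init()` accepts an `integrations` list of OpenInference-style instrumentors, and that `RemoteDatasetLoader`, `Prompt`, `push_prompt`, and `load_prompt` live under `patronus.datasets` and `patronus.prompts`; check the SDK docs for the exact entry points. The project name is a placeholder.

```python
import textwrap

from openai import OpenAI
from openinference.instrumentation.openai import OpenAIInstrumentor

import patronus
from patronus.datasets import RemoteDatasetLoader
from patronus.prompts import Prompt, push_prompt, load_prompt

PROJECT_NAME = "financebench-benchmarking"  # placeholder project name

# Initialize Patronus and instrument the OpenAI client so every request is traced.
patronus.init(
    project_name=PROJECT_NAME,
    integrations=[OpenAIInstrumentor()],  # assumption: OpenInference-style integration hook
)

oai = OpenAI()  # reads OPENAI_API_KEY from the environment
```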
Next, we’ll load FinanceBench, an eval created in-house by the Patronus research team.
You can also use standard benchmarks like MMLU or define your own golden dataset.
```python
# Load a dataset from the Patronus platform using its name
fb_remote_dataset = RemoteDatasetLoader("financebench")
```
Next, we'll define our system and user prompts and push them to the Patronus platform.
```python
default_system = 'You are an expert at answering financial questions. Use the given context to answer.'
default_user = textwrap.dedent("Context:\n{task_context}\n\nUser question: {task_input}")

# Create new prompts
system_prompt = Prompt(
    name=f"{PROJECT_NAME}/question-answering/system",
    body=default_system,
    description="System prompt for RAG QA chatbot",
)
user_prompt = Prompt(
    name=f"{PROJECT_NAME}/question-answering/user",
    body=default_user,
    description="User prompt for RAG QA chatbot",
)

# Push the prompts to Patronus
loaded_prompt_system = push_prompt(system_prompt)
loaded_prompt_user = push_prompt(user_prompt)

# Pull prompts to use as model inputs
system_prompt = load_prompt(name=f"{PROJECT_NAME}/question-answering/system")
user_prompt = load_prompt(name=f"{PROJECT_NAME}/question-answering/user")
```
Next, we’ll write a simple task that calls the OpenAI API. The task uses the model input and retrieved context for each row of eval data to generate a model response.
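A sketch of that task, plus the two experiment runs compared below, might look like the following. The row attributes (`task_input`, `task_context`) mirror the prompt placeholders above, but treat the exact `run_experiment` signature, the `tags` argument, and the `make_qa_task` helper as assumptions to verify against the Patronus SDK docs; evaluator configuration is omitted for brevity.

```python
from patronus.experiments import run_experiment


def make_qa_task(model_name, **gen_kwargs):
    """Build a task that answers each row's question with the given OpenAI model."""

    def qa_task(row, **kwargs):
        # Fill the pulled prompt templates with this row's retrieved context and question.
        # Accessing prompt text via `.body` and row fields as attributes is an assumption.
        response = oai.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": system_prompt.body},
                {
                    "role": "user",
                    "content": user_prompt.body.format(
                        task_context=row.task_context,
                        task_input=row.task_input,
                    ),
                },
            ],
            **gen_kwargs,
        )
        return response.choices[0].message.content

    return qa_task


# Run the same experiment once per model, tagging each run so the two can be
# selected side by side in the Patronus UI.
for model_name in ["gpt-4.1", "gpt-5"]:
    run_experiment(
        dataset=fb_remote_dataset,
        task=make_qa_task(model_name),
        # evaluators=[...],  # add Patronus evaluators here to score answers against gold data
        tags={"model": model_name},
    )
```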
After both experiments are complete, we can compare results in the Patronus UI. By adding two snapshots and filtering to select our experiments, we see that, surprisingly, GPT-4.1 outperformed GPT-5 on this domain-specific eval.
This flow — import eval data → define a task → run experiment → change model → re-run — is the standard loop for benchmarking model performance with Patronus.
It can also be extended to measure how different prompts, temperatures, or retrieved context affect performance on real-world tasks.
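For instance, a hypothetical temperature sweep could reuse the same helper (GPT-4.1 shown here; the tag values and generation kwargs are illustrative):

```python
# Hypothetical extension: re-run the experiment at different sampling temperatures
# and tag each run so snapshots can be filtered by temperature as well as model.
for temperature in [0.0, 0.7]:
    run_experiment(
        dataset=fb_remote_dataset,
        task=make_qa_task("gpt-4.1", temperature=temperature),
        tags={"model": "gpt-4.1", "temperature": str(temperature)},
    )
```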