Description

Quick Start - Run an experiment

An Experiment is a group of Evaluations, typically executed on a Dataset. Each experiment consists of the following components:

  • Dataset
  • Evaluation criteria, e.g. exact match or LLM judge metrics
  • Task (optional)

Experiments help answer questions like "Does GPT-4o or Claude perform better on my task?" or "Does modifying my system prompt improve task accuracy?"

The experiments view aggregates metrics so you can compare LLM performance on the same dataset. The visualization below shows exact-match accuracy on SimpleQA for different models across experiments. Accuracy is affected both by the choice of underlying model (Claude Sonnet vs. OpenAI GPT-4o) and by other factors such as updates to prompts and generation parameters.

Run an Experiment

You can kick off experiments and start iterating on LLM performance with a single script. You can skip to the full code here.

1. Install Patronus Module

You can use our Python SDK to run batched evaluations and track experiments. If you prefer to run batched evaluations in a different language, follow our API reference guide.

pip install patronus

2. Configure an API key

If you do not have a Patronus API Key, see our quick start here for how to create one.

3. Set Environment Variables

export PATRONUS_API_KEY=<YOUR_API_KEY>
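
The SDK picks up PATRONUS_API_KEY from the environment when you construct the client. If you prefer to configure the key in code, here is a minimal sketch; passing api_key directly to Client is an assumption here, so check the SDK reference for your version.

Python
import os
from patronus import Client

# Sketch: explicit key configuration. The api_key argument is an assumption;
# by default the client reads PATRONUS_API_KEY from the environment.
cli = Client(api_key=os.environ["PATRONUS_API_KEY"])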

4. Collect the following inputs for evaluation

  • Dataset: The inputs required for evaluation. You can load your own dataset into Patronus or use a Patronus-hosted dataset (see the sketch after this list).
  • Task (optional): The task definition is needed when a dataset does not contain the AI outputs to be evaluated. For example, a dataset might contain user queries but not the generations of your AI system. While the output is typically generated by an LLM, it can be any part of your AI system, such as the retrieved contexts or processed user queries.
    The task executes the workflow that we are testing. A task is a functional unit of work in your AI workflow. For example,
    • An LLM call in an agent execution
    • A retriever that returns documents for a user query
    • Text chunks from a PDF parsing library
    • Results of a DB query
    • Response from an external tool call
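
For the dataset, you can reference a Patronus-hosted dataset by name (as in step 5 below) or pass your own samples directly. The sketch below shows a small in-memory dataset; the field names mirror the task and evaluator signatures used on this page and are assumptions, so check the SDK reference for the exact schema your version expects.

Python
# Sketch: a small in-memory dataset. Field names follow the evaluated_model_*
# convention used by the task below; verify the expected schema in the SDK reference.
my_dataset = [
    {
        "evaluated_model_input": "What is the capital of France?",
        "evaluated_model_retrieved_context": ["France is a country in Europe. Its capital is Paris."],
        "evaluated_model_gold_answer": "Paris",
    },
    {
        "evaluated_model_input": "Who wrote Hamlet?",
        "evaluated_model_retrieved_context": ["Hamlet is a tragedy written by William Shakespeare."],
        "evaluated_model_gold_answer": "William Shakespeare",
    },
]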

Here is an example task that calls a model to generate outputs from the user queries. More here

Python
from openai import OpenAI
from patronus import Client, task, Row, TaskResult, evaluator
 
oai = OpenAI()
cli = Client()
 
@task
def call_gpt(evaluated_model_input: str, evaluated_model_retrieved_context: list[str]) -> str:
    model = "gpt-4o-mini"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation."},
                {
                    "role": "user",
                    "content": f"Question: {evaluated_model_input}\nContext: {evaluated_model_retrieved_context}"
                },
            ],
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output
  • Evaluators: The evaluation criteria used to assess our task.

    • Define an evaluator in code (a minimal sketch follows this list)
      • See a simple example here and a more complex example here
    • Define an evaluator on the platform by navigating to "Define your own Criteria" on the "Evaluators" tab, then entering the evaluation criteria prompt. Read more here
    • Use an existing Patronus evaluator from the platform. More details here
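
For the first option, here is a minimal sketch of a code-defined evaluator using the @evaluator decorator imported above; the exact set of supported argument names is an assumption based on the evaluated_model_* convention used on this page, so see the linked examples for the full interface.

Python
from patronus import evaluator

@evaluator
def exact_match(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    # Pass when the normalized model output matches the gold answer exactly.
    return evaluated_model_output.strip().lower() == evaluated_model_gold_answer.strip().lower()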

5. Run an Experiment

Plug in the dataset, evaluator, and optional task you defined in step 4. Here is an example script that uses a custom task, a Patronus-hosted dataset (FinanceBench), and a Patronus evaluator.

Python
 
from openai import OpenAI
from patronus import Client, task
 
oai = OpenAI()
cli = Client()
 
@task
def call_gpt(evaluated_model_input: str, evaluated_model_retrieved_context: list[str]) -> str:
    model = "gpt-4o"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant. Please keep your responses very short and to the point."},
                {
                    "role": "user",
                    "content": f"Question: {evaluated_model_input}\nContext: {evaluated_model_retrieved_context}"
                },
            ],
            temperature=0.9
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output
 
financebench_dataset = cli.remote_dataset("financebench")
fuzzy_match = cli.remote_evaluator("judge-small", "patronus:fuzzy-match")
 
# The framework will handle loading automatically when passed to an experiment
results = cli.experiment(
    "GPT-4o",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt-4o", "temperature": "0.9"},
    experiment_name="FinanceBench Experiment",
)
df = results.to_dataframe()
df.to_json("results.json", orient="records")

Your experiment run will print a summary report and a link to the results on the platform in your console output.
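
Beyond the exported JSON, you can also inspect the dataframe locally for a quick summary. The column names below are assumptions; print df.columns to see the exact fields your SDK version returns.

Python
# Sketch: quick local inspection of the results dataframe from the script above.
print(df.columns.tolist())
# A "pass" column is an assumption; compute a pass rate only if it exists.
if "pass" in df.columns:
    print(f"Pass rate: {df['pass'].mean():.2%}")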

6. Compare Experiment Outputs

You can compare historical experiments in Comparisons to get useful insights. This lets you create comparisons across datasets, models, evaluators, and tasks. For example, we see that GPT-4o-mini performs worse than Claude 3.5 Sonnet in our experiments.

You can also view row-wise output differences, which highlight how LLM outputs vary across runs. In this case, for example, GPT-4o-mini returned the wrong answer, while Claude 3.5 Sonnet answered correctly.

7. View logs after running an experiment
