Our Python SDK got smarter, and we have developed a TypeScript SDK too. We are updating our SDK code blocks. Python SDK here. TypeScript SDK here.

Quick Start - Run an experiment

An Experiment is a group of Evaluations, typically executed on a Dataset. Each experiment consists of the following components:

  • Dataset
  • Evaluation criteria (e.g., exact match, LLM judge metrics)
  • Task (optional)

Experiments can help answer questions like "Is GPT-4o or Claude more performant on my task?", or "Does modifying my system prompt increase task accuracy?"

The experiments view supports aggregation over a number of metrics to compare differences in LLM performance on the same dataset. The visualization below shows exact match accuracy on SimpleQA for different models across experiments. We see that accuracy is affected by the choice of underlying model (Claude Sonnet vs. OpenAI GPT-4o), as well as by other factors such as updates to prompts and generation parameters.

Run an Experiment

You can kick off experiments and quickly start iterating on LLM performance with just one script execution. You can skip to the full code here.

1. Install Patronus SDK

You can use our Python SDK to run batched evaluations and track experiments. If you prefer to run batched evaluations in a different language, follow our API reference guide.

pip install patronus

2. Configure an API key

If you do not have a Patronus API Key, see our quick start here for how to create one.

3. Set Environment Variables

export PATRONUS_API_KEY=<YOUR_API_KEY>

4. Collect the following inputs for evaluation

  • Dataset: The inputs required for evaluation. You can load your own dataset into Patronus in several ways or use a Patronus-hosted dataset.
  • Task (optional): The task definition is needed when a dataset does not contain AI outputs to be evaluated. For example, a dataset might contain user queries but not the generations of your AI system. While the output is typically generated by an LLM, it can come from any part of your AI system, such as retrieved contexts or processed user queries.
    The task executes the workflow under test; it is a functional unit of work in your AI workflow. For example:
    • An LLM call in an agent execution
    • A retriever that returns documents for a user query
    • Text chunks from a PDF parsing library
    • Results of a DB query
    • Response from an external tool call

Here is an example task that calls a model to generate outputs from the user queries. More here

from openai import OpenAI
from patronus.experiments import Row
 
# Initialize OpenAI client
oai = OpenAI()
 
# Define a task function
def call_gpt(row: Row, **kwargs) -> str:
    model = "gpt-4o-mini"
    response = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation."},
                {
                    "role": "user",
                    "content": f"Question: {row.task_input}\nContext: {row.task_context}"
                },
            ],
        )
        .choices[0]
        .message.content
    )
    return response
  • Evaluators: The evaluation criteria used to assess our task.

    • Define an evaluator in code (see the sketch after this list)
      • See a simple example here and a more complex example here
    • Define an evaluator on the platform by navigating to "Define your own Criteria" on the "Evaluators" tab, then enter the evaluation criteria prompt. Read more here
    • Use an existing Patronus evaluator from the platform. More details here
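For reference, here is a minimal sketch of a code-defined evaluator wired up with FuncEvaluatorAdapter. It is illustrative rather than canonical: it assumes the task result exposes an output attribute and that the dataset's gold_answer field is accessible on the row, so check the SDK reference for the exact signatures expected by the adapter.

from patronus.evals import evaluator
from patronus.experiments import FuncEvaluatorAdapter, Row, TaskResult
 
# A simple code-defined evaluator: case-insensitive exact match against the gold answer.
# Assumes task_result.output holds the task's string output and row.gold_answer holds the label.
@evaluator()
def exact_match(row: Row, task_result: TaskResult, **kwargs) -> bool:
    if task_result is None or task_result.output is None:
        return False
    return task_result.output.strip().lower() == row.gold_answer.strip().lower()
 
# Pass function evaluators to run_experiment wrapped in the adapter, e.g.:
# evaluators=[FuncEvaluatorAdapter(exact_match)]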

5. Run an Experiment

Plug in the dataset, evaluator, and task (optional) you defined in #4. Here is an example script where a custom task, evaluator, and a few dataset samples have already been defined.

import os
from openai import OpenAI
from patronus import init
from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment, FuncEvaluatorAdapter
 
# Initialize Patronus and OpenAI
init(api_key=os.environ.get("PATRONUS_API_KEY"))
oai = OpenAI()
 
# Define a task function
def call_gpt(row, **kwargs):
    model = "gpt-4o"
    response = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant. Please keep your responses very short and to the point."},
                {
                    "role": "user",
                    "content": f"Question: {row.task_input}\nContext: {row.task_context}"
                },
            ],
            temperature=0.9
        )
        .choices[0]
        .message.content
    )
    return response
 
# Create a sample dataset
dataset = [
    {
        "task_input": "What is the capital of France?",
        "task_context": "France is a country in Western Europe with several overseas territories.",
        "gold_answer": "Paris"
    },
    {
        "task_input": "When was the Declaration of Independence signed?",
        "task_context": "The American Revolution led to independence from Great Britain.",
        "gold_answer": "July 4, 1776"
    }
]
 
# Set up evaluators
fuzzy_match = RemoteEvaluator("judge", "patronus:fuzzy-match")
 
# Run the experiment
experiment = run_experiment(
    dataset=dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "qa_sample", "model": "gpt-4o", "temperature": "0.9"},
    experiment_name="GPT-4o Experiment"
)
 
# Export results
df = experiment.to_dataframe()
df.to_json("results.json", orient="records")
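Because to_dataframe returns a standard pandas DataFrame, you can also export the results with any other pandas writer, for example:

# Alternative export: write the same results to a CSV file via pandas.
df.to_csv("results.csv", index=False)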

Your experiment run will generate a summary report and a link to the platform within your code editor.

6. Compare Experiment Outputs

You can compare historical experiments in Comparisons to get useful insights. Run through this process to create comparisons against datasets, models, evaluators, and tasks. For example, we see that GPT-4o-mini performs worse than Claude 3.5 Sonnet in our experiments.

You can view row-wise output differences as well! This highlights differences in LLM outputs across runs. For example, in this case GPT-4o-mini returned the wrong answer, but Claude 3.5 answered correctly.

7. View logs after running an experiment
