Quick Start: Run an Experiment 🪄

An experiment is an end-to-end batched evaluation that helps us answer questions such as "Is GPT-4o or Claude better for my task?" or "Does adding this sentence to my system prompt increase task accuracy?"

An experiment consists of several components:

  • Dataset: The inputs to the AI application that we want to test.
  • Task: The code that executes the workflow we are testing. A task is a functional unit of work in your AI workflow, for example:
    • An LLM call in an agent execution
    • A retriever that returns documents for a user query
    • Text chunks from a PDF parsing library
    • Results of a DB query
    • Response from an external tool call
  • Evaluators: The criteria that we evaluate against, such as similarity to the gold label.
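
In code, a dataset is simply a collection of rows, each pairing an input with the answer we expect. The sketch below shows that shape using the same field names (evaluated_model_input, evaluated_model_gold_answer) that appear in the full example in step 3; the rows themselves are placeholder question/answer pairs for illustration.

dataset = [
    {
        "evaluated_model_input": "What is the capital of France?",
        "evaluated_model_gold_answer": "Paris",
    },
    {
        "evaluated_model_input": "What is 2 + 2?",
        "evaluated_model_gold_answer": "4",
    },
]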

1. Install Patronus Module

You can use our Python SDK to run batched evaluations and track experiments. If you prefer to run batched evaluations in a different language, follow our API reference guide.

pip install patronus

2. Set Environment Variables

If you do not have a Patronus API Key, see our quick start here for how to create one.

export PATRONUS_API_KEY=<YOUR_API_KEY>

3. Run an Experiment

Let's run a simple experiment that performs a basic transformation on the input and checks whether the output matches the gold answer.

import os
from patronus import Client, task, evaluator, Row, TaskResult

client = Client(
    # This is the default and can be omitted
    api_key=os.environ.get("PATRONUS_API_KEY"),
)

@task
def my_task(row: Row):
    # A toy "AI system": append " world" to the dataset input
    return f"{row.evaluated_model_input} world"


@evaluator
def exact_match(row: Row, task_result: TaskResult):
    # Pass only when the task output exactly matches the gold answer
    return task_result.evaluated_model_output == row.evaluated_model_gold_answer


client.experiment(
    "Tutorial Project",
    experiment_name="Hello World Experiment",
    dataset=[
        {
            "evaluated_model_input": "Hello",
            "evaluated_model_gold_answer": "Hello World",
        },
    ],
    task=my_task,
    evaluators=[exact_match],
)

Here we have defined a few concepts:

  • A Dataset containing an array of inputs and gold answers used in our evaluation
  • A Task representing our AI system. You can replace this with your LLM call, as sketched below.
  • A simple Evaluator that checks whether the output matches the gold answer
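
To make this more realistic, you can replace the toy task with an actual LLM call inside the @task function. The following is a minimal sketch, assuming you use the OpenAI Python SDK (pip install openai) with OPENAI_API_KEY set in your environment; the model name and the openai_client variable are placeholders, not part of the Patronus SDK.

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment


@task
def my_llm_task(row: Row):
    # Send the dataset input to the model and return its text response
    response = openai_client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[{"role": "user", "content": row.evaluated_model_input}],
    )
    return response.choices[0].message.content

Pass my_llm_task as the task argument to client.experiment and the evaluators will score the model's responses against your gold answers.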

4. View Experiment Results

Your experiment run will generate a summary report and a link to the platform.

Opening the link to the Experiments page, we see that the experiment ran successfully but includes a failing evaluation: the task output "Hello world" does not exactly match the gold answer "Hello World".


You can view and manage all experiments in the Experiments UI.
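
Since the failure is just a casing mismatch, a natural next step is to tweak the task (or the evaluator) and re-run under a new experiment name so the runs can be compared side by side. Below is a minimal sketch that reuses the exact_match evaluator and dataset from step 3; the new task and experiment name are illustrative.

@task
def my_fixed_task(row: Row):
    # Capitalize "World" so the output matches the gold answer exactly
    return f"{row.evaluated_model_input} World"


client.experiment(
    "Tutorial Project",
    experiment_name="Hello World Experiment v2",
    dataset=[
        {
            "evaluated_model_input": "Hello",
            "evaluated_model_gold_answer": "Hello World",
        },
    ],
    task=my_fixed_task,
    evaluators=[exact_match],
)

Comparing the two runs in the Experiments UI should then show the second run passing the exact_match check.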

That's it! You can modify the task definition or evaluation criteria and compare the performance of different experiment runs. Happy experimenting 🚀