Quick Start - Run an experiment
An Experiment is a group of Evaluations, typically executed on a Dataset. Each experiment consists of the following components:
- Dataset
- Evaluation criteria, e.g. exact match or LLM judge metrics
- Task (optional)
Experiments can help answer questions like "Is GPT-4o or Claude more performant on my task?", or "Does modifying my system prompt increase task accuracy?"
The experiments view supports aggregation over a number of metrics to compare differences in LLM performance on the same dataset. The visualization below shows exact match accuracy on SimpleQA for different models across experiments. We see that accuracy is affected by the choice of underlying model (Claude Sonnet vs. OpenAI GPT-4o), as well as by other factors such as updates to prompts and generation parameters.
Run an Experiment
You can kick off experiments and quickly start iterating on LLM performance with just one script execution. You can skip to the full code here.
1. Install Patronus Module
You can use our Python SDK to run batched evaluations and track experiments. If you prefer to run batched evaluations in a different language, follow our API reference guide.
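If you are working in Python, the SDK is typically installed with pip (for example, `pip install patronus`); check the installation guide for the current package name and supported Python versions.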
2. Configure an API key
3. Set Environment Variables
If you do not have a Patronus API Key, see our quick start here for how to create one.
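As a minimal sketch, and assuming the SDK reads your key from a `PATRONUS_API_KEY` environment variable (confirm the exact variable name in the SDK reference), you can export it in your shell or set it from Python before running the experiment script:

```python
import os

# Assumption: the Patronus SDK picks up the API key from this environment
# variable. Confirm the exact variable name in the SDK reference.
os.environ["PATRONUS_API_KEY"] = "<your-patronus-api-key>"
```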
4. Collect the following inputs for evaluation
- Dataset: The inputs required for evaluation. You can load a dataset on Patronus in any of the following ways, or use a Patronus hosted dataset (a small inline sample is also sketched after this list).
- Uploading through the platform
- Uploading with data adaptors
- Using a Patronus dataset
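For quick iteration, you can also define a handful of samples inline in code. The sketch below uses a plain list of dictionaries; the field names (`input`, `gold_answer`) are illustrative, so map them to whatever schema your task and evaluators expect:

```python
# Illustrative inline dataset: each sample pairs a user query with a
# reference answer. Field names are hypothetical; align them with the
# schema expected by your task and evaluators.
dataset = [
    {"input": "What is the capital of France?", "gold_answer": "Paris"},
    {"input": "How many continents are there?", "gold_answer": "7"},
    {"input": "Who wrote 'Pride and Prejudice'?", "gold_answer": "Jane Austen"},
]
```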
- Task (optional): The task definition is needed when a dataset does not contain AI outputs to be evaluated. For example, a dataset might contain user queries but not the generations of your AI system. While the output is typically generated by an LLM, it can come from any part of your AI system, such as the retrieved contexts or processed user queries.
The task executes the workflow that we are testing. A task is a functional unit of work in your AI workflow, for example:
- An LLM call in an agent execution
- A retriever that returns documents for a user query
- Text chunks from a PDF parsing library
- Results of a DB query
- Response from an external tool call
Here is an example showing a model call that returns an output for each user query. More here
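A minimal sketch of such a task is shown below, assuming an OpenAI-style chat client and the illustrative `input` field from the inline dataset above; the exact task signature Patronus expects (and any decorator it provides) is documented in the SDK reference:

```python
from openai import OpenAI

oai_client = OpenAI()

def answer_question(row: dict) -> str:
    """Task under test: send the user query to a model and return its output.

    The `row` argument and its `input` field are illustrative; adapt them to
    the task interface described in the Patronus SDK reference.
    """
    response = oai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": row["input"]}],
    )
    return response.choices[0].message.content
```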
- Evaluators: The evaluation criteria used to assess our task.
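As a sketch, a custom exact-match evaluator can be a small function that compares the task output against the reference answer; the function shape below is illustrative, and how evaluators are registered with the SDK (e.g. via a decorator) is covered in the SDK reference:

```python
def exact_match(task_output: str, gold_answer: str) -> bool:
    """Illustrative exact-match criterion: pass when the model output equals
    the reference answer after normalizing whitespace and case."""
    return task_output.strip().lower() == gold_answer.strip().lower()
```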
5. Run an Experiment
Plug in the dataset, evaluators, and task (optional) you defined in step 4. Here is an example script where a custom task, evaluator, and a few dataset samples have already been defined.
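The sketch below reuses the dataset, task, and evaluator defined in step 4. The import path and the experiment call itself are assumptions about the SDK's entry point, so treat it as a template and confirm the current signature in the SDK reference:

```python
# Illustrative experiment launch, reusing the dataset, task, and evaluator
# sketched in step 4. The import and the experiment call signature are
# assumptions; confirm the current entry point in the Patronus SDK reference.
from patronus import Client  # assumed import path

client = Client()  # assumed to read PATRONUS_API_KEY from the environment

client.experiment(
    "quickstart-experiment",   # experiment/project name (illustrative)
    data=dataset,              # samples defined in step 4
    task=answer_question,      # task defined in step 4
    evaluators=[exact_match],  # evaluator defined in step 4
)
```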
Your experiment run will generate a summary report and a link to the platform within your code editor.
6. Compare Experiment Outputs
You can compare historical experiments in Comparisons to get useful insights. Run through this process to create comparisons against datasets, models, evaluators, and tasks. For example, we see that GPT-4o-mini performs worse than Claude 3.5 Sonnet in our experiments.
You can view row-wise output differences as well! This highlights differences in LLM outputs across runs. For example, in this case GPT-4o-mini returned the wrong answer, but Claude 3.5 answered correctly.