Evaluating and Comparing LLMs
An experiment is an end-to-end batched evaluation that helps us answer questions such as "Is GPT-4o or Claude better for my task?" or "Does adding this sentence to my system prompt increase task accuracy?"
An experiment consists of several components:
- Dataset: The inputs to the AI application that we want to test.
- Task: The workflow that we are testing. An example task is to query GPT-4o-mini with a system and user prompt.
- Evaluators: The criteria we evaluate against, such as similarity to the gold answer.
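For instance, a single dataset row in this tutorial pairs a question with its gold answer. The field names below are illustrative assumptions, not necessarily the exact schema the SDK expects; check the SDK reference for the required fields.

```python
# One dataset row: the input we send to the task and the expected (gold) answer.
# Field names here are assumptions, shown only to illustrate the structure.
row = {
    "evaluated_model_input": "What is the chemical symbol for potassium?",
    "evaluated_model_gold_answer": "K",
}
```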
1. Install Patronus Module
You can use our Python SDK to run batched evaluations and track experiments. If you prefer to run batched evaluations in a different language, follow our API reference guide.
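Install the SDK from PyPI (the package name patronus is assumed here; we also install the OpenAI client, since the task later in this tutorial queries OpenAI models):

```bash
pip install patronus openai
```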
2. Set Environment Variables
If you do not have a Patronus AI API Key, see our quick start here for how to create one. The only additional key you need for this tutorial is an OpenAI API Key, since we are evaluating OpenAI models.
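For example, in your shell (assuming the SDK reads its key from PATRONUS_API_KEY; the OpenAI client reads OPENAI_API_KEY):

```bash
export PATRONUS_API_KEY="<your-patronus-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
```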
3. Run an Experiment
Let's run a simple experiment that quizzes GPT-4o-mini on some science questions and checks if the output matches the correct answer.
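Here is a minimal sketch of such an experiment. Treat the names as assumptions rather than the definitive API: the Client entry point, the @task and @evaluator decorators, the client.experiment arguments, and the evaluated_model_* field names follow the pattern described in this guide but may differ from the current SDK, and the two quiz questions are placeholders for your own dataset. Consult the SDK reference for exact signatures.

```python
from openai import OpenAI
from patronus import Client, task, evaluator

oai = OpenAI()     # reads OPENAI_API_KEY from the environment
client = Client()  # reads PATRONUS_API_KEY from the environment

# Dataset: science questions paired with their gold answers (placeholder items).
dataset = [
    {
        "evaluated_model_input": "What is the chemical symbol for potassium?",
        "evaluated_model_gold_answer": "K",
    },
    {
        "evaluated_model_input": "How many chromosomes are in a typical human cell?",
        "evaluated_model_gold_answer": "46",
    },
]

# Task: query GPT-4o-mini with the input from each dataset row.
@task
def query_gpt_4o_mini(evaluated_model_input: str) -> str:
    response = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer with a single word or number."},
            {"role": "user", "content": evaluated_model_input},
        ],
    )
    return response.choices[0].message.content

# Evaluator: exact match between the model output and the gold answer.
@evaluator
def exact_match(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    return evaluated_model_output.strip().lower() == evaluated_model_gold_answer.strip().lower()

# Run the batched evaluation and log it as an experiment.
client.experiment(
    "LLM Comparison Tutorial",
    data=dataset,
    task=query_gpt_4o_mini,
    evaluators=[exact_match],
    tags={"model": "gpt-4o-mini"},
    experiment_name="science-quiz-gpt-4o-mini",
)
```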
Here we have defined a few concepts:
- A Dataset containing an array of inputs and gold answers, used to evaluate our LLM system
- A Task that queries GPT-4o-mini with inputs from our dataset
- A simple Evaluator that checks whether the output matches the gold answer
4. View Experiment Results
Your experiment run should have generated a link to the platform. You can view experiment results in the Logs UI.
We see that GPT-4o-mini got both questions incorrect! Let's see if we can improve the LLM's performance.
5. Run Another Experiment
We will run the same experiment with GPT-4o and see if performance is different. To do this, simply swap out the model in the @task definition. Remember to update tags and experiment_name if you are tracking the model version in metadata.
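Concretely, the change to the sketch from step 3 might look like the following (the argument names remain illustrative assumptions):

```python
# Same experiment, but the task now calls GPT-4o instead of GPT-4o-mini.
@task
def query_gpt_4o(evaluated_model_input: str) -> str:
    response = oai.chat.completions.create(
        model="gpt-4o",  # swapped from "gpt-4o-mini"
        messages=[
            {"role": "system", "content": "Answer with a single word or number."},
            {"role": "user", "content": evaluated_model_input},
        ],
    )
    return response.choices[0].message.content

client.experiment(
    "LLM Comparison Tutorial",
    data=dataset,
    task=query_gpt_4o,
    evaluators=[exact_match],
    tags={"model": "gpt-4o"},               # updated model tag
    experiment_name="science-quiz-gpt-4o",  # updated experiment name
)
```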
We see that GPT-4o got the second question correct.
6. Compare Experiments
Let's see aggregate stats and evaluation summaries in the Comparisons view. We can compare performance of different LLMs, prompts, datasets and more.
Experiments allow us to easily optimize the performance of our AI applications. This is just the beginning. You can get creative with experiments by following our Experimentation Framework guide.