Description
Cookbooks

Evaluating and Comparing LLMs

An experiment is an end to end batched evaluation that helps us answer questions such as "Is GPT-4o or Claude better for my task?", or "Does adding this sentence to my system prompt increase task accuracy?"

An experiment consists of several components:

  • Dataset: The inputs to the AI application that we want to test.
  • Task: The task executes the workflow that we are testing. An example task is to query GPT-4o-mini with a system and user prompt.
  • Evaluators: The criteria that we are evaluating on, such as similarity to the gold label.

1. Install Patronus Module

You can use our python SDK to run batched evaluations and track experiments. If you prefer to run batched evaluations in a different language, follow our API reference guide.

pip install patronus

2. Set Environment Variables

If you do not have a Patronus AI API Key, see our quick start here for how to create one. You only need an OpenAI API Key for this tutorial to evaluate ChatGPT.

export PATRONUSAI_API_KEY=<YOUR_API_KEY>
export OPENAI_API_KEY=<YOUR_OPENAI_KEY>

3. Run an Experiment

Let's run a simple experiment that quizzes GPT-4o-mini on some science questions and checks if the output matches the correct answer.

from openai import OpenAI
from patronus import Client, task, simple_evaluator
 
oai = OpenAI()
cli = Client()
 
data = [
  {
    "evaluated_model_input": "Which cell is closely associated with phagocytosis?",
    "evaluated_model_gold_answer": "Neutrophilis",
  },
  {
    "evaluated_model_input": "What do you call goods when the cross-price elasticity of demand is negative?",
    "evaluated_model_gold_answer": "Complements",
  },
]
 
@task
def call_gpt(evaluated_model_input: str) -> str:
    model = "gpt-4o-mini"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation."},
                {"role": "user", "content": evaluated_model_input},
            ],
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output
 
exact_match = simple_evaluator(lambda output, gold_answer: output == gold_answer)
 
cli.experiment(
    "Tutorial",
    data=data,
    task=call_gpt,
    evaluators=[exact_match],
    tags={"dataset_type": "science_questions", "model": "gpt_4o_mini"},
    experiment_name="GPT-4o-mini Experiment",
)

Here we have defined a few concepts

  • A Dataset containing an array of inputs and gold answers that is used in our evaluation of our LLM system
  • The Task is to query GPT-4o-mini with inputs from our dataset
  • A simple Evaluator that checks whether the output matches the gold answer

4. View Experiment Results

Your experiment run should've generated a link to the platform. You can view experiment results in the Logs UI.

We see that GPT-4o-mini got both questions incorrect! Let's see if we can improve the LLM performance.

5. Run another experiment

We will run the same experiment with GPT-4o and see if performance is different. To do this, simply swap out the model in the @task definition. Remember to update tags and experiment_name if you are tracking the model version in metadata.

We see that GPT-4o got the second question correct.

6. Compare experiments

Let's see aggregate stats and evaluation summaries in the Comparisons view. We can compare performance of different LLMs, prompts, datasets and more.

Experiments allow us to easily optimize the performance of our AI applications. This is just the beginning. You can get creative with experiments by following our Experimentation Framework guide.

On this page