Evaluating and Comparing LLMs
An experiment is an end-to-end batched evaluation that helps us answer questions such as "Is GPT-4o or Claude better for my task?" or "Does adding this sentence to my system prompt increase task accuracy?"
An experiment consists of several components:
- Dataset: The inputs to the AI application that we want to test.
- Task: The workflow that we are testing. An example task is to query GPT-4o-mini with a system and user prompt.
- Evaluators: The criteria we evaluate against, such as similarity to the gold answer.
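For instance, a single dataset row in this tutorial pairs a question with its gold answer. The field names below are illustrative assumptions, not necessarily the exact schema the SDK expects; check the SDK reference for the required fields.

```python
# One dataset row: the input we send to the task and the expected (gold) answer.
# Field names here are assumptions, shown only to illustrate the structure.
row = {
    "evaluated_model_input": "What is the chemical symbol for potassium?",
    "evaluated_model_gold_answer": "K",
}
```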
1. Install Patronus Module
You can use our Python SDK to run batched evaluations and track experiments. If you prefer to run batched evaluations in a different language, follow our API reference guide.
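Install the SDK from PyPI (the package name patronus is assumed here; we also install the OpenAI client, since the task later in this tutorial queries OpenAI models):

```bash
pip install patronus openai
```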
2. Set Environment Variables
If you do not have a Patronus AI API Key, see our quick start here for how to create one. The only additional key you need for this tutorial is an OpenAI API Key, since we are evaluating OpenAI models.
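For example, in your shell (assuming the SDK reads its key from PATRONUS_API_KEY; the OpenAI client reads OPENAI_API_KEY):

```bash
export PATRONUS_API_KEY="<your-patronus-api-key>"
export OPENAI_API_KEY="<your-openai-api-key>"
```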
3. Run an Experiment
Let's run a simple experiment that quizzes GPT-4o-mini on some science questions and checks if the output matches the correct answer.
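Here is a minimal sketch of such an experiment. Treat the names as assumptions rather than the definitive API: the Client entry point, the @task and @evaluator decorators, the client.experiment arguments, and the evaluated_model_* field names follow the pattern described in this guide but may differ from the current SDK, and the two quiz questions are placeholders for your own dataset. Consult the SDK reference for exact signatures.

```python
from openai import OpenAI
from patronus import Client, task, evaluator

oai = OpenAI()     # reads OPENAI_API_KEY from the environment
client = Client()  # reads PATRONUS_API_KEY from the environment

# Dataset: science questions paired with their gold answers (placeholder items).
dataset = [
    {
        "evaluated_model_input": "What is the chemical symbol for potassium?",
        "evaluated_model_gold_answer": "K",
    },
    {
        "evaluated_model_input": "How many chromosomes are in a typical human cell?",
        "evaluated_model_gold_answer": "46",
    },
]

# Task: query GPT-4o-mini with the input from each dataset row.
@task
def query_gpt_4o_mini(evaluated_model_input: str) -> str:
    response = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer with a single word or number."},
            {"role": "user", "content": evaluated_model_input},
        ],
    )
    return response.choices[0].message.content

# Evaluator: exact match between the model output and the gold answer.
@evaluator
def exact_match(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    return evaluated_model_output.strip().lower() == evaluated_model_gold_answer.strip().lower()

# Run the batched evaluation and log it as an experiment.
client.experiment(
    "LLM Comparison Tutorial",
    data=dataset,
    task=query_gpt_4o_mini,
    evaluators=[exact_match],
    tags={"model": "gpt-4o-mini"},
    experiment_name="science-quiz-gpt-4o-mini",
)
```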
Here we have defined a few concepts:
- A Dataset containing an array of inputs and gold answers, used to evaluate our LLM system
- A Task that queries GPT-4o-mini with inputs from our dataset
- A simple Evaluator that checks whether the output matches the gold answer
4. View Experiment Results
Your experiment run should have generated a link to the platform. You can view experiment results in the Logs UI.
We see that GPT-4o-mini got both questions incorrect! Let's see if we can improve the LLM's performance.
5. Run Another Experiment
We will run the same experiment with GPT-4o and see if performance is different. To do this, simply swap out the model in the @task definition. Remember to update tags and experiment_name if you are tracking the model version in metadata.
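Concretely, the change to the sketch from step 3 might look like the following (the argument names remain illustrative assumptions):

```python
# Same experiment, but the task now calls GPT-4o instead of GPT-4o-mini.
@task
def query_gpt_4o(evaluated_model_input: str) -> str:
    response = oai.chat.completions.create(
        model="gpt-4o",  # swapped from "gpt-4o-mini"
        messages=[
            {"role": "system", "content": "Answer with a single word or number."},
            {"role": "user", "content": evaluated_model_input},
        ],
    )
    return response.choices[0].message.content

client.experiment(
    "LLM Comparison Tutorial",
    data=dataset,
    task=query_gpt_4o,
    evaluators=[exact_match],
    tags={"model": "gpt-4o"},               # updated model tag
    experiment_name="science-quiz-gpt-4o",  # updated experiment name
)
```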
We see that GPT-4o got the second question correct.
6. Compare Experiments
Let's see aggregate stats and evaluation summaries in the Comparisons view. We can compare performance of different LLMs, prompts, datasets and more.
Experiments allow us to easily optimize the performance of our AI applications. This is just the beginning. You can get creative with experiments by following our Experimentation Framework guide.