Aggregate Statistics and Exporting Results

Experiments generate aggregate statistics automatically. This includes the following:

  • Mean score
  • Score ranges
  • Distribution of scores, e.g. p50, p75

Let's run an experiment that evaluates GPT-4o-mini on the FinanceBench dataset, which is available in the Patronus platform. When you run the experiment from the command line, the aggregate statistics are printed to the console:

❯ python run_financebench.py
Preparing dataset... DONE
Preparing evaluators... DONE
=======================================================
Experiment  Tutorial/FinanceBench Experiment-1731463168: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [01:53<00:00,  1.32sample/s]

Summary: judge-small-2024-08-08:patronus:fuzzy-match
----------------------------------------------------
Count     : 150
Pass rate : 0.52
Mean      : 0.52
Min       : 0.0
25%       : 0.0
50%       : 1.0
75%       : 1.0
Max       : 1.0

Score distribution
Score Range          Count      Histogram
0.00 - 0.20          72         ##################
0.20 - 0.40          0          
0.40 - 0.60          0          
0.60 - 0.80          0          
0.80 - 1.00          78         ####################         
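
To make the summary easier to interpret, the same kind of figures can be reproduced with pandas. This is a minimal sketch and not the framework's own implementation: the `scores` series is hypothetical data built to mirror the counts in the output above (72 scores of 0.0 and 78 scores of 1.0), and treating a score of 1.0 as a pass is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical scores mirroring the run above: 72 scores of 0.0 and 78 scores of 1.0.
scores = pd.Series([0.0] * 72 + [1.0] * 78, name="score")

# Count, mean, min, quartiles, and max, similar to the summary block.
summary = scores.describe(percentiles=[0.25, 0.5, 0.75]).drop("std")

# Assumption for illustration: a score of 1.0 counts as a pass.
pass_rate = (scores == 1.0).mean()

# Five equal-width bins over [0, 1], similar to the score distribution table.
bins = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
histogram = pd.cut(scores, bins=bins, include_lowest=True).value_counts().sort_index()

print(summary)
print(f"Pass rate: {pass_rate:.2f}")
print(histogram)
```

Because the scores here are binary, the mean and the pass rate coincide, and the median lands on 1.0 as soon as more than half of the samples pass.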

View results in the UI

Aggregate statistics can be viewed in the UI. Each experiment view shows individual rows and scores, along with the aggregate statistics for the dataset. Here, we see that GPT-4o-mini got 52% of responses correct on FinanceBench according to our LLM judge evaluator.


Exporting results

To export experiment results to a pandas DataFrame, call to_dataframe() on the object returned by the experiment. For example:

results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)
df = results.to_dataframe()

We can inspect the dataframe and see the original dataset fields, along with eval results:

> df.head()
     link_idx            evaluator_id              criteria   pass  ...  meta_evaluated_model_name meta_evaluated_model_provider meta_evaluated_model_selected_model meta_evaluated_model_params
sid                                                                 ...                                                                                                                         
1           0  judge-small-2024-08-08  patronus:fuzzy-match   True  ...                       None                          None                                None                        None
2           0  judge-small-2024-08-08  patronus:fuzzy-match   True  ...                       None                          None                                None                        None
3           0  judge-small-2024-08-08  patronus:fuzzy-match  False  ...                       None                          None                                None                        None
4           0  judge-small-2024-08-08  patronus:fuzzy-match  False  ...                       None                          None                                None                        None
5           0  judge-small-2024-08-08  patronus:fuzzy-match  False  ...                       None                          None                                None                        None

[5 rows x 22 columns]

You can export these results or continue to perform your own analyses on the dataframe!
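
As one example of such an analysis, the sketch below computes a pass rate per evaluator and pulls out the failing rows. The dataframe is a small hypothetical stand-in for the output of to_dataframe(); it uses only the `sid`, `evaluator_id`, `criteria`, and `pass` columns visible in the preview above, and the values are illustrative rather than real results.

```python
import pandas as pd

# Hypothetical stand-in for results.to_dataframe(), limited to columns shown in the preview.
df = pd.DataFrame(
    {
        "evaluator_id": ["judge-small-2024-08-08"] * 5,
        "criteria": ["patronus:fuzzy-match"] * 5,
        "pass": [True, True, False, False, False],
    },
    index=pd.Index([1, 2, 3, 4, 5], name="sid"),
)

# Fraction of passing rows for each evaluator and criteria pair.
pass_rates = df.groupby(["evaluator_id", "criteria"])["pass"].mean()

# Rows that failed, kept for manual inspection.
failures = df[~df["pass"]]

print(pass_rates)
print(failures)
```

Grouping by both `evaluator_id` and `criteria` keeps the breakdown meaningful if an experiment is run with more than one evaluator.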

To save the results locally, use the standard pandas export methods:

# .csv export
df.to_csv("/path/to/results.csv")

# .jsonl export
df.to_json("/path/to/results.jsonl", orient="records", lines=True)

# .json export
df.to_json("/path/to/results.json", orient="records")
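
One detail worth knowing when exporting: pandas omits the index from to_json() output when orient="records" is used, so the `sid` index would not appear in the JSON files. The sketch below shows one way to keep it by calling reset_index() first, then reads the file back to confirm the round trip. The dataframe and the temporary file path are hypothetical placeholders, not real experiment output.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in for the experiment dataframe, indexed by sid.
df = pd.DataFrame(
    {"criteria": ["patronus:fuzzy-match"] * 3, "pass": [True, False, True]},
    index=pd.Index([1, 2, 3], name="sid"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "results.jsonl"

    # reset_index() turns sid into a regular column so it is written with each record.
    df.reset_index().to_json(path, orient="records", lines=True)

    # Read the JSON Lines file back and restore sid as the index.
    restored = pd.read_json(path, lines=True).set_index("sid")

print(restored)
```

The CSV export does not need this step, since to_csv() writes the index by default.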

Full Code

The full code for the FinanceBench experiment used in this example is shown below:

from openai import OpenAI
from patronus import Client, task, Row, TaskResult, evaluator

oai = OpenAI()
cli = Client()

@task
def call_gpt(evaluated_model_input: str, evaluated_model_retrieved_context: list[str]) -> str:
    model = "gpt-4o-mini"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation."},
                {
                    "role": "user",
                    "content": f"Question: {evaluated_model_input}\n\nContext: {evaluated_model_retrieved_context}"
                },
            ],
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output

financebench_dataset = cli.remote_dataset("financebench")
fuzzy_match = cli.remote_evaluator("judge-small", "patronus:fuzzy-match")

# The framework will handle loading automatically when passed to an experiment
results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)
df = results.to_dataframe()
df.to_json("/path/to/results.json", orient="records")