Aggregate Statistics and Exporting Results
Experiments generate aggregate statistics automatically. These include:
- Mean score
- Score range (min and max)
- Score distribution, e.g. p25, p50, p75
Let's run an experiment evaluating GPT-4o-mini on the FinanceBench dataset (available in the Patronus platform). When you run an experiment from the command line, aggregate statistics are printed to the console:
❯ python run_financebench.py
Preparing dataset... DONE
Preparing evaluators... DONE
=======================================================
Experiment Tutorial/FinanceBench Experiment-1731463168: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [01:53<00:00, 1.32sample/s]
Summary: judge-small-2024-08-08:patronus:fuzzy-match
----------------------------------------------------
Count     : 150
Pass rate : 0.52
Mean      : 0.52
Min       : 0.0
25%       : 0.0
50%       : 1.0
75%       : 1.0
Max       : 1.0
Score distribution
Score Range     Count    Histogram
0.00 - 0.20     72       ##################
0.20 - 0.40     0
0.40 - 0.60     0
0.60 - 0.80     0
0.80 - 1.00     78       ####################
View results in the UI
Aggregate statistics can also be viewed in the UI. Each experiment view shows individual rows and scores, along with the aggregate statistics for the dataset. Here, we see that GPT-4o-mini answered 52% of FinanceBench questions correctly according to our LLM judge evaluator.
Exporting results
To export experiment results to a Pandas dataframe, call to_dataframe() on the object returned by cli.experiment(). For example:
results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)
df = results.to_dataframe()
We can inspect the dataframe and see the original dataset fields, along with eval results:
> df.head()
link_idx evaluator_id criteria pass ... meta_evaluated_model_name meta_evaluated_model_provider meta_evaluated_model_selected_model meta_evaluated_model_params
sid ...
1 0 judge-small-2024-08-08 patronus:fuzzy-match True ... None None None None
2 0 judge-small-2024-08-08 patronus:fuzzy-match True ... None None None None
3 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
4 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
5 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
[5 rows x 22 columns]
You can export these results or continue to perform your own analyses on the dataframe!
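For example, you can recompute the console summary directly from the dataframe. The snippet below is a minimal sketch: it assumes the export contains a boolean pass column (visible in df.head() above) and a numeric score column, and the exact column names may vary by SDK version.

# Minimal sketch: recompute aggregate statistics from the exported dataframe.
# Assumes a boolean `pass` column and a numeric `score` column; adjust the
# column names to match your export.
pass_rate = df["pass"].mean()
print(f"Pass rate: {pass_rate:.2f}")

if "score" in df.columns:
    # describe() reports count, mean, min, percentiles, and max,
    # matching the aggregates printed in the console summary above.
    print(df["score"].describe())

# Inspect the failing rows to see which questions the model missed.
failures = df[~df["pass"].astype(bool)]
print(failures.head())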
To save the results locally:
# .csv export
df.to_csv("/path/to/results.csv")
# .jsonl export
df.to_json("/path/to/results.jsonl", orient="records", lines=True)
# .json export
df.to_json("/path/to/results.json", orient="records")
Full Code
The full code for the FinanceBench experiment used in this example is below:
from openai import OpenAI
from patronus import Client, task, Row, TaskResult, evaluator
oai = OpenAI()
cli = Client()
@task
def call_gpt(evaluated_model_input: str, evaluated_model_retrieved_context: list[str]) -> str:
    model = "gpt-4o-mini"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation."},
                {
                    "role": "user",
                    "content": f"Question: {evaluated_model_input}\nContext: {evaluated_model_retrieved_context}",
                },
            ],
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output
financebench_dataset = cli.remote_dataset("financebench")
fuzzy_match = cli.remote_evaluator("judge-small", "patronus:fuzzy-match")
# The framework will handle loading automatically when passed to an experiment
results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)
df = results.to_dataframe()
df.to_json("/path/to/results.json", orient="records")