Experiments automatically generate aggregate statistics, including:
- Mean score
- Score ranges
- Distribution of scores, e.g. p50, p75
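These are standard summary statistics over the per-sample scores. As a point of reference, here is a minimal sketch of how such statistics are computed with pandas (the scores below are made up for illustration and are not produced by Patronus):
Python
import pandas as pd

# Illustrative per-sample scores (0 = fail, 1 = pass for a binary evaluator)
scores = pd.Series([0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0])

print("Mean :", scores.mean())           # mean score
print("Min  :", scores.min())            # low end of the score range
print("Max  :", scores.max())            # high end of the score range
print("p50  :", scores.quantile(0.50))   # median
print("p75  :", scores.quantile(0.75))   # 75th percentile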
Let's run an experiment evaluating GPT-4o-mini on the FinanceBench dataset in the Patronus platform. When you run the experiment from the console, the aggregate statistics are printed as part of the output:
Terminal
❯ python run_financebench.py
Preparing dataset... DONE
Preparing evaluators... DONE
=======================================================
Experiment Tutorial/FinanceBench Experiment-1731463168: 100% | ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 150/150 [01:53<00:00, 1.32sample/s]
Summary: judge-small-2024-08-08:patronus:fuzzy-match
----------------------------------------------------
Count : 150
Pass rate : 0.52
Mean : 0.52
Min : 0.0
25% : 0.0
50% : 1.0
75% : 1.0
Max : 1.0
Score distribution
Score Range     Count    Histogram
0.00 - 0.20     72       ##################
0.20 - 0.40     0
0.40 - 0.60     0
0.60 - 0.80     0
0.80 - 1.00     78       ####################
Aggregate statistics can also be viewed in the UI. Each experiment view shows individual rows and scores, along with the aggregate statistics for the dataset. Here, we see that GPT-4o-mini answered 52% of responses correctly on FinanceBench according to our LLM judge evaluator.
To export experiment results to a Pandas dataframe, call to_dataframe()
on the returned results object. For example:
Python
results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)
df = results.to_dataframe()
We can inspect the dataframe to see the original dataset fields alongside the evaluation results:
Terminal
> df.head()
link_idx evaluator_id criteria pass ... meta_evaluated_model_name meta_evaluated_model_provider meta_evaluated_model_selected_model meta_evaluated_model_params
sid ...
1 0 judge-small-2024-08-08 patronus:fuzzy-match True ... None None None None
2 0 judge-small-2024-08-08 patronus:fuzzy-match True ... None None None None
3 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
4 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
5 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
[ 5 rows x 22 columns]
You can export these results or continue to perform your own analyses on the dataframe!
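Because the dataframe contains the per-row evaluator verdicts (for example, the pass column shown above), you can compute your own aggregates directly. A minimal sketch, assuming the exported dataframe is bound to df as in the example:
Python
# Overall pass rate (True/False converted to 1.0/0.0)
pass_rate = df["pass"].astype(float).mean()
print(f"Pass rate: {pass_rate:.2f}")

# Pass rate and sample count per evaluator and criteria
per_evaluator = (
    df.assign(passed=df["pass"].astype(float))
      .groupby(["evaluator_id", "criteria"])["passed"]
      .agg(["count", "mean"])
)
print(per_evaluator)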
To save the results locally:
Python
# .csv export
df.to_csv("/path/to/results.csv")

# .jsonl export
df.to_json("/path/to/results.jsonl", orient="records", lines=True)

# .json export
df.to_json("/path/to/results.json", orient="records")
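To reload an export later, for example in a separate analysis notebook, pandas can read these files back. A minimal sketch:
Python
import pandas as pd

# Reload the JSONL export
df = pd.read_json("/path/to/results.jsonl", orient="records", lines=True)

# Or reload the CSV export
df = pd.read_csv("/path/to/results.csv")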
The full code for the FinanceBench experiment run used in this example is below:
Python
from openai import OpenAI
from patronus import Client, task, Row, TaskResult, evaluator

oai = OpenAI()
cli = Client()


@task
def call_gpt(evaluated_model_input: str, evaluated_model_retrieved_context: list[str]) -> str:
    model = "gpt-4o-mini"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation.",
                },
                {
                    "role": "user",
                    "content": f"Question: {evaluated_model_input}\nContext: {evaluated_model_retrieved_context}",
                },
            ],
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output


financebench_dataset = cli.remote_dataset("financebench")
fuzzy_match = cli.remote_evaluator("judge-small", "patronus:fuzzy-match")

# The framework will handle loading automatically when passed to an experiment
results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)
df = results.to_dataframe()
df.to_json("/path/to/results.json", orient="records")