Experiments generate aggregate statistics automatically. These include:
Mean score
Score ranges
Distribution of scores, e.g. p50 and p75
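As a rough illustration of what these statistics correspond to, here is a minimal sketch that computes the same aggregates from raw scores with pandas (the scores list is hypothetical; the experiment framework produces these numbers for you automatically):
import pandas as pd

# Hypothetical raw scores from an experiment run
scores = pd.Series([1.0, 0.0, 1.0, 1.0, 0.0, 1.0])

print(scores.mean())                 # mean score
print(scores.min(), scores.max())    # score range
print(scores.quantile([0.5, 0.75]))  # distribution, e.g. p50 and p75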
Let's run an experiment evaluating GPT-4o-mini on the FinanceBench dataset in the Patronus platform. When you run an experiment from the command line, the aggregate statistics are printed to the console:
❯ python run_financebench.py
Preparing dataset... DONE
Preparing evaluators... DONE
=======================================================
Experiment Tutorial/FinanceBench Experiment-1731463168: 100% | ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 150/150 [01:53<00:00, 1.32sample/s]
Summary: judge-small-2024-08-08:patronus:fuzzy-match
----------------------------------------------------
Count : 150
Pass rate : 0.52
Mean : 0.52
Min : 0.0
25% : 0.0
50% : 1.0
75% : 1.0
Max : 1.0
Score distribution
Score Range Count Histogram
0.00 - 0.20 72 ##################
0.20 - 0.40 0
0.40 - 0.60 0
0.60 - 0.80 0
0.80 - 1.00 78 ####################
Aggregate statistics can also be viewed in the UI. Each experiment view shows individual rows and scores, along with the aggregate statistics for the dataset. Here, we see that GPT-4o-mini answered 52% of responses correctly on FinanceBench according to our LLM judge evaluator.
To export experiment results to a Pandas DataFrame, call to_dataframe() on the returned results object. For example:
results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)
df = results.to_dataframe()
We can inspect the dataframe and see the original dataset fields, along with eval results:
>>> df.head()
link_idx evaluator_id criteria pass ... meta_evaluated_model_name meta_evaluated_model_provider meta_evaluated_model_selected_model meta_evaluated_model_params
sid ...
1 0 judge-small-2024-08-08 patronus:fuzzy-match True ... None None None None
2 0 judge-small-2024-08-08 patronus:fuzzy-match True ... None None None None
3 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
4 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
5 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
[ 5 rows x 22 columns]
You can export these results or continue to perform your own analyses on the dataframe!
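For instance, here is a minimal sketch of further analysis on the dataframe, assuming the boolean pass column and the criteria column shown in the output above:
# Overall pass rate across all rows (the "pass" column is boolean)
print(df["pass"].mean())

# Pass rate broken down by evaluation criteria
print(df.groupby("criteria")["pass"].mean())

# Pull out the failing rows for closer inspection
failures = df[~df["pass"]]
print(failures.head())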
To save the results locally:
# .csv export
df.to_csv("/path/to/results.csv")
# .jsonl export
df.to_json("/path/to/results.jsonl", orient="records", lines=True)
# .json export
df.to_json("/path/to/results.json", orient="records")
The full code for the FinanceBench experiment run used in this example is below:
from openai import OpenAI
from patronus import Client, task, Row, TaskResult, evaluator
oai = OpenAI()
cli = Client()
@task
def call_gpt(evaluated_model_input: str, evaluated_model_retrieved_context: list[str]) -> str:
    model = "gpt-4o-mini"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation.",
                },
                {
                    "role": "user",
                    "content": f"Question: {evaluated_model_input}\n\nContext: {evaluated_model_retrieved_context}",
                },
            ],
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output
financebench_dataset = cli.remote_dataset("financebench")
fuzzy_match = cli.remote_evaluator("judge-small", "patronus:fuzzy-match")
# The framework will handle loading automatically when passed to an experiment
results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)
df = results.to_dataframe()
df.to_json("/path/to/results.json", orient="records")