Experiments automatically generate aggregate statistics. These include the following (a short sketch after the list shows how they are computed):
Mean score
Score ranges
Distribution of scores, e.g. p50 and p75
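To make these values concrete, here is a minimal sketch (not part of the Patronus SDK; the score values are hypothetical) showing how the reported summary statistics correspond to standard aggregations over a list of evaluator scores:
Python
import numpy as np

# Hypothetical evaluator scores (binary pass/fail, as produced by a fuzzy-match judge)
scores = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0])

print("Count :", scores.size)
print("Mean  :", scores.mean())              # mean score
print("Min   :", scores.min())
print("25%   :", np.percentile(scores, 25))  # p25
print("50%   :", np.percentile(scores, 50))  # p50 (median)
print("75%   :", np.percentile(scores, 75))  # p75
print("Max   :", scores.max())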
Running an example experiment that evaluates GPT-4o-mini on the FinanceBench dataset prints the aggregate statistics to the console:
Full experiment code
Python
from openai import OpenAI
from patronus.datasets import RemoteDatasetLoader
from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment

oai = OpenAI()


def call_gpt(row, **kwargs):
    model = "gpt-4o-mini"
    # Build the prompt with context
    context = "\n".join(row.task_context) if isinstance(row.task_context, list) else row.task_context
    prompt = f"Question: {row.task_input}\n\nContext: {context}"
    response = oai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content


# Load dataset from Patronus platform
financebench_dataset = RemoteDatasetLoader("financebench")

# Run the experiment
results = run_experiment(
    dataset=financebench_dataset,
    task=call_gpt,
    evaluators=[
        RemoteEvaluator("judge", "patronus:fuzzy-match")
    ],
    tags={"dataset_type": "financebench", "model": "gpt-4o-mini"},
    project_name="Tutorial",
    experiment_name="FinanceBench Experiment",
)

# Export results
df = results.to_dataframe()
df.to_json("/path/to/results.json", orient="records")

For more on how this experiment works, see the running experiments guide.
Terminal
❯ python run_financebench.py
Preparing dataset... DONE
Preparing evaluators... DONE
=======================================================
Experiment Tutorial/FinanceBench Experiment-1731463168: 100% | ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 150/150 [01:53<00:00, 1.32sample/s]
Summary: judge-small-2024-08-08:patronus:fuzzy-match
----------------------------------------------------
Count : 150
Pass rate : 0.52
Mean : 0.52
Min : 0.0
25% : 0.0
50% : 1.0
75% : 1.0
Max : 1.0
Score distribution
Score Range Count Histogram
0.00 - 0.20 72 ##################
0.20 - 0.40 0
0.40 - 0.60 0
0.60 - 0.80 0
0.80 - 1.00 78 ####################
Aggregate statistics can also be viewed in the UI: each experiment view shows individual rows and scores alongside the aggregate statistics for the dataset. Here, we see that GPT-4o-mini answered 52% of FinanceBench questions correctly according to our LLM judge evaluator.
To export experiment results to a Pandas DataFrame, call to_dataframe() on the object returned by run_experiment(). For example:
Python
# Dummy experiment
experiment = run_experiment(dataset, task, evaluators)
# Get a Pandas DataFrame
df = experiment.to_dataframe()
Inspecting the DataFrame shows the original dataset fields alongside the evaluation results:
Terminal
> df.head()
link_idx evaluator_id criteria pass ... meta_evaluated_model_name meta_evaluated_model_provider meta_evaluated_model_selected_model meta_evaluated_model_params
sid ...
1 0 judge-small-2024-08-08 patronus:fuzzy-match True ... None None None None
2 0 judge-small-2024-08-08 patronus:fuzzy-match True ... None None None None
3 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
4 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
5 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
[5 rows x 22 columns]
You can export these results or continue to perform your own analyses on the dataframe!
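For example, a minimal sketch of such an analysis (assuming the pass and criteria columns shown in the df.head() output above; column names beyond that output are not guaranteed) might compute the overall pass rate and a per-criterion breakdown:
Python
# Sketch only: overall pass rate and a per-criterion breakdown.
# Assumes "pass" holds booleans and "criteria" holds the criterion name,
# as shown in the df.head() output above.
overall_pass_rate = df["pass"].mean()
print(f"Overall pass rate: {overall_pass_rate:.2f}")

# Pass rate grouped by evaluation criterion
print(df.groupby("criteria")["pass"].mean())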
To save the results locally:
Python
# .csv export
df.to_csv("/path/to/results.csv")

# .jsonl export
df.to_json("/path/to/results.jsonl", orient="records", lines=True)

# .json export
df.to_json("/path/to/results.json", orient="records")