Experiments automatically generate aggregate statistics. These include the following (a short sketch after the list shows how they are computed):
Mean score
Score ranges
Distribution of scores, e.g. p50 and p75
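To make these values concrete, here is a minimal sketch (not part of the Patronus SDK; the score values are hypothetical) showing how the reported summary statistics correspond to standard aggregations over a list of evaluator scores:
Python
import numpy as np

# Hypothetical evaluator scores (binary pass/fail, as produced by a fuzzy-match judge)
scores = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0])

print("Count :", scores.size)
print("Mean  :", scores.mean())              # mean score
print("Min   :", scores.min())
print("25%   :", np.percentile(scores, 25))  # p25
print("50%   :", np.percentile(scores, 50))  # p50 (median)
print("75%   :", np.percentile(scores, 75))  # p75
print("Max   :", scores.max())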
Running an example experiment that evaluates GPT-4o-mini on the FinanceBench dataset prints the aggregate statistics to the console:
Full experiment code
Python
from openai import OpenAI
from patronus.datasets import RemoteDatasetLoader
from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment

oai = OpenAI()


def call_gpt(row, **kwargs):
    model = "gpt-4o-mini"
    # Build the prompt with context
    context = "\n".join(row.task_context) if isinstance(row.task_context, list) else row.task_context
    prompt = f"Question: {row.task_input}\n\nContext: {context}"
    response = oai.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation."},
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content


# Load dataset from Patronus platform
financebench_dataset = RemoteDatasetLoader("financebench")

# Run the experiment
results = run_experiment(
    dataset=financebench_dataset,
    task=call_gpt,
    evaluators=[
        RemoteEvaluator("judge", "patronus:fuzzy-match")
    ],
    tags={"dataset_type": "financebench", "model": "gpt-4o-mini"},
    project_name="Tutorial",
    experiment_name="FinanceBench Experiment",
)

# Export results
df = results.to_dataframe()
df.to_json("/path/to/results.json", orient="records")

For more on how this experiment works, see the running experiments guide.
Terminal
❯ python run_financebench.py
Preparing dataset... DONE
Preparing evaluators... DONE
=======================================================
Experiment Tutorial/FinanceBench Experiment-1731463168: 100% | ████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████ | 150/150 [01:53<00:00, 1.32sample/s]
Summary: judge-small-2024-08-08:patronus:fuzzy-match
----------------------------------------------------
Count : 150
Pass rate : 0.52
Mean : 0.52
Min : 0.0
25% : 0.0
50% : 1.0
75% : 1.0
Max : 1.0
Score distribution
Score Range Count Histogram
0.00 - 0.20 72 ##################
0.20 - 0.40 0
0.40 - 0.60 0
0.60 - 0.80 0
0.80 - 1.00 78 ####################
Aggregate statistics can also be viewed in the UI: each experiment view shows individual rows and scores alongside the aggregate statistics for the dataset. Here, we see that GPT-4o-mini answered 52% of FinanceBench questions correctly according to our LLM judge evaluator.
To export experiment results to a Pandas DataFrame, call to_dataframe() on the object returned by run_experiment(). For example:
Python
# Dummy experiment
experiment = run_experiment(dataset, task, evaluators)
# Get a Pandas DataFrame
df = experiment.to_dataframe()
Inspecting the DataFrame shows the original dataset fields alongside the evaluation results:
Terminal
> df.head()
link_idx evaluator_id criteria pass ... meta_evaluated_model_name meta_evaluated_model_provider meta_evaluated_model_selected_model meta_evaluated_model_params
sid ...
1 0 judge-small-2024-08-08 patronus:fuzzy-match True ... None None None None
2 0 judge-small-2024-08-08 patronus:fuzzy-match True ... None None None None
3 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
4 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
5 0 judge-small-2024-08-08 patronus:fuzzy-match False ... None None None None
[5 rows x 22 columns]
You can export these results or continue to perform your own analyses on the dataframe!
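For example, a minimal sketch of such an analysis (assuming the pass and criteria columns shown in the df.head() output above; column names beyond that output are not guaranteed) might compute the overall pass rate and a per-criterion breakdown:
Python
# Sketch only: overall pass rate and a per-criterion breakdown.
# Assumes "pass" holds booleans and "criteria" holds the criterion name,
# as shown in the df.head() output above.
overall_pass_rate = df["pass"].mean()
print(f"Overall pass rate: {overall_pass_rate:.2f}")

# Pass rate grouped by evaluation criterion
print(df.groupby("criteria")["pass"].mean())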
To save the results locally:
Python
# .csv export
df.to_csv("/path/to/results.csv")

# .jsonl export
df.to_json("/path/to/results.jsonl", orient="records", lines=True)

# .json export
df.to_json("/path/to/results.json", orient="records")