Evaluating LLMs on FinanceBench

In this tutorial, we will run an experiment evaluating GPT-4o-mini on the FinanceBench dataset. The open-source subset of FinanceBench is supported natively in the Patronus platform, and you can view it in the Datasets tab.

This cookbook assumes you have already installed the patronus client and set the PATRONUS_API_KEY environment variable. You also need OPENAI_API_KEY in your environment, since this tutorial queries GPT-4o-mini as the candidate model; you can substitute an alternative LLM if you prefer.
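
As a quick sanity check, you can confirm both keys are visible to your script before continuing. This snippet is optional and only illustrates one way to verify your environment:

import os

# Optional check: both keys must be present in the environment before
# the OpenAI and Patronus clients are created below.
for var in ("PATRONUS_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(var):
        raise RuntimeError(f"{var} is not set")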

First, let's define a task to call GPT-4o-mini:

from openai import OpenAI
from patronus import Client, task, Row, TaskResult, evaluator

oai = OpenAI()
cli = Client()

@task
def call_gpt(evaluated_model_input: str, evaluated_model_retrieved_context: list[str]) -> str:
    model = "gpt-4o-mini"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation."},
                {
                    "role": "user",
                    "content": f"Question: {evaluated_model_input}\n\Context: {evaluated_model_retrieved_context}"
                },
            ],
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output

Here we provide GPT-4o-mini with the question and retrieved context from the FinanceBench dataset. We will then assess whether the response matches the gold answer.
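
The task's keyword arguments map onto fields of the dataset rows. For intuition, a FinanceBench-style row looks roughly like the sketch below; the values are invented for illustration, and the gold answer field name is an assumption based on the evaluated_model_* naming convention used in the task signature:

# Illustrative only: invented values, following the evaluated_model_*
# naming convention used in the task signature above.
example_row = {
    "evaluated_model_input": "What was 3M's FY2022 capital expenditure?",
    "evaluated_model_retrieved_context": [
        "3M reported purchases of property, plant and equipment of $1,749 million in 2022."
    ],
    "evaluated_model_gold_answer": "$1.75 billion",
}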

Since the FinanceBench dataset is supported in the Patronus platform, we can load it remotely as follows:

financebench_dataset = cli.remote_dataset("financebench")

Now we need to select an evaluation metric. Since some of the gold answers are longer, free-form responses, we want to check for similarity in meaning rather than an exact string match. The fuzzy-match LLM judge is better suited to this task because it scores responses much like a human grader would.

fuzzy_match = cli.remote_evaluator("judge-small", "patronus:fuzzy-match")
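
For contrast, a strict exact-match baseline could be written as a custom function evaluator using the @evaluator decorator imported above. This is an illustrative sketch only and is not used in the experiment below; it assumes the gold answer is passed to evaluators as evaluated_model_gold_answer:

@evaluator
def exact_match(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    # Case- and whitespace-insensitive string comparison; much stricter
    # than the fuzzy-match LLM judge used in this tutorial.
    return evaluated_model_output.strip().lower() == evaluated_model_gold_answer.strip().lower()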

We are now ready to run our experiment!

results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)

When you run the experiment, aggregate statistics are printed to the console:

❯ python run_financebench.py
Preparing dataset... DONE
Preparing evaluators... DONE
=======================================================
Experiment  Tutorial/FinanceBench Experiment-1731463168: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [01:53<00:00,  1.32sample/s]

Summary: judge-small-2024-08-08:patronus:fuzzy-match
----------------------------------------------------
Count     : 150
Pass rate : 0.52
Mean      : 0.52
Min       : 0.0
25%       : 0.0
50%       : 1.0
75%       : 1.0
Max       : 1.0

Score distribution
Score Range          Count      Histogram
0.00 - 0.20          72         ##################
0.20 - 0.40          0          
0.40 - 0.60          0          
0.60 - 0.80          0          
0.80 - 1.00          78         ####################         

View results in the UI

Aggregate statistics can also be viewed in the UI. Each experiment view shows individual rows and scores, along with aggregate statistics for the dataset. Here we see that GPT-4o-mini answered 52% of FinanceBench questions correctly according to our LLM judge evaluator.

Exporting results

To export experiment results to a Pandas dataframe, call to_dataframe() on the results object returned by cli.experiment():

df = results.to_dataframe()

We can inspect the dataframe and see the original dataset fields, along with eval results:

> df.head()
     link_idx            evaluator_id              criteria   pass  ...  meta_evaluated_model_name meta_evaluated_model_provider meta_evaluated_model_selected_model meta_evaluated_model_params
sid                                                                 ...                                                                                                                         
1           0  judge-small-2024-08-08  patronus:fuzzy-match   True  ...                       None                          None                                None                        None
2           0  judge-small-2024-08-08  patronus:fuzzy-match   True  ...                       None                          None                                None                        None
3           0  judge-small-2024-08-08  patronus:fuzzy-match  False  ...                       None                          None                                None                        None
4           0  judge-small-2024-08-08  patronus:fuzzy-match  False  ...                       None                          None                                None                        None
5           0  judge-small-2024-08-08  patronus:fuzzy-match  False  ...                       None                          None                                None                        None

[5 rows x 22 columns]

You can export these results or continue to perform your own analyses on the dataframe!
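
For example, you can recompute the aggregate pass rate directly from the per-row results (this assumes the boolean pass column shown in the preview above):

# Fraction of rows the fuzzy-match judge marked as passing.
pass_rate = df["pass"].mean()
print(f"Pass rate: {pass_rate:.2f}")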

To save the results locally:

# .csv export
df.to_csv("/path/to/results.csv")

# .jsonl export
df.to_json("/path/to/results.jsonl", orient="records", lines=True)

# .json export
df.to_json("/path/to/results.json", orient="records")

Full Code

The full code for the FinanceBench experiment used in this example is below:

from openai import OpenAI
from patronus import Client, task, Row, TaskResult, evaluator

oai = OpenAI()
cli = Client()

@task
def call_gpt(evaluated_model_input: str, evaluated_model_retrieved_context: list[str]) -> str:
    model = "gpt-4o-mini"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation."},
                {
                    "role": "user",
                    "content": f"Question: {evaluated_model_input}\nContext: {evaluated_model_retrieved_context}"
                },
            ],
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output

# remote_dataset returns a dataset reference; the framework loads the data
# automatically when it is passed to the experiment
financebench_dataset = cli.remote_dataset("financebench")
fuzzy_match = cli.remote_evaluator("judge-small", "patronus:fuzzy-match")

results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)
df = results.to_dataframe()
df.to_json("/path/to/results.json", orient="records")