Evaluating LLMs on FinanceBench
In this tutorial, we will run an experiment evaluating GPT-4o-mini on the FinanceBench dataset. The open-source subset of FinanceBench is supported natively in the Patronus platform, and you can view it in the Datasets tab.
This cookbook assumes you have already installed the patronus client and have set the PATRONUS_API_KEY environment variable. You also need to provide OPENAI_API_KEY in your environment so that this tutorial can query the candidate model, although you can substitute an alternative LLM.
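Before running anything, it can help to confirm that both variables are actually set. The following is a minimal sketch using only the standard library; the helper name missing_env_vars is hypothetical and is not part of the patronus or openai packages.

```python
import os

# The two variables this tutorial relies on.
REQUIRED_VARS = ("PATRONUS_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names in `required` that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("All required environment variables are set.")
```

If you swap in a different LLM provider, replace OPENAI_API_KEY in the tuple with whichever variable that provider's client expects.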
First, let's define a task to call GPT-4o-mini:
from openai import OpenAI
from patronus import Client, task

# Both clients pick up their API keys from the environment variables set earlier.
oai = OpenAI()
cli = Client()


@task
def call_gpt(evaluated_model_input: str, evaluated_model_retrieved_context: list[str]) -> str:
    model = "gpt-4o-mini"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation."},
                {
                    "role": "user",
                    # Question and retrieved context for one dataset row
                    "content": f"Question: {evaluated_model_input}\nContext: {evaluated_model_retrieved_context}"
                },
            ],
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output
Here we are providing GPT-4o-mini with the question and context from the FinanceBench dataset. We will assess if the response matches the gold answer.
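Note that the retrieved context is a list of strings, so interpolating it directly into an f-string inserts Python's list representation (brackets and quotes) into the prompt. If you prefer plain text, one option is to join the passages first. This is only a sketch: the helper build_user_prompt is hypothetical, and the sample question and passages are made up rather than taken from FinanceBench.

```python
def build_user_prompt(question: str, contexts: list[str]) -> str:
    """Format a question and its retrieved passages into one user message."""
    # Join passages with blank lines so the prompt contains only their text,
    # not the brackets and quotes of a Python list repr.
    context_block = "\n\n".join(contexts)
    return f"Question: {question}\nContext: {context_block}"


# Made-up inputs for illustration only.
prompt = build_user_prompt(
    "What was the company's FY2022 revenue?",
    ["Passage one.", "Passage two."],
)
print(prompt)
```

Whether this changes the scores is an empirical question; the tutorial's results were produced with the list interpolated as-is.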
Since the FinanceBench dataset is supported in the Patronus platform, we can load it remotely as follows:
financebench_dataset = cli.remote_dataset("financebench")
Now we need to select an evaluation metric. Since some of the gold answers are longer responses, we want to check for similarity in meaning rather than an exact string match. The fuzzy-match LLM judge is better suited for this task because it scores responses more like a human would.
fuzzy_match = cli.remote_evaluator("judge-small", "patronus:fuzzy-match")
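To see why a strict comparison falls short here, consider the sketch below. It is a plain string comparison written for illustration, not the logic of the fuzzy-match judge, and the answer pair is made up rather than taken from FinanceBench.

```python
def exact_match(output: str, gold: str) -> bool:
    """Strict comparison after trimming whitespace and lowercasing."""
    return output.strip().lower() == gold.strip().lower()


# Made-up long-form pair: same meaning, different wording.
gold = "The company's operating margin improved, driven mainly by lower input costs."
output = "Operating margin rose, mostly because input costs fell."

print(exact_match(output, gold))                # False
print(exact_match(" $1,577.00 ", "$1,577.00"))  # True
```

Exact match works for short numeric answers but rejects a paraphrase of a longer gold answer, which is the case a meaning-based judge is meant to handle.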
We are now ready to run our experiment!
results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)
df = results.to_dataframe()
When you run the experiment from the command line, the aggregate statistics are printed to the console:
❯ python run_financebench.py
Preparing dataset... DONE
Preparing evaluators... DONE
=======================================================
Experiment Tutorial/FinanceBench Experiment-1731463168: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 150/150 [01:53<00:00, 1.32sample/s]
Summary: judge-small-2024-08-08:patronus:fuzzy-match
----------------------------------------------------
Count : 150
Pass rate : 0.52
Mean : 0.52
Min : 0.0
25% : 0.0
50% : 1.0
75% : 1.0
Max : 1.0
Score distribution
Score Range     Count   Histogram
0.00 - 0.20        72   ##################
0.20 - 0.40         0
0.40 - 0.60         0
0.60 - 0.80         0
0.80 - 1.00        78   ####################
View results in the UI
Aggregate statistics can be viewed in the UI. Each experiment view shows individual rows and scores, along with the aggregate statistics for the dataset. Here, we see that GPT-4o-mini got 52% of responses correct on FinanceBench according to our LLM judge evaluator.
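The 52% figure can be checked by hand from the score distribution in the console output. The short sketch below assumes that every score is either 0 or 1, which the empty middle buckets and the 0.0 minimum and 1.0 maximum suggest, so that the mean and the pass rate coincide.

```python
# Counts copied from the score distribution in the console output.
failed = 72  # rows in the 0.00 - 0.20 bucket
passed = 78  # rows in the 0.80 - 1.00 bucket
total = failed + passed

# With binary scores, the pass rate equals the mean score.
pass_rate = passed / total
print(total, round(pass_rate, 2))  # 150 0.52
```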
Exporting results
To export experiment results to a pandas DataFrame, call to_dataframe() on the experiment results. For example:
results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)
df = results.to_dataframe()
We can inspect the dataframe and see the original dataset fields, along with eval results:
> df.head()
     link_idx            evaluator_id              criteria   pass  ...  meta_evaluated_model_name  meta_evaluated_model_provider  meta_evaluated_model_selected_model  meta_evaluated_model_params
sid                                                                 ...
1           0  judge-small-2024-08-08  patronus:fuzzy-match   True  ...                       None                           None                                 None                         None
2           0  judge-small-2024-08-08  patronus:fuzzy-match   True  ...                       None                           None                                 None                         None
3           0  judge-small-2024-08-08  patronus:fuzzy-match  False  ...                       None                           None                                 None                         None
4           0  judge-small-2024-08-08  patronus:fuzzy-match  False  ...                       None                           None                                 None                         None
5           0  judge-small-2024-08-08  patronus:fuzzy-match  False  ...                       None                           None                                 None                         None

[5 rows x 22 columns]
You can export these results or continue to perform your own analyses on the dataframe!
To save the results locally, use one of the following exports:
# .csv export
df.to_csv("/path/to/results.csv")
# .jsonl export
df.to_json("/path/to/results.jsonl", orient="records", lines=True)
# .json export
df.to_json("/path/to/results.json", orient="records")
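An exported .jsonl file can later be analyzed without pandas. The sketch below recomputes the pass rate using only the standard library; it assumes each line is a JSON object with a boolean pass field, matching the pass column shown in the dataframe. The function name pass_rate_from_jsonl and the sample lines are made up for illustration.

```python
import json


def pass_rate_from_jsonl(lines):
    """Return the fraction of rows whose `pass` field is true, or None if empty."""
    rows = [json.loads(line) for line in lines if line.strip()]
    if not rows:
        return None
    return sum(1 for row in rows if row.get("pass")) / len(rows)


# Made-up sample lines standing in for an exported file.
sample = [
    '{"evaluator_id": "judge-small-2024-08-08", "criteria": "patronus:fuzzy-match", "pass": true}',
    '{"evaluator_id": "judge-small-2024-08-08", "criteria": "patronus:fuzzy-match", "pass": false}',
]
print(pass_rate_from_jsonl(sample))  # 0.5
```

To run it on a real export, pass an open file handle instead of the sample list, for example the handle returned by open("/path/to/results.jsonl").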
Full Code
The full code for the FinanceBench experiment used in this example is shown below:
from openai import OpenAI
from patronus import Client, task

oai = OpenAI()
cli = Client()


@task
def call_gpt(evaluated_model_input: str, evaluated_model_retrieved_context: list[str]) -> str:
    model = "gpt-4o-mini"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation."},
                {
                    "role": "user",
                    "content": f"Question: {evaluated_model_input}\nContext: {evaluated_model_retrieved_context}"
                },
            ],
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output


financebench_dataset = cli.remote_dataset("financebench")
fuzzy_match = cli.remote_evaluator("judge-small", "patronus:fuzzy-match")

# The framework will handle loading automatically when passed to an experiment
results = cli.experiment(
    "Tutorial",
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
    tags={"dataset_type": "financebench", "model": "gpt_4o_mini"},
    experiment_name="FinanceBench Experiment",
)

df = results.to_dataframe()
df.to_json("/path/to/results.json", orient="records")