In this tutorial, we will run an experiment evaluating GPT-4o-mini on the FinanceBench dataset. The open-source subset of FinanceBench is supported natively in the Patronus platform, and you can view it in the Datasets tab.
This cookbook assumes you have already installed the patronus client and set the PATRONUS_API_KEY environment variable. You will also need OPENAI_API_KEY set in your environment, since this tutorial queries OpenAI candidate models; you can substitute an alternative LLM if you prefer.
First, let's define a task to call GPT-4o-mini:
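A minimal sketch of such a task is shown below. It assumes the Patronus SDK's `@task` decorator and the `evaluated_model_input` / `evaluated_model_retrieved_context` field names used by its experiments framework; check the SDK reference for the exact signature in your version.

```python
from openai import OpenAI
from patronus import task

oai_client = OpenAI()  # reads OPENAI_API_KEY from the environment


@task
def call_gpt(evaluated_model_input: str, evaluated_model_retrieved_context: list[str]) -> str:
    """Answer a FinanceBench question using the retrieved context."""
    context = "\n".join(evaluated_model_retrieved_context or [])
    response = oai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": "Answer the question based only on the provided context.",
            },
            {
                "role": "user",
                "content": f"Question: {evaluated_model_input}\n\nContext: {context}",
            },
        ],
    )
    return response.choices[0].message.content
```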
Here we are providing GPT-4o-mini with the question and context from the FinanceBench dataset. We will assess whether the response matches the gold answer.
Since the FinanceBench dataset is supported in the Patronus platform, we can load it remotely as follows:
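One way to do this, assuming the SDK's `remote_dataset` helper and the `financebench` dataset identifier, is:

```python
from patronus import Client

client = Client()  # reads PATRONUS_API_KEY from the environment

# Load the FinanceBench open-source subset hosted on the Patronus platform
financebench_dataset = client.remote_dataset("financebench")
```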
Now we need to select an evaluation metric. Since some of the gold answers are long-form, we want to check for similarity in meaning rather than an exact string match. The fuzzy-match LLM judge is better suited to this task, as it scores responses more like a human grader would.
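The judge can be referenced as a hosted evaluator. The sketch below assumes the `judge` evaluator family with the `patronus:fuzzy-match` criterion name; the exact identifiers may differ in your SDK version.

```python
# Reference the hosted LLM judge configured with the fuzzy-match criterion
fuzzy_match = client.remote_evaluator("judge", "patronus:fuzzy-match")
```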
We are now ready to run our experiment!
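Putting the pieces together, a sketch of the experiment call (assuming the `client.experiment` entry point, with a hypothetical experiment name) might look like:

```python
client.experiment(
    "FinanceBench-GPT-4o-mini",  # hypothetical experiment/project name
    data=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
)
```

The experiment runner iterates over each dataset row, calls the task to produce a response, and passes the response and gold answer to the evaluator for scoring.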
When you run the experiment, a summary of the aggregate statistics is printed to the console.
The results can also be viewed in the UI. Each experiment view shows individual rows and scores, along with aggregate statistics for the dataset. Here, we see that GPT-4o-mini answered 52% of FinanceBench questions correctly according to our LLM judge evaluator.