Evaluating LLMs on FinanceBench
In this tutorial, we will run an experiment evaluating GPT-4o-mini on the FinanceBench dataset. The open-source subset of FinanceBench is supported natively in the Patronus platform, and you can view it in the Datasets tab.
This cookbook assumes you have already installed the patronus SDK and set the PATRONUS_API_KEY environment variable. You will also need OPENAI_API_KEY in your environment, since this tutorial queries GPT-4o-mini as the candidate model; you can substitute an alternative LLM if you prefer.
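As a quick sanity check, you can verify both variables are set before proceeding. This is a minimal sketch; the variable names are the ones this tutorial relies on:

```python
import os

# Fail fast if either key required by this tutorial is missing.
for var in ("PATRONUS_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(var):
        raise RuntimeError(f"Environment variable {var} is not set")
```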
First, let's define a task to call GPT-4o-mini:
```python
from openai import OpenAI

oai = OpenAI()

def call_gpt(row, **kwargs):
    model = "gpt-4o-mini"
    # Build the prompt with context
    context = "\n".join(row.task_context) if isinstance(row.task_context, list) else row.task_context
    prompt = f"Question: {row.task_input}\n\nContext: {context}"
    response = oai.chat.completions.create(
        model=model,
        messages=[
            {
                "role": "system",
                "content": "You are a helpful assistant. You must respond with only the answer. Do not provide an explanation.",
            },
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
```
Here we provide GPT-4o-mini with the question and context from the FinanceBench dataset. We will then assess whether the response matches the gold answer.
Since the FinanceBench dataset is supported in the Patronus platform, we can load it remotely as follows:
```python
from patronus.datasets import RemoteDatasetLoader

financebench_dataset = RemoteDatasetLoader("financebench")
```
Now we need to select an evaluation metric. Since some of the gold answers are longer responses, we want to check for similarity in meaning rather than an exact string match. The fuzzy-match LLM judge is better suited for this task because it scores responses more like a human grader would.
```python
from patronus.evals import RemoteEvaluator

fuzzy_match = RemoteEvaluator("judge", "patronus:fuzzy-match")
```
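With the task, dataset, and evaluator defined, we can run the experiment. The sketch below assumes the SDK's `run_experiment` entry point accepts these components directly; check the Python SDK docs for the exact signature and options:

```python
from patronus.experiments import run_experiment

# Run the task against every row of FinanceBench and score each
# response with the fuzzy-match judge.
experiment = run_experiment(
    dataset=financebench_dataset,
    task=call_gpt,
    evaluators=[fuzzy_match],
)
```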
Once the experiment completes, the results can be viewed in the UI. Each experiment view shows the individual rows and scores, along with aggregate statistics for the dataset. Here, we see that GPT-4o-mini answered 52% of FinanceBench questions correctly according to our LLM judge evaluator.