Function Based Evaluators
The fastest way to define an evaluator is to create a function with the @evaluator decorator in our Python SDK. This is recommended for simple heuristic-based evals, such as:
- Schema validation on structured outputs
- Regex and string matching
- Length checks
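As a rough illustration of these three kinds of checks, the sketch below writes each one as a plain Python function that returns a bool. The function names, the required keys, the date pattern, and the 280-character limit are illustrative assumptions, not part of the SDK. Any of them could be wrapped with the @evaluator decorator in the same way as the example that follows.

```python
import json
import re

# Hypothetical helpers: names, required keys, pattern, and limit are illustrative only.


def has_required_keys(output: str, required: tuple[str, ...] = ("name", "age")) -> bool:
    """Schema-style check: output must be a JSON object containing every required key."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(parsed, dict) and all(key in parsed for key in required)


def contains_iso_date(output: str) -> bool:
    """Regex check: output must contain a date shaped like YYYY-MM-DD."""
    return re.search(r"\b\d{4}-\d{2}-\d{2}\b", output) is not None


def within_length(output: str, max_chars: int = 280) -> bool:
    """Length check: output must not exceed max_chars characters."""
    return len(output) <= max_chars
```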
Let's define a simple evaluator that compares the model output to the gold answer. (This evaluator is case-insensitive and ignores leading and trailing whitespace.)
from patronus import evaluator, Row


@evaluator
def iexact_match(row: Row) -> bool:
    return row.evaluated_model_output.lower().strip() == row.evaluated_model_gold_answer.lower().strip()
In the example above, our evaluator returns a boolean value. The framework will automatically convert this into an EvaluationResult object. Alternatively, you can return an EvaluationResult directly from the function:
from patronus import evaluator, Row, EvaluationResult


@evaluator
def iexact_match(row: Row) -> EvaluationResult:
    pass_result = row.evaluated_model_output.lower().strip() == row.evaluated_model_gold_answer.lower().strip()
    # Map the boolean outcome to a numeric score.
    score = 1 if pass_result else 0
    return EvaluationResult(
        score_raw=score,
        pass_=pass_result,
    )
Evaluator Function Inputs
An evaluator function accepts the following inputs.
Field | Type | Description |
---|---|---|
row | patronus.Row | Object containing data needed for evaluation, such as model inputs and outputs. Rows do not have enforced schemas and can therefore be used with any data model. |
task_result | patronus.TaskResult | The result object returned by the task execution. This is not required for experiments that do not use tasks (i.e. the dataset contains all the information needed for the eval). |
evaluated_model_input | str | Input provided to the system to be evaluated. |
evaluated_model_output | str | The output generated by the system to be evaluated. |
evaluated_model_retrieved_context | list[str] | A list of context strings retrieved and provided to the model as additional information. This is typically used in a Retrieval-Augmented Generation (RAG) setup, where the model's response depends on external context or supporting information that has been fetched from a knowledge base or similar source. |
evaluated_model_gold_answer | str | The expected or correct answer that the model output is compared against during evaluation. |
The framework will automatically inject the appropriate values based on the parameter names you specify.
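To make the idea of name-based injection concrete, here is a minimal sketch of how a caller could inspect a function's signature and pass only the fields it asks for. This is not the SDK's implementation: call_with_matching_fields is a hypothetical helper, and only the field names are taken from the table above.

```python
import inspect
from typing import Any, Callable


def call_with_matching_fields(func: Callable[..., Any], available: dict[str, Any]) -> Any:
    """Call func with the entries of `available` whose keys match its parameter names.

    Simplification: assumes func declares only plain named parameters
    (no *args or **kwargs).
    """
    wanted = inspect.signature(func).parameters
    # A parameter with no default and no matching field cannot be satisfied.
    missing = [
        name
        for name, param in wanted.items()
        if name not in available and param.default is inspect.Parameter.empty
    ]
    if missing:
        raise TypeError(f"no value available for parameters: {missing}")
    return func(**{name: available[name] for name in wanted if name in available})


def iexact_match(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    # Declares only the two fields it needs; the input field is never passed in.
    return evaluated_model_output.strip().lower() == evaluated_model_gold_answer.strip().lower()


fields = {
    "evaluated_model_input": "Translate 'Good night' to French.",
    "evaluated_model_output": " bonne nuit ",
    "evaluated_model_gold_answer": "Bonne nuit",
}
result = call_with_matching_fields(iexact_match, fields)
```

Here result is True: the helper passes evaluated_model_output and evaluated_model_gold_answer and leaves out evaluated_model_input, because the function never declared it.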
Evaluator functions can return any of the following outputs:
- bool: A boolean value indicating whether the instance passes or fails the evaluation.
- EvaluationResult: An EvaluationResult consists of the following fields:
  - score_raw: The raw score.
  - pass_: A boolean indicating whether the score meets the pass threshold.
  - tags: A dictionary that can store additional metadata as key-value pairs, such as the threshold used during the evaluation.
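The relationship between the two return types can be sketched as follows. SketchResult is a hypothetical stand-in that mirrors the three fields listed above; it is not the SDK's EvaluationResult class. The bool-to-score mapping (1 for pass, 0 for fail) follows the earlier iexact_match example rather than any documented conversion rule, and length_ratio with its 0.8 threshold is purely illustrative.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class SketchResult:
    """Hypothetical stand-in with the same three fields as EvaluationResult."""
    score_raw: float
    pass_: bool
    tags: dict[str, str] = field(default_factory=dict)


def normalize(value: Union[bool, SketchResult]) -> SketchResult:
    """Wrap a bare bool in a result object; pass result objects through unchanged."""
    if isinstance(value, bool):
        return SketchResult(score_raw=1.0 if value else 0.0, pass_=value)
    return value


def length_ratio(output: str, gold: str, threshold: float = 0.8) -> SketchResult:
    """A richer return value: a graded score, with the threshold recorded in tags."""
    longer = max(len(output), len(gold))
    ratio = 1.0 if longer == 0 else min(len(output), len(gold)) / longer
    return SketchResult(
        score_raw=ratio,
        pass_=ratio >= threshold,
        tags={"threshold": str(threshold)},
    )
```

Returning a bool is enough when the check is strictly pass or fail. Returning a result object is worthwhile when the score is graded or when metadata such as the threshold should travel with the result.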
Full Code Example (Experiment)
Below is a complete example demonstrating how to use the iexact_match evaluator in an experiment.
from patronus import Client, evaluator, Row

client = Client()


@evaluator
def iexact_match(row: Row) -> bool:
    return row.evaluated_model_output.lower().strip() == row.evaluated_model_gold_answer.lower().strip()


client.experiment(
    "Tutorial",
    dataset=[
        {
            "evaluated_model_input": "Translate 'Good night' to French.",
            "evaluated_model_output": "bonne nuit",
            "evaluated_model_gold_answer": "Bonne nuit",
        },
        {
            "evaluated_model_input": "Summarize: 'AI improves efficiency'.",
            "evaluated_model_output": "ai improves efficiency",
            "evaluated_model_gold_answer": "AI improves efficiency",
        },
    ],
    evaluators=[iexact_match],
    experiment_name="Case Insensitive Match",
)
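Before running the experiment, it can be useful to sanity-check the comparison logic locally. The sketch below applies the same case-insensitive comparison to the two dataset rows as plain dictionaries, without the SDK or a client. plain_iexact_match is a hypothetical helper, not an SDK function.

```python
def plain_iexact_match(output: str, gold: str) -> bool:
    """Same comparison as iexact_match, on plain strings instead of a Row."""
    return output.lower().strip() == gold.lower().strip()


dataset = [
    {"evaluated_model_output": "bonne nuit", "evaluated_model_gold_answer": "Bonne nuit"},
    {"evaluated_model_output": "ai improves efficiency", "evaluated_model_gold_answer": "AI improves efficiency"},
]

outcomes = [
    plain_iexact_match(row["evaluated_model_output"], row["evaluated_model_gold_answer"])
    for row in dataset
]
print(outcomes)  # [True, True]
```

Both rows differ from their gold answers only in letter case, so both pass.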
See Using Evaluators in Logging for how to register function based evaluators for use in logging and real-time monitoring.