Function-Based Evaluators
The fastest way to define an evaluator is to write a function and apply the @evaluator decorator from our Python SDK. This approach is recommended for simple heuristic-based evals, such as:
- Schema validation on structured outputs
- Regex and string matching
- Length checks
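Each of these heuristics fits in a few lines of plain Python. The sketch below is illustrative only: the function names, the required-key set, the date pattern, and the 280-character limit are all hypothetical, and the SDK decorator is left off so the functions run without the package installed.

```python
import json
import re

# Hypothetical schema: the keys a structured output is expected to contain.
REQUIRED_KEYS = frozenset({"answer", "confidence"})

# Hypothetical pattern: a date written as YYYY-MM-DD.
ISO_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


def has_required_keys(output: str) -> bool:
    """Schema validation: output must be a JSON object with every required key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and REQUIRED_KEYS.issubset(data)


def contains_iso_date(output: str) -> bool:
    """Regex matching: output must mention at least one YYYY-MM-DD date."""
    return ISO_DATE.search(output) is not None


def within_length(output: str, max_chars: int = 280) -> bool:
    """Length check: output must not exceed max_chars characters."""
    return len(output) <= max_chars
```

All three return a plain boolean, which is the simplest return type an evaluator function can have.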
Let's define a simple evaluator that compares the model output to the gold answer. (This evaluator is case-insensitive and ignores leading and trailing whitespace.)
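A minimal sketch of such an evaluator follows. The name iexact_match is the one this page uses later, and the parameter names are taken from the inputs table on this page. The `evaluator` function defined here is a no-op stand-in so the snippet runs without the SDK; in real use the decorator would be imported from the Patronus Python SDK instead (the exact import path is not shown on this page, so treat it as an assumption).

```python
def evaluator(fn):
    """No-op stand-in for the SDK's @evaluator decorator (assumption: the real
    decorator is imported from the Patronus Python SDK)."""
    return fn


@evaluator
def iexact_match(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    # Strip surrounding whitespace on both sides, then compare case-insensitively.
    return (
        evaluated_model_output.strip().lower()
        == evaluated_model_gold_answer.strip().lower()
    )
```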
The evaluator described above returns a boolean value. The framework automatically converts this into an EvaluationResult object. Alternatively, the function can return an EvaluationResult directly.
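Returning an EvaluationResult might look like the sketch below. The `EvaluationResult` dataclass and the `evaluator` decorator here are local stand-ins, defined only so the snippet runs on its own; the stand-in class mirrors the three fields this page describes (score_raw, pass_, tags), and the real class may differ in its constructor or defaults. The tag key and value are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EvaluationResult:
    """Local stand-in with the fields this page describes; not the SDK class."""
    score_raw: Optional[float] = None
    pass_: Optional[bool] = None
    tags: dict = field(default_factory=dict)


def evaluator(fn):
    """No-op stand-in for the SDK's @evaluator decorator."""
    return fn


@evaluator
def iexact_match(
    evaluated_model_output: str, evaluated_model_gold_answer: str
) -> EvaluationResult:
    matched = (
        evaluated_model_output.strip().lower()
        == evaluated_model_gold_answer.strip().lower()
    )
    return EvaluationResult(
        score_raw=float(matched),                # 1.0 on a match, 0.0 otherwise
        pass_=matched,
        tags={"comparison": "case-insensitive"},  # hypothetical metadata
    )
```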
Evaluator Function Inputs
An evaluator function accepts the following inputs.
Field | Type | Description |
---|---|---|
row | patronus.Row | Object containing data needed for evaluation, such as model inputs and outputs. Rows do not have enforced schemas and can therefore be used with any data model. |
task_result | patronus.TaskResult | The result object returned by the task execution. This is not required for experiments that do not use tasks (i.e. the dataset contains all the information needed for the eval). |
evaluated_model_input | str | Input provided to the system to be evaluated. |
evaluated_model_output | str | The output generated by the system to be evaluated. |
evaluated_model_retrieved_context | list[str] | A list of context strings retrieved and provided to the model as additional information. This is typically used in a Retrieval-Augmented Generation (RAG) setup, where the model's response depends on external context or supporting information that has been fetched from a knowledge base or similar source. |
evaluated_model_gold_answer | str | The expected or correct answer that the model output is compared against during evaluation. |
The framework will automatically inject the appropriate values based on the parameter names you specify.
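To make name-based injection concrete, here is a small sketch of the general technique using `inspect.signature` from the standard library. It is not the framework's implementation; `call_with_injected_args` and the two sample functions are hypothetical helpers that only show how a caller can pass each function exactly the fields it names.

```python
import inspect


def call_with_injected_args(fn, available: dict):
    """Call fn with only the entries of `available` whose keys match its parameter names."""
    params = inspect.signature(fn).parameters
    kwargs = {name: available[name] for name in params if name in available}
    return fn(**kwargs)


def output_is_short(evaluated_model_output: str) -> bool:
    # Asks for one field only.
    return len(evaluated_model_output) <= 20


def output_matches_gold(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    # Asks for two fields.
    return evaluated_model_output.strip() == evaluated_model_gold_answer.strip()


record = {
    "evaluated_model_input": "Capital of France?",
    "evaluated_model_output": "Paris",
    "evaluated_model_gold_answer": "Paris",
}

short = call_with_injected_args(output_is_short, record)
exact = call_with_injected_args(output_matches_gold, record)
```

Both calls draw on the same record, but each function receives only the arguments that appear in its own signature.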
Evaluator functions can return either of the following outputs:
- bool: A boolean value indicating whether the instance passes or fails the evaluation.
- EvaluationResult: An object consisting of the following fields:
  - score_raw: The raw score.
  - pass_: A boolean indicating whether the score meets the pass threshold.
  - tags: A dictionary that can store additional metadata as key-value pairs, such as the threshold used during the evaluation.
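The sketch below shows one way the three fields could relate to each other: a raw score, a pass flag derived from a threshold, and the threshold recorded as a tag. The `EvaluationResult` dataclass is a local stand-in mirroring the fields above, not the SDK class, and `gold_word_coverage` and `PASS_THRESHOLD` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

PASS_THRESHOLD = 0.8  # hypothetical pass threshold


@dataclass
class EvaluationResult:
    """Local stand-in with the fields described above; not the SDK class."""
    score_raw: Optional[float] = None
    pass_: Optional[bool] = None
    tags: dict = field(default_factory=dict)


def gold_word_coverage(
    evaluated_model_output: str, evaluated_model_gold_answer: str
) -> EvaluationResult:
    gold_words = set(evaluated_model_gold_answer.lower().split())
    output_words = set(evaluated_model_output.lower().split())
    # Fraction of gold-answer words found in the output;
    # an empty gold answer is treated as fully covered.
    coverage = len(gold_words & output_words) / len(gold_words) if gold_words else 1.0
    return EvaluationResult(
        score_raw=coverage,
        pass_=coverage >= PASS_THRESHOLD,
        tags={"threshold": str(PASS_THRESHOLD)},
    )
```

Recording the threshold in tags keeps each result self-describing, so a later reader can tell why a given score passed or failed.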
Full Code Example (Experiment)
Below is a complete example demonstrating how to use the iexact_match evaluator in an experiment.
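The SDK's experiment call requires credentials and a project, and its exact signature is not shown on this page, so the sketch below substitutes a hypothetical local loop, `run_local_experiment`, to show the flow end to end: define the evaluator, assemble a dataset, run every evaluator over every row, and aggregate. The `evaluator` decorator is again a no-op stand-in, and the two dataset rows are made-up sample data.

```python
def evaluator(fn):
    """No-op stand-in for the SDK's @evaluator decorator."""
    return fn


@evaluator
def iexact_match(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    return (
        evaluated_model_output.strip().lower()
        == evaluated_model_gold_answer.strip().lower()
    )


def run_local_experiment(dataset, evaluators):
    """Hypothetical local stand-in for an experiment run: apply each evaluator
    to each row and return the pass rate per evaluator name.
    Assumes a non-empty dataset."""
    pass_rates = {}
    for ev in evaluators:
        outcomes = [
            ev(row["evaluated_model_output"], row["evaluated_model_gold_answer"])
            for row in dataset
        ]
        pass_rates[ev.__name__] = sum(outcomes) / len(outcomes)
    return pass_rates


# Made-up sample rows; the dataset already holds the outputs, so no task is needed.
dataset = [
    {
        "evaluated_model_input": "Capital of France?",
        "evaluated_model_output": " paris ",
        "evaluated_model_gold_answer": "Paris",
    },
    {
        "evaluated_model_input": "Largest planet in the solar system?",
        "evaluated_model_output": "Saturn",
        "evaluated_model_gold_answer": "Jupiter",
    },
]

pass_rates = run_local_experiment(dataset, [iexact_match])
```

With the real SDK, the evaluator definition stays the same and the local loop is replaced by the SDK's experiment runner.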
See Using Evaluators in Logging for how to register function-based evaluators for use in logging and real-time monitoring.