Function-Based Evaluators

The fastest way to define an evaluator is to write a function and mark it with the @evaluator decorator from our Python SDK. This is recommended for simple heuristic-based evals, such as:

  • Schema validation on structured outputs
  • Regex and string matching
  • Length checks
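To make the three kinds of heuristic concrete, here is a minimal sketch of each as a plain Python function using only the standard library. The function names, the `required_keys` and `max_chars` parameters, and the date pattern are illustrative assumptions, not part of the SDK; the same logic could be placed inside a function marked with @evaluator.

```python
import json
import re


def is_valid_json_object(output: str, required_keys: set[str]) -> bool:
    """Schema check: output parses as a JSON object with all required keys."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False
    # Only a JSON object (dict) can satisfy a key-based schema.
    return isinstance(parsed, dict) and required_keys.issubset(parsed)


def contains_iso_date(output: str) -> bool:
    """Regex check: output contains a date shaped like YYYY-MM-DD."""
    return re.search(r"\b\d{4}-\d{2}-\d{2}\b", output) is not None


def within_length(output: str, max_chars: int = 280) -> bool:
    """Length check: output is at most max_chars characters long."""
    return len(output) <= max_chars
```

Each function returns a plain bool, which keeps the pass/fail logic easy to unit test before it is wired into an experiment.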

Let's define a simple evaluator that compares the model output to the gold answer. (This evaluator is case-insensitive and ignores leading and trailing whitespace.)

from patronus import evaluator, Row

@evaluator
def iexact_match(row: Row) -> bool:
    return row.evaluated_model_output.lower().strip() == row.evaluated_model_gold_answer.lower().strip()

In the example above, our evaluator returns a boolean value. The framework will automatically convert this into an EvaluationResult object. Alternatively, you can return an EvaluationResult directly from the function:

from patronus import evaluator, Row, EvaluationResult

@evaluator
def iexact_match(row: Row) -> EvaluationResult:
    # Case-insensitive comparison that ignores surrounding whitespace.
    pass_result = row.evaluated_model_output.lower().strip() == row.evaluated_model_gold_answer.lower().strip()
    score = 1 if pass_result else 0
    return EvaluationResult(
        score_raw=score,
        pass_=pass_result,
    )

Evaluator Function Inputs

An evaluator function accepts the following inputs.

| Field | Type | Description |
| --- | --- | --- |
| row | patronus.Row | Object containing data needed for evaluation, such as model inputs and outputs. Rows do not have enforced schemas and can therefore be used with any data model. |
| task_result | patronus.TaskResult | The result object returned by the task execution. This is not required for experiments that do not use tasks (i.e. the dataset contains all the information needed for the eval). |
| evaluated_model_input | str | Input provided to the system to be evaluated. |
| evaluated_model_output | str | The output generated by the system to be evaluated. |
| evaluated_model_retrieved_context | list[str] | A list of context strings retrieved and provided to the model as additional information. This is typically used in a Retrieval-Augmented Generation (RAG) setup, where the model's response depends on external context or supporting information that has been fetched from a knowledge base or similar source. |
| evaluated_model_gold_answer | str | The expected or correct answer that the model output is compared against during evaluation. |

The framework will automatically inject the appropriate values based on the parameter names you specify.
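Injection by parameter name can be pictured with a short standard-library sketch. This is only an illustration of the general technique and an assumption about the mechanism, not the SDK's actual implementation; `call_with_injected_args` and the `values` dictionary are hypothetical names introduced here.

```python
import inspect
from typing import Any, Callable


def call_with_injected_args(func: Callable[..., Any], available: dict[str, Any]) -> Any:
    """Call func, passing only the available values whose names match its parameters.

    Simplification: only plain named parameters are handled (no *args/**kwargs).
    """
    kwargs = {}
    for name, param in inspect.signature(func).parameters.items():
        if name in available:
            kwargs[name] = available[name]
        elif param.default is inspect.Parameter.empty:
            # A required parameter that the caller cannot supply.
            raise TypeError(f"no value available for parameter {name!r}")
    return func(**kwargs)


def iexact_match(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    # The function asks only for the two fields it needs.
    return evaluated_model_output.lower().strip() == evaluated_model_gold_answer.lower().strip()


values = {
    "evaluated_model_input": "Translate 'Good night' to French.",
    "evaluated_model_output": " bonne nuit ",
    "evaluated_model_gold_answer": "Bonne nuit",
}
result = call_with_injected_args(iexact_match, values)
```

Here `iexact_match` never sees `evaluated_model_input` because it does not declare a parameter with that name.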

Evaluator functions can return any of the following outputs:

  • bool: A boolean value indicating whether the instance passes or fails the evaluation.
  • EvaluationResult: An EvaluationResult consists of the following fields:
    • score_raw: The raw numeric score produced by the evaluator.
    • pass_: A boolean indicating whether the score meets the pass threshold.
    • tags: A dictionary that can store additional metadata as key-value pairs, such as the threshold used during the evaluation.
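The relationship between a raw score, a pass threshold, and tags can be sketched as follows. `SketchResult` is a hypothetical stand-in that mirrors the three fields listed above so the example runs without the SDK; it is not the real EvaluationResult class. The `fuzzy_match` function, its default threshold of 0.8, and the choice of string-valued tags are likewise assumptions made for illustration.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class SketchResult:
    """Hypothetical stand-in with the same three fields as EvaluationResult."""
    score_raw: float
    pass_: bool
    tags: dict[str, str] = field(default_factory=dict)


def fuzzy_match(output: str, gold_answer: str, threshold: float = 0.8) -> SketchResult:
    """Score string similarity in [0, 1] and pass when it meets the threshold."""
    score = SequenceMatcher(
        None, output.lower().strip(), gold_answer.lower().strip()
    ).ratio()
    return SketchResult(
        score_raw=score,
        pass_=score >= threshold,
        # Record the threshold so the result can be interpreted later.
        tags={"threshold": str(threshold)},
    )
```

Returning the score alongside the boolean keeps near misses visible, which a bare bool return value would hide.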

Full Code Example (Experiment)

Below is a complete example demonstrating how to use the iexact_match evaluator in an experiment.

from patronus import Client, evaluator, Row

client = Client()


@evaluator
def iexact_match(row: Row) -> bool:
    return row.evaluated_model_output.lower().strip() == row.evaluated_model_gold_answer.lower().strip()


client.experiment(
    "Tutorial",
    dataset=[
        {
            "evaluated_model_input": "Translate 'Good night' to French.",
            "evaluated_model_output": "bonne nuit",
            "evaluated_model_gold_answer": "Bonne nuit",
        },
        {
            "evaluated_model_input": "Summarize: 'AI improves efficiency'.",
            "evaluated_model_output": "ai improves efficiency",
            "evaluated_model_gold_answer": "AI improves efficiency",
        },
    ],
    evaluators=[iexact_match],
    experiment_name="Case Insensitive Match",
)

See Using Evaluators in Logging for how to register function-based evaluators for use in logging and real-time monitoring.