User Defined Evaluators πŸ–ŠοΈ

With our SDK, developers can create evaluators in code that define custom evaluation logic.

An Evaluator class consists of the following components (a minimal sketch putting them together follows the list):

  • evaluate function: The logic for executing an evaluation. This can be a heuristic check or an LLM call, and it can even leverage tools or access your local filesystem to validate outputs.
  • pass_threshold: A float value defining the cutoff for whether a raw score passes or fails the evaluation. This field is optional.
  • EvaluationResult: An EvaluationResult consists of the following fields:
    • score_raw: The raw score calculated by the evaluation.
    • pass_: A boolean indicating whether the score meets the pass threshold.
    • tags: A dictionary that can store additional metadata as key-value pairs, such as the threshold used during the evaluation.
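
Putting these pieces together, a minimal class-based evaluator might look like the following sketch. The ExactMatch name and its simple string comparison are purely illustrative and not part of the SDK:

from patronus import Evaluator, EvaluationResult, Row


class ExactMatch(Evaluator):
    def __init__(self, pass_threshold: float = 1.0):
        self.pass_threshold = pass_threshold
        super().__init__()

    def evaluate(self, row: Row) -> EvaluationResult:
        # Score 1.0 on an exact string match, 0.0 otherwise
        score = 1.0 if row.evaluated_model_output.strip() == row.evaluated_model_gold_answer.strip() else 0.0
        return EvaluationResult(
            score_raw=score,
            pass_=score >= self.pass_threshold,
            tags={"pass_threshold": str(self.pass_threshold)},
        )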

Example: BERTScore

We'll use BERTScore to measure embedding similarity. BERTScore measures the cosine similarity between two BERT embeddings, which can be used to compare string similarity. In our case, we want to compare the model's output to the gold answer. The output doesn't need to be an exact match, but it should be close. Additionally, we want to be able to set a threshold to determine whether the evaluation passes or not.

Before we can start, we need to install the Transformers and PyTorch dependencies.

pip install transformers torch

Now we can write our class-based evaluator.

from transformers import BertTokenizer, BertModel
import numpy as np
from patronus import Evaluator, EvaluationResult, Row


class BERTScore(Evaluator):
    def __init__(self, pass_threshold: float):
        self.pass_threshold = pass_threshold
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased")

        super().__init__()

    def evaluate(self, row: Row) -> EvaluationResult:
        # Tokenize text
        output_toks = self.tokenizer(row.evaluated_model_output, return_tensors="pt", padding=True, truncation=True)
        gold_answer_toks = self.tokenizer(
            row.evaluated_model_gold_answer, return_tensors="pt", padding=True, truncation=True
        )

        # Obtain sentence embeddings by mean-pooling BERT's token embeddings
        output_embeds = self.model(**output_toks).last_hidden_state.mean(dim=1).detach().numpy()
        gold_answer_embeds = self.model(**gold_answer_toks).last_hidden_state.mean(dim=1).detach().numpy()

        # Calculate cosine similarity between the two embeddings
        score = np.dot(output_embeds, gold_answer_embeds.T) / (
            np.linalg.norm(output_embeds) * np.linalg.norm(gold_answer_embeds)
        )
        # np.dot on the (1, hidden) embeddings yields a 1x1 array, so extract the scalar
        score = score.item()

        return EvaluationResult(
            score_raw=score,
            pass_=score >= self.pass_threshold,
            tags={"pass_threshold": str(self.pass_threshold)},
        )
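
Because the tokenizer and model are loaded in __init__, it's best to construct the evaluator once and reuse the instance. For example (a usage sketch with an arbitrary threshold):

bertscore = BERTScore(pass_threshold=0.8)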

A class-based evaluator needs to inherit from the Evaluator base class and must implement the evaluate() method. Similar to the function-based evaluator, the evaluate() method only accepts predefined parameter names.

The return type of the evaluate() method can be a bool or an EvaluationResult object, as shown in this example. The EvaluationResult object provides a more detailed assessment by including:

  • score_raw: The calculated score reflecting the similarity between the evaluated output and the gold answer. While the score is often normalized between 0 and 1, with 1 representing an exact match, it doesn’t have to be normalized.
  • pass_: A boolean indicating whether the score meets the pass threshold. Since BERTScore is continuous, we use the pass threshold to turn it into a pass/fail result.
  • tags: A dictionary that can store additional metadata as key-value pairs. Here we log the threshold used during the evaluation.

Returning a boolean is equivalent to returning an EvaluationResult object with only the pass_ field set.
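
For example, a boolean-only evaluator can be written as the following sketch; ContainsGoldAnswer is a hypothetical name used here for illustration:

from patronus import Evaluator, Row


class ContainsGoldAnswer(Evaluator):
    def evaluate(self, row: Row) -> bool:
        # Returning a bool is shorthand for an EvaluationResult with only pass_ set
        return row.evaluated_model_gold_answer.lower() in row.evaluated_model_output.lower()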