Class Based Evaluators

For most evals, we recommend creating class-based evaluators. To create one, define a class that inherits from patronus.Evaluator. Class-based evaluators can be extended to cover a wide range of use cases, including:

  • Embedding similarity measurements
  • Custom LLM judges
  • Validation against data retrieved from internal APIs

An Evaluator class consists of the following components:

  • evaluate function: The logic for executing an evaluation. This can be a heuristic check or an LLM call, and it can also use tools or access your local filesystem to validate outputs.
  • pass_threshold: A float value defining the cutoff for whether a raw score passes or fails the evaluation. This field is optional.
  • EvaluationResult: The object returned by evaluate. It consists of the following fields:
    • score_raw: The raw score produced by the evaluation.
    • pass_: A boolean indicating whether the score meets the pass threshold.
    • tags: A dictionary that can store additional metadata as key-value pairs, such as the threshold used during the evaluation.
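
As a rough illustration of how these components fit together, the sketch below uses stand-in classes instead of the Patronus SDK so that it runs on its own. SketchResult and LengthRatioEvaluator are hypothetical names invented for this example; a real evaluator would inherit from patronus.Evaluator and return a patronus.EvaluationResult.

```python
from dataclasses import dataclass, field


@dataclass
class SketchResult:
    """Stand-in for EvaluationResult (hypothetical, not the SDK class)."""

    score_raw: float
    pass_: bool
    tags: dict = field(default_factory=dict)


class LengthRatioEvaluator:
    """Hypothetical heuristic evaluator comparing output and gold answer lengths."""

    def __init__(self, pass_threshold: float):
        self.pass_threshold = pass_threshold

    def evaluate(self, output: str, gold_answer: str) -> SketchResult:
        # Ratio of the shorter length to the longer one, always in [0, 1]
        shorter, longer = sorted([len(output), len(gold_answer)])
        score = shorter / longer if longer else 1.0
        return SketchResult(
            score_raw=score,
            pass_=score >= self.pass_threshold,
            tags={"pass_threshold": str(self.pass_threshold)},
        )


result = LengthRatioEvaluator(pass_threshold=0.5).evaluate("Hasta luego", "Adiós")
print(result)
```

The raw score (5 / 11 here) falls below the 0.5 threshold, so pass_ is False, and the threshold is recorded in tags for later inspection.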

Example: BERTScore

We'll use BERTScore to measure embedding similarity. In this example, the score is the cosine similarity between the BERT embeddings of two strings, which indicates how similar the strings are. In our case, we want to compare the model's output to the gold answer. The output doesn't need to be an exact match, but it should be close. We also want to set a threshold that determines whether the evaluation passes.
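
To make the scoring step concrete, here is a minimal sketch of cosine similarity followed by a threshold check. The vectors are made-up three-dimensional values standing in for BERT embeddings, and cosine_similarity is a helper name chosen for this example only.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for embeddings (illustrative values only)
output_vec = np.array([1.0, 2.0, 3.0])
gold_vec = np.array([2.0, 4.0, 6.0])    # same direction as output_vec
other_vec = np.array([3.0, 0.0, -1.0])  # orthogonal to output_vec

pass_threshold = 0.8
same_direction = cosine_similarity(output_vec, gold_vec)
orthogonal = cosine_similarity(output_vec, other_vec)

# A score passes when it meets or exceeds the threshold
same_direction_passes = same_direction >= pass_threshold
orthogonal_passes = orthogonal >= pass_threshold
```

Vectors pointing in the same direction score close to 1 regardless of their length, while orthogonal vectors score close to 0, so only the first pair clears the 0.8 threshold.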

Before we start, we need to install the Transformers and PyTorch dependencies.

pip install transformers torch

Now we can write our class-based evaluator.

from transformers import BertTokenizer, BertModel
import numpy as np
from patronus import Evaluator, EvaluationResult, Row


class BERTScore(Evaluator):
    def __init__(self, pass_threshold: float):
        self.pass_threshold = pass_threshold
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased")

        super().__init__()

    def evaluate(self, row: Row) -> EvaluationResult:
        # Tokenize text
        output_toks = self.tokenizer(row.evaluated_model_output, return_tensors="pt", padding=True, truncation=True)
        gold_answer_toks = self.tokenizer(
            row.evaluated_model_gold_answer, return_tensors="pt", padding=True, truncation=True
        )

        # Obtain embeddings from BERT model
        output_embeds = self.model(**output_toks).last_hidden_state.mean(dim=1).detach().numpy()
        gold_answer_embeds = self.model(**gold_answer_toks).last_hidden_state.mean(dim=1).detach().numpy()

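        # Calculate cosine similarity; index [0] drops the batch dimension so the score is a scalar
        output_vec, gold_vec = output_embeds[0], gold_answer_embeds[0]
        score = float(np.dot(output_vec, gold_vec) / (np.linalg.norm(output_vec) * np.linalg.norm(gold_vec)))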

        return EvaluationResult(
            score_raw=score,
            pass_=score >= self.pass_threshold,
            tags={"pass_threshold": str(self.pass_threshold)},
        )
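
The .mean(dim=1) call in evaluate() averages the per-token vectors into one vector per input. The NumPy sketch below mimics that reduction on a made-up array with the same (batch, tokens, hidden) layout; the sizes and values are illustrative and no model is loaded.

```python
import numpy as np

# Fake "last_hidden_state": 1 input, 4 tokens, hidden size 3 (illustrative values)
last_hidden_state = np.array(
    [[[1.0, 0.0, 2.0],
      [3.0, 0.0, 2.0],
      [1.0, 4.0, 2.0],
      [3.0, 4.0, 2.0]]]
)

# Average over the token axis (axis=1), as .mean(dim=1) does in PyTorch
pooled = last_hidden_state.mean(axis=1)

print(last_hidden_state.shape)  # (1, 4, 3)
print(pooled.shape)             # (1, 3)
print(pooled[0])                # [2. 2. 2.]
```

The pooled array keeps a leading batch dimension, so it needs to be indexed with [0] (or flattened) before it can yield a scalar similarity.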

A class-based evaluator needs to inherit from the Evaluator base class and must implement the evaluate() method. Similar to the function-based evaluator, the evaluate() method only accepts predefined parameter names.

The return type of the evaluate() method can be a bool or an EvaluationResult object, as shown in this example. The EvaluationResult object provides a more detailed assessment by including:

  • score_raw: The calculated score reflecting the similarity between the evaluated output and the gold answer. While the score is often normalized between 0 and 1, with 1 representing an exact match, it doesn’t have to be normalized.
  • pass_: The boolean indicating whether the score meets the pass threshold. Since BERTScore is continuous, we set a pass threshold.
  • tags: A dictionary that can store additional metadata as key-value pairs. Here we log the threshold used during the evaluation.

Returning a boolean is equivalent to returning an EvaluationResult object with only the pass_ value set.

To run an experiment with a class-based evaluator, pass an instance of the class in the evaluators list. Constructor parameters, such as the pass threshold, are supplied when you instantiate the evaluator:

client.experiment(
    "Tutorial",
    dataset=[
        {
            "evaluated_model_input": "Translate 'Goodbye' to Spanish.",
            "evaluated_model_output": "Hasta luego",
            "evaluated_model_gold_answer": "Adiós",
        },
        {
            "evaluated_model_input": "Summarize: 'The quick brown fox jumps over the lazy dog'.",
            "evaluated_model_output": "Quick brown fox jumps over dog",
            "evaluated_model_gold_answer": "The quick brown fox jumps over the lazy dog",
        },
    ],
    evaluators=[BERTScore(pass_threshold=0.8)],
    experiment_name="BERTScore Output Label Similarity",
)