User Defined Evaluators
Developers can create evaluators in code that define custom evaluation logic in our SDK.
An `Evaluator` class consists of the following components (a minimal sketch follows the list):

- `evaluate` function: The logic for executing an evaluation. This can be a heuristic check, an LLM call, or even leverage tools or access your local filesystem to validate outputs.
- `pass_threshold`: A float value defining the cutoff for whether a raw score passes or fails the evaluation. This field is optional.
- `EvaluationResult`: An `EvaluationResult` consists of the following fields:
  - `score_raw`: The raw numeric score produced by the evaluation.
  - `pass_`: A boolean indicating whether the score meets the pass threshold.
  - `tags`: A dictionary that can store additional metadata as key-value pairs, such as the threshold used during the evaluation.
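Here is a minimal sketch that wires these components together. The `ExactMatch` name and the exact-match check are illustrative, not part of the SDK; they stand in for whatever logic your evaluator needs:

```python
from patronus import Evaluator, EvaluationResult, Row


# Illustrative sketch: a trivial class-based evaluator built from the
# components listed above. The exact-match check is a stand-in for your logic.
class ExactMatch(Evaluator):
    def evaluate(self, row: Row) -> EvaluationResult:
        matched = row.evaluated_model_output == row.evaluated_model_gold_answer
        return EvaluationResult(
            score_raw=1.0 if matched else 0.0,  # raw score of the check
            pass_=matched,                      # whether the check passed
            tags={"check": "exact_match"},      # extra metadata
        )
```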
Example: BERTScore
We'll use BERTScore to measure embedding similarity. BERTScore measures the cosine similarity between two BERT embeddings, which can be used to compare string similarity. In our case, we want to compare the model's output to the gold answer. The output doesn't need to be an exact match, but it should be close. Additionally, we want to be able to set a threshold to determine whether the evaluation passes or not.
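As a quick illustration of the underlying metric, here is cosine similarity computed on two toy vectors; the vectors and the resulting number are made up for illustration only:

```python
import numpy as np

# Toy cosine similarity: 1.0 means the vectors point the same way,
# values near 0 mean they are unrelated. BERTScore applies this to
# BERT sentence embeddings instead of hand-written vectors.
a = np.array([0.2, 0.7, 0.1])
b = np.array([0.3, 0.6, 0.2])
cos_sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
print(round(cos_sim, 3))
```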
Before we can start, we need to install the Transformers and PyTorch dependencies.
```bash
pip install transformers torch
```
Now we can write our class-based evaluator.
```python
from transformers import BertTokenizer, BertModel
import numpy as np
from patronus import Evaluator, EvaluationResult, Row


class BERTScore(Evaluator):
    def __init__(self, pass_threshold: float):
        self.pass_threshold = pass_threshold
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased")
        super().__init__()

    def evaluate(self, row: Row) -> EvaluationResult:
        # Tokenize text
        output_toks = self.tokenizer(row.evaluated_model_output, return_tensors="pt", padding=True, truncation=True)
        gold_answer_toks = self.tokenizer(
            row.evaluated_model_gold_answer, return_tensors="pt", padding=True, truncation=True
        )

        # Obtain embeddings from BERT model
        output_embeds = self.model(**output_toks).last_hidden_state.mean(dim=1).detach().numpy()
        gold_answer_embeds = self.model(**gold_answer_toks).last_hidden_state.mean(dim=1).detach().numpy()

        # Calculate cosine similarity
        score = np.dot(output_embeds, gold_answer_embeds.T) / (
            np.linalg.norm(output_embeds) * np.linalg.norm(gold_answer_embeds)
        )

        return EvaluationResult(
            score_raw=score,
            pass_=score >= self.pass_threshold,
            tags={"pass_threshold": str(self.pass_threshold)},
        )
```
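To sanity-check the scoring logic outside the SDK, you can run the same embedding and cosine-similarity steps on two strings directly. The example sentences below are made up and the exact score will vary slightly:

```python
from transformers import BertTokenizer, BertModel
import numpy as np

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")


def embed(text: str) -> np.ndarray:
    # Mean-pool the last hidden state into a single sentence vector,
    # mirroring what BERTScore.evaluate() does above.
    toks = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    return model(**toks).last_hidden_state.mean(dim=1).detach().numpy()


a = embed("Paris is the capital of France.")
b = embed("The capital of France is Paris.")
score = np.dot(a, b.T) / (np.linalg.norm(a) * np.linalg.norm(b))
print(score.item())  # near-paraphrases score close to 1.0
```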
A class-based evaluator needs to inherit from the `Evaluator` base class and must implement the `evaluate()` method. Similar to the function-based evaluator, the `evaluate()` method only accepts predefined parameter names.
The return type of the `evaluate()` method can be a bool or an `EvaluationResult` object, as shown in this example. The `EvaluationResult` object provides a more detailed assessment by including:

- `score_raw`: The calculated score reflecting the similarity between the evaluated output and the gold answer. While the score is often normalized between 0 and 1, with 1 representing an exact match, it doesn't have to be normalized.
- `pass_`: The boolean indicating whether the score meets the pass threshold. Since BERTScore is continuous, we set a pass threshold.
- `tags`: A dictionary that can store additional metadata as key-value pairs. Here we log the threshold used during the evaluation.
If the return type is a boolean, that is equivalent to returning an `EvaluationResult` object with only the `pass_` value set.
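For example, a purely boolean evaluator could be sketched like this; the `ContainsGoldAnswer` name and the containment check are illustrative only:

```python
from patronus import Evaluator, Row


# Sketch of the boolean form: returning True/False is equivalent to
# returning an EvaluationResult with only pass_ set.
class ContainsGoldAnswer(Evaluator):
    def evaluate(self, row: Row) -> bool:
        return row.evaluated_model_gold_answer in row.evaluated_model_output
```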