Class Based Evaluators
For most evals, we recommend creating class-based evaluators. To create a class-based evaluator, create a class that inherits from patronus.Evaluator. Class-based evaluators can be extended for any imaginable use case, including:
- Embedding similarity measurements
- Custom LLM judges
- Validation against data retrieved from internal APIs
An Evaluator class consists of the following components:
- evaluate function: The logic for executing an evaluation. This can be a heuristic check or an LLM call, and it can even leverage tools or access your local filesystem to validate outputs.
- pass_threshold: A float value defining the cutoff for whether a raw score passes or fails the evaluation. This field is optional.
- EvaluationResult: An EvaluationResult consists of the following fields:
  - score_raw: The raw score produced by the evaluation.
  - pass_: A boolean indicating whether the score meets the pass threshold.
  - tags: A dictionary that can store additional metadata as key-value pairs, such as the threshold used during the evaluation.
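The way these components fit together can be sketched without the SDK installed. The following is a minimal illustration, not the patronus implementation: SimpleResult is a stand-in dataclass that only mirrors the three fields listed above, and WordOverlap is a hypothetical heuristic evaluator whose evaluate method takes plain strings rather than the row object a real evaluator receives.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SimpleResult:
    """Stand-in with the same three fields as EvaluationResult (not the patronus class)."""
    score_raw: Optional[float] = None
    pass_: Optional[bool] = None
    tags: Dict[str, str] = field(default_factory=dict)


class WordOverlap:
    """Hypothetical evaluator: share of gold-answer words that appear in the output."""

    def __init__(self, pass_threshold: float):
        self.pass_threshold = pass_threshold

    def evaluate(self, output: str, gold_answer: str) -> SimpleResult:
        gold_words = gold_answer.lower().split()
        output_words = set(output.lower().split())
        # Heuristic check: count gold words that also occur in the output
        matched = sum(1 for word in gold_words if word in output_words)
        score = matched / len(gold_words) if gold_words else 0.0
        return SimpleResult(
            score_raw=score,
            pass_=score >= self.pass_threshold,  # compare raw score to the cutoff
            tags={"pass_threshold": str(self.pass_threshold)},
        )


result = WordOverlap(pass_threshold=0.5).evaluate(
    output="quick brown fox", gold_answer="The quick brown fox"
)
print(result)
```

Three of the four gold-answer words appear in the output, so the raw score is 0.75 and the result passes the 0.5 threshold.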
Example: BERTScore
We'll use BERTScore to measure embedding similarity. BERTScore measures the cosine similarity between two BERT embeddings, which can be used to compare string similarity. In our case, we want to compare the model's output to the gold answer. The output doesn't need to be an exact match, but it should be close. Additionally, we want to be able to set a threshold to determine whether the evaluation passes or not.
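As a quick illustration of the similarity measure itself, here is a small sketch of cosine similarity with a pass threshold. The three-dimensional vectors are made-up values standing in for BERT embeddings, and the 0.8 cutoff is only an assumed example value.

```python
import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Toy vectors standing in for embeddings (made-up values)
output_vec = np.array([1.0, 2.0, 3.0])
gold_vec = np.array([2.0, 4.0, 6.0])     # same direction as output_vec
other_vec = np.array([3.0, 0.0, -1.0])   # orthogonal to output_vec

pass_threshold = 0.8  # assumed cutoff for this sketch

same_direction = cosine_similarity(output_vec, gold_vec)
orthogonal = cosine_similarity(output_vec, other_vec)

print(same_direction >= pass_threshold)  # vectors point the same way
print(orthogonal >= pass_threshold)      # vectors share no direction
```

Cosine similarity depends only on the direction of the vectors, not their length, which is why a scaled copy of a vector scores as a match.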
Before we can start we need to install the Transformers and PyTorch dependencies.
pip install transformers torch
Now we can write our class-based evaluator.
from transformers import BertTokenizer, BertModel
import numpy as np
from patronus import Evaluator, EvaluationResult, Row


class BERTScore(Evaluator):
    def __init__(self, pass_threshold: float):
        self.pass_threshold = pass_threshold
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased")
        super().__init__()

    def evaluate(self, row: Row) -> EvaluationResult:
        # Tokenize text
        output_toks = self.tokenizer(row.evaluated_model_output, return_tensors="pt", padding=True, truncation=True)
        gold_answer_toks = self.tokenizer(
            row.evaluated_model_gold_answer, return_tensors="pt", padding=True, truncation=True
        )
        # Obtain embeddings from BERT model (mean over tokens gives one vector per text)
        output_embeds = self.model(**output_toks).last_hidden_state.mean(dim=1).detach().numpy()
        gold_answer_embeds = self.model(**gold_answer_toks).last_hidden_state.mean(dim=1).detach().numpy()
        # Calculate cosine similarity; .item() turns the 1x1 array into a Python float
        score = (
            np.dot(output_embeds, gold_answer_embeds.T)
            / (np.linalg.norm(output_embeds) * np.linalg.norm(gold_answer_embeds))
        ).item()
        return EvaluationResult(
            score_raw=score,
            pass_=score >= self.pass_threshold,
            tags={"pass_threshold": str(self.pass_threshold)},
        )
A class-based evaluator needs to inherit from the Evaluator base class and must implement the evaluate() method. Similar to the function-based evaluator, the evaluate() method only accepts predefined parameter names.
The return type of the evaluate() method can be a bool or an EvaluationResult object, as shown in this example. The EvaluationResult object provides a more detailed assessment by including:
- score_raw: The calculated score reflecting the similarity between the evaluated output and the gold answer. While the score is often normalized between 0 and 1, with 1 representing an exact match, it doesn’t have to be normalized.
- pass_: The boolean indicating whether the score meets the pass threshold. Since BERTScore is continuous, we set a pass threshold.
- tags: A dictionary that can store additional metadata as key-value pairs. Here we log the threshold used during the evaluation.
If the return type is a boolean, that is equivalent to returning an EvaluationResult object with only the pass_ value set.
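The equivalence between the two return types can be pictured as follows. This is only an illustration of the behavior described above, not how the SDK handles it internally: SimpleResult is a stand-in dataclass rather than the patronus EvaluationResult, and normalize and is_non_empty are hypothetical helpers.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union


@dataclass
class SimpleResult:
    """Stand-in with the same three fields as EvaluationResult (not the patronus class)."""
    score_raw: Optional[float] = None
    pass_: Optional[bool] = None
    tags: Dict[str, str] = field(default_factory=dict)


def normalize(result: Union[bool, SimpleResult]) -> SimpleResult:
    """Hypothetical helper: wrap a bare bool as a result with only pass_ set."""
    if isinstance(result, bool):
        return SimpleResult(pass_=result)
    return result


def is_non_empty(output: str) -> bool:
    """Toy evaluation logic that returns a plain boolean."""
    return len(output.strip()) > 0


wrapped = normalize(is_non_empty("Hasta luego"))
print(wrapped)
```

In the wrapped result, score_raw stays unset and tags stays empty, so a boolean return is a reasonable choice when the check has no meaningful score to report.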
To run an experiment with a class-based evaluator, pass an instance of the class in the evaluators list. You can instantiate the evaluator with parameters as follows:
client.experiment(
    "Tutorial",
    dataset=[
        {
            "evaluated_model_input": "Translate 'Goodbye' to Spanish.",
            "evaluated_model_output": "Hasta luego",
            "evaluated_model_gold_answer": "Adiós",
        },
        {
            "evaluated_model_input": "Summarize: 'The quick brown fox jumps over the lazy dog'.",
            "evaluated_model_output": "Quick brown fox jumps over dog",
            "evaluated_model_gold_answer": "The quick brown fox jumps over the lazy dog",
        },
    ],
    evaluators=[BERTScore(pass_threshold=0.8)],
    experiment_name="BERTScore Output Label Similarity",
)