Working with Evaluators
The Patronus AI Experimentation Framework lets you write your own evaluators that run locally, as well as use remote Patronus-hosted evaluators. This document shows you how to write your own evaluators and how to combine them with Patronus evaluators.
Function-Based Evaluators
Let's define a simple evaluator that compares the model output to the gold answer. This evaluator is case-insensitive and ignores leading and trailing whitespace.
```python
from patronus import evaluator

@evaluator
def iexact_match(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    return evaluated_model_output.lower().strip() == evaluated_model_gold_answer.lower().strip()
```
To define a function-based evaluator, you need to wrap your evaluator function with the `@evaluator` decorator.
When defining parameters for the function, you must name them correctly. The available options are:

- `evaluated_model_system_prompt`
- `evaluated_model_retrieved_context`
- `evaluated_model_input`
- `evaluated_model_output`
- `evaluated_model_gold_answer`

These parameters are injected by the framework. The values correspond directly to the fields provided in the dataset or, in the case of `evaluated_model_output`, may be returned by the task.
In its simplest form, an evaluator only needs to return a boolean value. However, an evaluator can also return an `EvaluationResult` object, which we'll discuss in the next section.
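As a quick preview, a function-based evaluator can also return a scored result instead of a bare boolean. The sketch below uses a plain function and a simplified stand-in class so it runs without the SDK; in your project you would import `EvaluationResult` from `patronus` and decorate the function with `@evaluator`. The token-overlap metric is purely illustrative, not part of the framework:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class EvaluationResult:  # simplified stand-in for patronus.EvaluationResult
    score_raw: Optional[float] = None
    pass_: Optional[bool] = None
    tags: dict = field(default_factory=dict)

def token_overlap(evaluated_model_output: str, evaluated_model_gold_answer: str) -> EvaluationResult:
    # Illustrative metric: fraction of gold-answer tokens present in the output.
    out = set(evaluated_model_output.lower().split())
    gold = set(evaluated_model_gold_answer.lower().split())
    score = len(out & gold) / len(gold) if gold else 1.0
    return EvaluationResult(score_raw=score, pass_=score >= 0.5, tags={"metric": "token_overlap"})
```

For example, `token_overlap("ai improves efficiency", "AI improves efficiency greatly")` scores 0.75 and passes.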
Full Code Example
Below is a complete example demonstrating how to use the `iexact_match` evaluator:
```python
from patronus import Client, evaluator

client = Client()

@evaluator
def iexact_match(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    return evaluated_model_output.lower().strip() == evaluated_model_gold_answer.lower().strip()

client.experiment(
    "Tutorial",
    data=[
        {
            "evaluated_model_input": "Translate 'Good night' to French.",
            "evaluated_model_output": "bonne nuit",
            "evaluated_model_gold_answer": "Bonne nuit",
        },
        {
            "evaluated_model_input": "Summarize: 'AI improves efficiency'.",
            "evaluated_model_output": "ai improves efficiency",
            "evaluated_model_gold_answer": "AI improves efficiency",
        },
    ],
    evaluators=[iexact_match],
    experiment_name="Case Insensitive Match",
)
```
Class-Based Evaluators
For a more complex example, we'll use the Levenshtein distance. Levenshtein distance measures the difference between two strings. In our case, we want to compare the model's output to the gold answer. The output doesn't need to be an exact match, but it should be close. Additionally, we want to be able to set a threshold to determine whether the evaluation passes or not.
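To make the scoring concrete: we will normalize the edit distance by the length of the longer string, so 1.0 means an exact match and 0.0 means the strings are completely different. Here is a pure-Python sketch of that formula (the `Levenshtein` package used below computes the distance much faster in C):

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

def normalized_score(output: str, gold: str) -> float:
    max_len = max(len(output), len(gold))
    return 1.0 if max_len == 0 else 1 - levenshtein(output, gold) / max_len

print(round(normalized_score("kitten", "sitting"), 3))  # 3 edits over 7 chars -> 0.571
```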
Before we can start, we need to install the `Levenshtein` dependency:

```shell
pip install Levenshtein
```
Now we can write our class-based evaluator.
```python
from Levenshtein import distance

from patronus import Evaluator, EvaluationResult


class LevenshteinScorer(Evaluator):
    def __init__(self, pass_threshold: float):
        self.pass_threshold = pass_threshold
        super().__init__()

    def evaluate(self, evaluated_model_output: str, evaluated_model_gold_answer: str) -> EvaluationResult:
        max_len = max(len(x) for x in [evaluated_model_output, evaluated_model_gold_answer])
        score = 1
        if max_len > 0:
            score = 1 - (distance(evaluated_model_output, evaluated_model_gold_answer) / max_len)
        return EvaluationResult(
            score_raw=score,
            pass_=score >= self.pass_threshold,
            tags={"pass_threshold": str(self.pass_threshold)},
        )
```
A class-based evaluator needs to inherit from the `Evaluator` base class and must implement the `evaluate()` method. As with function-based evaluators, the `evaluate()` method only accepts predefined parameter names (e.g. `evaluated_model_output`, `evaluated_model_gold_answer`).
The return type of the `evaluate()` method can be a `bool` or an `EvaluationResult` object, as shown in this example. The `EvaluationResult` object provides a more detailed assessment by including:

- `score_raw`: The calculated score reflecting the similarity between the evaluated output and the gold answer. While the score is often normalized between 0 and 1, with 1 representing an exact match, it doesn't have to be normalized.
- `pass_`: A boolean indicating whether the score meets the pass threshold.
- `tags`: A dictionary that can store additional metadata as key-value pairs, such as the threshold used during the evaluation.

If the return type is a boolean, that is equivalent to returning an `EvaluationResult` object with only the `pass_` value set.
Full Code Example
Below is a complete example demonstrating how to use the `LevenshteinScorer` evaluator to assess the similarity between model outputs and gold answers:
```python
from Levenshtein import distance

from patronus import Client, Evaluator, EvaluationResult

client = Client()


class LevenshteinScorer(Evaluator):
    def __init__(self, pass_threshold: float):
        self.pass_threshold = pass_threshold
        super().__init__()

    def evaluate(self, evaluated_model_output: str, evaluated_model_gold_answer: str) -> EvaluationResult:
        max_len = max(len(x) for x in [evaluated_model_output, evaluated_model_gold_answer])
        score = 1
        if max_len > 0:
            score = 1 - (distance(evaluated_model_output, evaluated_model_gold_answer) / max_len)
        return EvaluationResult(
            score_raw=score,
            pass_=score >= self.pass_threshold,
            tags={"pass_threshold": str(self.pass_threshold)},
        )


client.experiment(
    "Tutorial",
    data=[
        {
            "evaluated_model_input": "Translate 'Goodbye' to Spanish.",
            "evaluated_model_output": "Hasta luego",
            "evaluated_model_gold_answer": "Adiós",
        },
        {
            "evaluated_model_input": "Summarize: 'The quick brown fox jumps over the lazy dog'.",
            "evaluated_model_output": "Quick brown fox jumps over dog",
            "evaluated_model_gold_answer": "The quick brown fox jumps over the lazy dog",
        },
    ],
    evaluators=[LevenshteinScorer(pass_threshold=0.8)],
    experiment_name="Levenshtein Distance",
)
```
Async Evaluators
The Patronus Experimentation Framework supports both synchronous and asynchronous evaluators. Defining an asynchronous evaluator is as simple as using Python's async functions.
```python
from patronus import evaluator, Evaluator

@evaluator
async def my_evaluator(...):
    ...

class MyEvaluator(Evaluator):
    ...

    async def evaluate(...):
        ...
```
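For example, an async evaluator can await I/O, such as a call to another model or a moderation service, before returning a verdict. The sketch below stubs that call with a local coroutine so it runs standalone; `fetch_moderation_flag` is a hypothetical helper, and in the framework you would still apply `@evaluator` (or subclass `Evaluator`) as shown above:

```python
import asyncio

async def fetch_moderation_flag(text: str) -> bool:
    # Hypothetical stand-in for a real async call (e.g. an HTTP moderation endpoint).
    await asyncio.sleep(0)
    return "forbidden" not in text.lower()

async def is_clean(evaluated_model_output: str) -> bool:
    # Async evaluators are awaited by the framework, so they can do non-blocking I/O.
    return await fetch_moderation_flag(evaluated_model_output)

print(asyncio.run(is_clean("Hello world")))  # True
```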
Remote Patronus Evaluators
Referencing Evaluators (PII Evaluator Example)
Using remote Patronus Evaluators is straightforward. For Evaluators that do not require Profiles, you can reference them by simply providing an ID or an alias:
```python
from patronus import Client

client = Client()

# Reference a remote evaluator by its alias
detect_pii = client.remote_evaluator("pii")

client.experiment(
    "Tutorial",
    data=...,
    task=...,
    evaluators=[detect_pii],
)
```
The code above is equivalent to:

```python
detect_pii = client.remote_evaluator(
    evaluator="pii-2024-05-31",
    profile_name="system:detect-personally-identifiable-information",
)
```
Note that instead of specifying the full evaluator ID, we've used the alias `pii`. Additionally, since the `pii` evaluator implicitly uses the `system:detect-personally-identifiable-information` profile, you don't need to specify the profile name.
Full Code Example
Below is an example of how to use the `pii` evaluator to detect personally identifiable information (PII) in model outputs. The evaluator is referenced by its alias:
```python
from patronus import Client

client = Client()

detect_pii = client.remote_evaluator("pii")

client.experiment(
    "Tutorial",
    data=[
        {
            "evaluated_model_input": "Please provide your contact details.",
            "evaluated_model_output": "My email is [email protected] and my phone number is 123-456-7890.",
        },
        {
            "evaluated_model_input": "Share your personal information.",
            "evaluated_model_output": "My name is Jane Doe and I live at 123 Elm Street.",
        },
    ],
    evaluators=[detect_pii],
    experiment_name="Detect PII",
)
```
Referencing Evaluators (Custom Evaluator Example)
For evaluators that require Profiles, such as the Custom Evaluator, you must provide the `profile_name` along with the evaluator's ID or alias:
```python
is_polite_evaluator = client.remote_evaluator(
    "custom-large",
    "system:is-polite",
)
```
In this example, the `custom-large` evaluator is referenced with a specific system profile managed by the Patronus AI team, which is available in your account out of the box.
Creating Evaluator Profiles Dynamically (Custom Evaluator)
If you need to create a profile from code, which is often required for a Custom Evaluator, you can specify its profile configuration directly in your code:
```python
import textwrap

evaluate_proper_language = client.remote_evaluator(
    "custom-large",
    "detect-requested-programming-languages",
    profile_config={
        "pass_criteria": textwrap.dedent(
            """
            The MODEL OUTPUT should provide only valid code in any well-known programming language.
            """
        ),
    },
)
```
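The `textwrap.dedent` call is there so the pass criteria can be written as an indented triple-quoted string in your source file without the leading whitespace becoming part of the criteria. A quick standalone illustration:

```python
import textwrap

raw = """
    The MODEL OUTPUT should provide only valid code.
    """
criteria = textwrap.dedent(raw)
# The common leading indentation is stripped from every line:
assert criteria == "\nThe MODEL OUTPUT should provide only valid code.\n"
```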
If you attempt to tweak the profile configuration of an existing profile, the call will raise an error. This safeguard is in place because modifying a profile can be risky: changes may impact other users in your account who rely on the same profile, or could disrupt a production workload that depends on it.
To update an evaluator profile, you must explicitly pass the `allow_update=True` argument:
```python
import textwrap

from patronus import Client

client = Client()

evaluate_proper_language = client.remote_evaluator(
    "custom-large",
    "detect-requested-programming-languages",
    profile_config={
        "pass_criteria": textwrap.dedent(
            """
            The MODEL OUTPUT should provide only valid code in any well-known programming language.
            The MODEL OUTPUT should consist of the code in a programming language specified in the USER INPUT.
            """
        ),
    },
    allow_update=True,
)
```
Full Code Example
Here’s a complete example that demonstrates how to create a dynamic evaluator profile, define evaluation criteria, and run an experiment:
```python
import textwrap

from patronus import Client

client = Client()

evaluate_proper_language = client.remote_evaluator(
    "custom-large",
    "detect-requested-programming-languages",
    profile_config={
        "pass_criteria": textwrap.dedent(
            """
            The MODEL OUTPUT should provide only valid code in any well-known programming language.
            The MODEL OUTPUT should consist of the code in a programming language specified in the USER INPUT.
            """
        ),
    },
    allow_update=True,
)

data = [
    {
        "evaluated_model_input": "Write a hello world example in Python.",
        "evaluated_model_output": "print('Hello World!')",
    },
    {
        "evaluated_model_input": "Write a hello world example in JavaScript.",
        "evaluated_model_output": "print('Hello World!')",
    },
]

client.experiment(
    "Tutorial",
    data=data,
    evaluators=[evaluate_proper_language],
    experiment_name="Detect Programming Languages",
)
```