Working with Evaluators

The Patronus AI Experimentation Framework lets you write your own evaluators that run locally, as well as use remote Patronus-hosted evaluators. This document shows you how to write your own evaluators and how to combine them with Patronus Evaluators.

Function-Based Evaluators

Let's define a simple evaluator that compares the model output to the gold answer. (This evaluator is case-insensitive and ignores leading and trailing whitespace.)

from patronus import evaluator, Row

@evaluator
def iexact_match(row: Row) -> bool:
    return row.evaluated_model_output.lower().strip() == row.evaluated_model_gold_answer.lower().strip()

To define a function-based evaluator, you need to wrap your evaluator function with the @evaluator decorator. The function can accept any of the parameters described in the Evaluator Definition section. The framework will automatically inject the appropriate values based on the parameter names you specify.

In its simplest form, as shown in the example above, an evaluator can return a boolean value. The framework will automatically convert this into an EvaluationResult object. For more complex evaluations, you can return any of the supported return types described in the Evaluator Definition section.
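
For instance, a function-based evaluator can return a full EvaluationResult when a pass/fail flag alone isn't enough. The sketch below is illustrative only: the overlap score and the 0.5 threshold are arbitrary choices made for this example.

from patronus import evaluator, EvaluationResult, Row

@evaluator
def prefix_overlap(row: Row) -> EvaluationResult:
    # Toy score: fraction of positions where output and gold answer share the same character
    output = row.evaluated_model_output.lower().strip()
    gold = row.evaluated_model_gold_answer.lower().strip()
    matching = sum(1 for a, b in zip(output, gold) if a == b)
    score = matching / max(len(gold), 1)
    return EvaluationResult(
        score_raw=score,
        pass_=score >= 0.5,  # arbitrary threshold chosen for this example
        tags={"threshold": "0.5"},
    )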

Full Code Example

Below is a complete example demonstrating how to use the iexact_match evaluator:

from patronus import Client, evaluator, Row

client = Client()


@evaluator
def iexact_match(row: Row) -> bool:
    return row.evaluated_model_output.lower().strip() == row.evaluated_model_gold_answer.lower().strip()


client.experiment(
    "Tutorial",
    dataset=[
        {
            "evaluated_model_input": "Translate 'Good night' to French.",
            "evaluated_model_output": "bonne nuit",
            "evaluated_model_gold_answer": "Bonne nuit",
        },
        {
            "evaluated_model_input": "Summarize: 'AI improves efficiency'.",
            "evaluated_model_output": "ai improves efficiency",
            "evaluated_model_gold_answer": "AI improves efficiency",
        },
    ],
    evaluators=[iexact_match],
    experiment_name="Case Insensitive Match",
)

Class-Based Evaluators

For a more complex example, we'll use BERTScore to measure embedding similarity. BERTScore measures the cosine similarity between two BERT embeddings, which can be used to compare string similarity. In our case, we want to compare the model's output to the gold answer. The output doesn't need to be an exact match, but it should be close. Additionally, we want to be able to set a threshold to determine whether the evaluation passes or not.

Before we can start, we need to install the Transformers and PyTorch dependencies:

pip install transformers torch

Now we can write our class-based evaluator.

from transformers import BertTokenizer, BertModel
import numpy as np
from patronus import Evaluator, EvaluationResult, Row


class BERTScore(Evaluator):
    def __init__(self, pass_threshold: float):
        self.pass_threshold = pass_threshold
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased")

        super().__init__()

    def evaluate(self, row: Row) -> EvaluationResult:
        # Tokenize text
        output_toks = self.tokenizer(row.evaluated_model_output, return_tensors="pt", padding=True, truncation=True)
        gold_answer_toks = self.tokenizer(
            row.evaluated_model_gold_answer, return_tensors="pt", padding=True, truncation=True
        )

        # Obtain embeddings from BERT model
        output_embeds = self.model(**output_toks).last_hidden_state.mean(dim=1).detach().numpy()
        gold_answer_embeds = self.model(**gold_answer_toks).last_hidden_state.mean(dim=1).detach().numpy()

        # Calculate cosine similarity and extract the scalar from the resulting 1x1 matrix
        score = (
            np.dot(output_embeds, gold_answer_embeds.T)
            / (np.linalg.norm(output_embeds) * np.linalg.norm(gold_answer_embeds))
        ).item()

        return EvaluationResult(
            score_raw=score,
            pass_=score >= self.pass_threshold,
            tags={"pass_threshold": str(self.pass_threshold)},
        )

A class-based evaluator needs to inherit from the Evaluator base class and must implement the evaluate() method. Similar to the function-based evaluator, the evaluate() method only accepts predefined parameter names.

The return type of the evaluate() method can be a bool or an EvaluationResult object, as shown in this example. The EvaluationResult object provides a more detailed assessment by including:

  • score_raw: The calculated score reflecting the similarity between the evaluated output and the gold answer. While the score is often normalized between 0 and 1, with 1 representing an exact match, it doesn't have to be normalized.
  • pass_: A boolean indicating whether the score meets the pass threshold.
  • tags: A dictionary that can store additional metadata as key-value pairs, such as the threshold used during the evaluation.

If the return type is a boolean, that is equivalent to returning an EvaluationResult object with only the pass_ value set.
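
For illustration, the iexact_match evaluator from earlier could equivalently be written to return an explicit EvaluationResult:

from patronus import evaluator, EvaluationResult, Row

@evaluator
def iexact_match_explicit(row: Row) -> EvaluationResult:
    matched = row.evaluated_model_output.lower().strip() == row.evaluated_model_gold_answer.lower().strip()
    # Equivalent to simply returning the boolean `matched` from the evaluator
    return EvaluationResult(pass_=matched)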

Full Code Example

Below is a complete example demonstrating how to use the BERTScore evaluator to assess the similarity between model outputs and gold answers.

from transformers import BertTokenizer, BertModel
import numpy as np
from patronus import Client, Evaluator, EvaluationResult, Row

client = Client()


class BERTScore(Evaluator):
    def __init__(self, pass_threshold: float):
        self.pass_threshold = pass_threshold
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased")

        super().__init__()

    def evaluate(self, row: Row) -> EvaluationResult:
        # Tokenize text
        output_toks = self.tokenizer(row.evaluated_model_output, return_tensors="pt", padding=True, truncation=True)
        gold_answer_toks = self.tokenizer(
            row.evaluated_model_gold_answer, return_tensors="pt", padding=True, truncation=True
        )

        # Obtain embeddings from BERT model
        output_embeds = self.model(**output_toks).last_hidden_state.mean(dim=1).detach().numpy()
        gold_answer_embeds = self.model(**gold_answer_toks).last_hidden_state.mean(dim=1).detach().numpy()

        # Calculate cosine similarity and extract the scalar from the resulting 1x1 matrix
        score = (
            np.dot(output_embeds, gold_answer_embeds.T)
            / (np.linalg.norm(output_embeds) * np.linalg.norm(gold_answer_embeds))
        ).item()

        return EvaluationResult(
            score_raw=score,
            pass_=score >= self.pass_threshold,
            tags={"pass_threshold": str(self.pass_threshold)},
        )


client.experiment(
    "Tutorial",
    dataset=[
        {
            "evaluated_model_input": "Translate 'Goodbye' to Spanish.",
            "evaluated_model_output": "Hasta luego",
            "evaluated_model_gold_answer": "Adiรณs",
        },
        {
            "evaluated_model_input": "Summarize: 'The quick brown fox jumps over the lazy dog'.",
            "evaluated_model_output": "Quick brown fox jumps over dog",
            "evaluated_model_gold_answer": "The quick brown fox jumps over the lazy dog",
        },
    ],
    evaluators=[BERTScore(pass_threshold=0.8)],
    experiment_name="BERTScore Output Label Similarity",
)

Async Evaluators

The Patronus Experimentation Framework supports both synchronous and asynchronous evaluators. Defining an asynchronous evaluator is as simple as using Python's async functions.

from patronus import evaluator, Evaluator

@evaluator
async def my_evaluator(row):
    ...

class MyEvaluator(Evaluator):
    ...
    async def evaluate(self, row):
        ...

Conditional evaluations

You can skip an evaluation by returning None. This is useful when an evaluation should only run under specific conditions.

from patronus import evaluator, Row

@evaluator
def validate_python_syntax(row: Row) -> bool | None:
    # Only evaluate Python samples; returning None skips the evaluation for this row
    if row.language != "python":
        return None

    # check_syntax is a user-defined helper, not shown here
    return check_syntax(row.code_block)

Patronus Evaluators

Patronus Evaluators are remotely hosted evaluators developed and maintained by the Patronus AI team. The following sections describe how to use these evaluators in your evaluation workflow and how to combine them with local evaluators.

Patronus remote evaluators expect your dataset to use the standard field names (evaluated_model_input, evaluated_model_output, etc.). If your dataset uses different field names, or you want to use a remote evaluator in an evaluation chain, you can use the RemoteEvaluator.wrap decorator to map your fields to the expected names.
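
A minimal field-mapping wrapper could look like this; the dataset columns question and answer are hypothetical, and the full range of wrap capabilities is covered in the Wrapping Patronus Evaluators section later in this document.

from patronus import Client, Row
from patronus.evaluators_remote import EvaluateCall

client = Client()

detect_pii = client.remote_evaluator("pii")


@detect_pii.wrap
def detect_pii_mapped(evaluate: EvaluateCall, row: Row, **kwargs):
    # Map the hypothetical dataset columns "question" and "answer" to the standard Patronus fields
    return evaluate(
        evaluated_model_input=row.question,
        evaluated_model_output=row.answer,
    )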

Referencing Evaluators (PII Evaluator Example)

Using Patronus Evaluators is straightforward. For Evaluators that do not require Profiles, you can reference them by simply providing an ID or an alias:

from patronus import Client

client = Client()

# Reference remote evaluator by alias
detect_pii = client.remote_evaluator("pii")

client.experiment(
    "Tutorial",
    dataset=...,
    task=...,
    evaluators=[detect_pii],
)

Full Code Example

Below is an example of how to use the pii evaluator to detect personally identifiable information (PII) in model outputs. The evaluator is referenced by its alias:

from patronus import Client

client = Client()

detect_pii = client.remote_evaluator("pii")

client.experiment(
    "Tutorial",
    dataset=[
        {
            "evaluated_model_input": "Please provide your contact details.",
            "evaluated_model_output": "My email is [email protected] and my phone number is 123-456-7890.",
        },
        {
            "evaluated_model_input": "Share your personal information.",
            "evaluated_model_output": "My name is Jane Doe and I live at 123 Elm Street.",
        },
    ],
    evaluators=[detect_pii],
    experiment_name="Detect PII",
)

Referencing Evaluators (Judge Evaluator Example)

For evaluators that require Profiles, such as the Judge Evaluator, you must provide the profile_name along with the evaluator's ID or alias.

is_polite_evaluator = client.remote_evaluator(
    "judge-large",
    "patronus:is-polite"
)

In this example, the evaluator "judge-large" is referenced with a specific system profile managed by the Patronus AI team, which is available in your account out of the box.
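
A referenced remote evaluator can then be passed to an experiment just like a local one, and the two kinds can be mixed freely. In the sketch below, the dataset entry is a placeholder and is_short is a made-up local evaluator used only for illustration.

from patronus import Client, evaluator, Row

client = Client()

is_polite_evaluator = client.remote_evaluator("judge-large", "patronus:is-polite")


@evaluator
def is_short(row: Row) -> bool:
    # Hypothetical local check: pass if the output stays under 200 characters
    return len(row.evaluated_model_output) < 200


client.experiment(
    "Tutorial",
    dataset=[
        {
            "evaluated_model_input": "Decline the meeting invitation.",
            "evaluated_model_output": "Thank you for the invitation, but unfortunately I can't make it.",
        },
    ],
    evaluators=[is_polite_evaluator, is_short],
    experiment_name="Politeness and Brevity",
)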

Creating Evaluator Profiles Dynamically (Judge Evaluator)

If you need to create a profile from code, which is often required for a Judge Evaluator, you can specify its profile configuration directly in your code:

evaluate_proper_language = client.remote_evaluator(
    "judge-large",
    "detect-requested-programming-languages",
    profile_config={
        "pass_criteria": textwrap.dedent(
            """
            The MODEL OUTPUT should provide only valid code in any well-known programming language.
            """
        ),
    },
)

If you attempt to change the configuration of an existing profile, the call will raise an error. This safeguard is in place because modifying a profile can be risky: changes may impact other users in your account who rely on the same profile, or disrupt a production workload that depends on it.

To update an evaluator profile, you must explicitly pass the allow_update=True argument:

import textwrap
from patronus import Client

cli = Client()

evaluate_proper_language = cli.remote_evaluator(
    "judge-large",
    "detect-requested-programming-languages",
    profile_config={
        "pass_criteria": textwrap.dedent(
            """
            The MODEL OUTPUT should provide only valid code in any well-known programming language.
            The MODEL OUTPUT should consist of the code in a programming language specified in the USER INPUT.
            """
        ),
    },
    allow_update=True,
)

Full Code Example

Here's a complete example that demonstrates how to create a dynamic evaluator profile, define evaluation criteria, and run an experiment:

import textwrap
from patronus import Client


client = Client()

evaluate_proper_language = client.remote_evaluator(
    "judge-large",
    "detect-requested-programming-languages",
    profile_config={
        "pass_criteria": textwrap.dedent(
            """
            The MODEL OUTPUT should provide only valid code in any well-known programming language.
            The MODEL OUTPUT should consist of the code in a programming language specified in the USER INPUT.
            """
        ),
    },
    allow_update=True,
)

dataset = [
    {
        "evaluated_model_input": "Write a hello world example in Python.",
        "evaluated_model_output": "print('Hello World!')",
    },
    {
        "evaluated_model_input": "Write a hello world example in JavaScript.",
        "evaluated_model_output": "print('Hello World!')",
    },
]

client.experiment(
    "Tutorial",
    dataset=dataset,
    evaluators=[evaluate_proper_language],
    experiment_name="Detect Programming Languages",
)

Wrapping Patronus Evaluators

The RemoteEvaluator.wrap decorator allows you to customize how remote evaluators interact with your data. This is particularly useful when you need to:

  • Map custom field names to standard Patronus fields
  • Conditionally skip evaluations based on your data
  • Preprocess data before evaluation
  • Work with chained evaluation results

Here's an example that demonstrates these capabilities:

from patronus import Client, Row, TaskResult, EvalParent
from patronus.evaluators_remote import EvaluateCall

cli = Client()

judge_eval = cli.remote_evaluator("judge", "is-valid-python-code")


@judge_eval.wrap
def is_valid_python_code(
    evaluate: EvaluateCall,
    row: Row,
    task_result: TaskResult,
    parent: EvalParent,
    # Remember to always add **kwargs to wrapped function
    **kwargs,
):
    # Skip evaluation if this is not part of a chain
    if not parent:
        return

    # Skip evaluation for non-Python code
    if row.codebase != "python":
        return None

    # Extract the code block from the markdown output (extract_code_block is a user-defined helper, not shown here)
    code_block = extract_code_block(task_result.evaluated_model_output)

    # Call the remote evaluator with mapped fields
    return evaluate(
        evaluated_model_system_prompt=row.agent_prompt,
        evaluated_model_input=row.user_text,
        evaluated_model_output=code_block,
    )

Using Patronus Evaluators Outside Experiments

Remote evaluators can be used directly without running a full experiment. This is useful for real-time evaluations, testing, or when you need to integrate evaluations into your own workflows.

There are two ways to use remote evaluators:

  1. Using the Client.remote_evaluator() method (recommended):
from patronus import Client

client = Client()

eval_is_code = client.remote_evaluator("judge", "patronus:is-code", max_attempts=3)

result = await eval_is_code.evaluate(
    evaluated_model_system_prompt="You are a Python coding assistant. Write clean, well-documented Python code.",
    evaluated_model_retrieved_context=["Python's print() function displays text or variables to the console."],
    evaluated_model_input="Write a simple hello world program in Python.",
    evaluated_model_output="print('Hello, World!')",
    evaluated_model_gold_answer="print('Hello, World!')",
    app="my-app",
)

The remote_evaluator() method returns a RemoteEvaluator instance that provides additional features, such as creating and updating evaluator profiles in code and automatic retries. Note that the first call to a remote evaluator may be slower, as it needs to verify its configuration.

  2. Using the lower-level API directly:
import patronus
from patronus import EvaluateRequest

client = patronus.Client()

result = await client.api.evaluate(
    EvaluateRequest(
        evaluators=[{"evaluator": "judge", "profile_name": "patronus:is-code"}],
        evaluated_model_system_prompt="You are a Python coding assistant. Write clean, well-documented Python code.",
        evaluated_model_retrieved_context=["Python's print() function displays text or variables to the console."],
        evaluated_model_input="Write a simple hello world program in Python.",
        evaluated_model_output="print('Hello, World!')",
        evaluated_model_gold_answer="print('Hello, World!')",
        app="my-app",
    )
)

This direct API approach is more lightweight but doesn't provide the additional features available through the RemoteEvaluator class.

Evaluator Definition

When creating an evaluator, whether function-based or class-based, you can access various parameters that provide context about the evaluation. These parameters must be named exactly as specified below:

  • row (patronus.Row): The complete row from the dataset, extending pandas.Series. It provides helper properties for well-defined fields like evaluated_model_input, evaluated_model_output, etc., making data access more convenient.

  • task_result (patronus.TaskResult): The result object returned by the task execution.

  • evaluated_model_system_prompt

  • evaluated_model_retrieved_context

  • evaluated_model_input

  • evaluated_model_output

  • evaluated_model_gold_answer

  • parent: Reference to results from previous chain links. This parameter is only available when using evaluation chaining in your experiment.

📘

It is recommended to use the evaluated_model_* field names, as these fields are properly tracked and reported to the Patronus platform. While the raw data is available through the row parameter, using the dedicated fields ensures better visibility and reporting of your evaluation process.
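
For illustration, here is a minimal sketch of parameter injection: the toy evaluator below declares two of the field names listed above as parameters and receives their values directly, without going through row.

from patronus import evaluator

@evaluator
def contains_gold_answer(evaluated_model_output, evaluated_model_gold_answer) -> bool:
    # The framework injects these values based solely on the parameter names
    return evaluated_model_gold_answer.lower().strip() in evaluated_model_output.lower()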

Evaluators can return different types of values, which are automatically coerced into an EvaluationResult object:

  • bool: A simple pass/fail evaluation
  • float, int: A raw score value
  • EvaluationResult: A complete evaluation result object

The EvaluationResult class is defined as follows:

class EvaluationResult:
    pass_: Optional[bool] = None          # Indicates whether the evaluation passed
    score_raw: Optional[float] = None     # Raw numerical score of the evaluation
    metadata: Optional[dict[str, Any]] = None  # Additional structured data
    tags: Optional[dict[str, str]] = None      # String key-value pairs for labeling

When returning a boolean, float, or int, the framework automatically converts it to an EvaluationResult:

  • A boolean return value sets the pass_ field
  • A float or int return value sets the score_raw field
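
For example, a purely numeric evaluator can rely on this coercion; the length-ratio score below is a toy example.

from patronus import evaluator, Row

@evaluator
def length_ratio(row: Row) -> float:
    # The float return value is coerced into EvaluationResult(score_raw=...)
    return len(row.evaluated_model_output) / max(len(row.evaluated_model_gold_answer), 1)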

For more complex evaluations, you can return an EvaluationResult object directly, allowing you to provide additional context through metadata and tags. Note that the metadata field is currently only available locally and is not reported to the Patronus platform; you can access this data by exporting evaluation results to CSV after the evaluation is complete. If you need to report structured data to the platform, use the tags field instead.

Here's an example that demonstrates using various evaluator parameters:

from patronus import evaluator, Row, TaskResult, EvaluationResult


@evaluator(name="comprehensive-check", profile_name="response-quality")
def check_response_quality(row: Row, task_result: TaskResult) -> EvaluationResult:
    input_length = len(row.evaluated_model_input)
    timestamp = row.get("timestamp", "N/A")  # Access custom field with default
    
    model_version = task_result.metadata.get("model_version", "unknown")

    # Perform evaluation: your quality calculation
    quality_score = calculate_quality(task_result.evaluated_model_output)

    return EvaluationResult(
        score_raw=quality_score,
        pass_=quality_score > 0.7,
        metadata={"input_length": input_length, "timestamp": timestamp},
        tags={"model_version": model_version},
    )