
Evaluator weights

Assigning weights to evaluators in experiments

Evaluator weights are only supported when using evaluators within the experiment framework. This feature is not available for standalone evaluator usage.

You can assign weights to evaluators to indicate their relative importance in your evaluation strategy. Weights can be provided as either strings or floats representing valid decimal numbers and are automatically stored as experiment metadata.

Weights behave consistently across all evaluator types, but how you configure them depends on whether you're using remote, function-based, or class-based evaluators.

Weight support by evaluator type

Each evaluator type handles weight configuration differently:

Remote evaluators

For remote evaluators, pass the weight parameter directly to the RemoteEvaluator constructor:

from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
 
# Remote evaluator with weight (string or float)
pii_evaluator = RemoteEvaluator("pii", "patronus:pii:1", weight="0.6")
conciseness_evaluator = RemoteEvaluator("judge", "patronus:is-concise", weight=0.4)
 
experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[pii_evaluator, conciseness_evaluator]
)

Function-based evaluators

For function-based evaluators, pass the weight parameter to the FuncEvaluatorAdapter that wraps your evaluator function:

from patronus import evaluator
from patronus.experiments import FuncEvaluatorAdapter, run_experiment
from patronus.datasets import Row
 
@evaluator()
def exact_match(row: Row, **kwargs) -> bool:
    return row.task_output.lower().strip() == row.gold_answer.lower().strip()
 
# Function evaluator with weight (string or float)
exact_match_weighted = FuncEvaluatorAdapter(exact_match, weight=0.7)
 
experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[exact_match_weighted]
)

Class-based evaluators

For class-based evaluators, pass the weight parameter to your evaluator's constructor and ensure it's passed to the parent class:

from typing import Optional, Union
from patronus import StructuredEvaluator, EvaluationResult
from patronus.experiments import run_experiment
 
class CustomEvaluator(StructuredEvaluator):
    def __init__(self, threshold: float, weight: Optional[Union[str, float]] = None):
        super().__init__(weight=weight)  # Pass to parent class
        self.threshold = threshold
 
    def evaluate(self, *, task_output: str, **kwargs) -> EvaluationResult:
        score = len(task_output) / 100  # Simple length-based scoring
        return EvaluationResult(
            score=score,
            pass_=score >= self.threshold
        )
 
# Class-based evaluator with weight (string or float)
custom_evaluator = CustomEvaluator(threshold=0.5, weight=0.3)
 
experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[custom_evaluator]
)

Complete example

Here's a comprehensive example demonstrating weighted evaluators of all three types:

from patronus.experiments import FuncEvaluatorAdapter, run_experiment
from patronus import RemoteEvaluator, EvaluationResult, StructuredEvaluator, evaluator
from patronus.datasets import Row
 
class DummyEvaluator(StructuredEvaluator):
    def evaluate(self, task_output: str, gold_answer: str, **kwargs) -> EvaluationResult:
        return EvaluationResult(score=1, pass_=True)
 
@evaluator()
def exact_match(row: Row, **kwargs) -> bool:
    return row.task_output.lower().strip() == row.gold_answer.lower().strip()
 
experiment = run_experiment(
    project_name="Weighted Evaluation Example",
    dataset=[
        {
            "task_input": "Please provide your contact details.",
            "task_output": "My email is john.doe@example.com and my phone number is 123-456-7890.",
            "gold_answer": "My email is john.doe@example.com and my phone number is 123-456-7890.",
        },
        {
            "task_input": "Share your personal information.",
            "task_output": "My name is Jane Doe and I live at 123 Elm Street.",
            "gold_answer": "My name is Jane Doe and I live at 123 Elm Street.",
        },
    ],
    evaluators=[
        RemoteEvaluator("pii", "patronus:pii:1", weight="0.3"), # Remote evaluator with string weight
        FuncEvaluatorAdapter(exact_match, weight="0.3"), # Function evaluator with string weight
        DummyEvaluator(weight="0.4"), # Class evaluator with string weight
    ],
    experiment_name="Weighted Evaluators Demo"
)

Weight validation and rules

When using evaluator weights, keep these rules in mind:

  • Experiments only: Weights are available exclusively within the experiment framework; they cannot be used with standalone evaluator calls.
  • Valid format: Weights must be valid decimal numbers provided as either strings or floats (e.g., "0.3", 1.0, 0.7).
  • Consistency: The same evaluator (identified by its canonical name) cannot have different weights within the same experiment.
  • Automatic storage: Weights are automatically collected and stored in the experiment's metadata under the "evaluator_weights" key.
  • Optional: Weights are optional; evaluators without weights simply have no weight metadata stored.
  • Best practice: Consider making weights sum to 1.0 for clearer interpretation of relative importance.
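To see why summing weights to 1.0 makes interpretation easier, here is a minimal sketch of combining per-evaluator scores into a single weighted score. The `weighted_score` helper and the score values are purely illustrative and not part of the Patronus SDK; the SDK stores weights as experiment metadata and does not compute an aggregate for you.

```python
# Illustrative only: this helper is NOT part of the Patronus SDK.
# It shows how weights express relative importance when combining scores.

def weighted_score(scores: dict, weights: dict) -> float:
    """Combine per-evaluator scores using their assigned weights.

    Weights may be strings or floats, mirroring the accepted formats.
    """
    total_weight = sum(float(weights[name]) for name in scores)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(
        float(scores[name]) * float(weights[name]) for name in scores
    ) / total_weight

# When weights sum to 1.0, the result reads as a direct weighted average:
scores = {"pii": 1.0, "exact_match": 0.0, "dummy": 1.0}
weights = {"pii": "0.3", "exact_match": "0.3", "dummy": "0.4"}
print(weighted_score(scores, weights))  # approximately 0.7
```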

Error examples

Here are common mistakes to avoid:

# Invalid weight format: raises TypeError
RemoteEvaluator("judge", "patronus:is-concise", weight="invalid")
RemoteEvaluator("judge", "patronus:is-concise", weight=[1, 2, 3])  # Lists not supported
 
# Inconsistent weights for the same evaluator: raises TypeError during the experiment
run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[
        RemoteEvaluator("judge", "patronus:is-concise", weight=0.7),
        RemoteEvaluator("judge", "patronus:is-concise", weight="0.3"),  # Different weight!
    ]
)
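The format and consistency rules above can also be mirrored client-side before running an experiment. The `validate_weights` helper below is hypothetical (not part of the Patronus SDK); it only sketches the checks described in this section, under the assumption that evaluators are identified by a name string.

```python
# Illustrative only: validate_weights is a hypothetical helper, NOT a
# Patronus SDK API. It mirrors the weight rules from this section.

def validate_weights(weighted_evaluators: list) -> dict:
    """Check (evaluator_name, weight) pairs against the weight rules:
    each weight must parse as a decimal number, and the same evaluator
    name must not appear with two different weights."""
    seen: dict = {}
    for name, weight in weighted_evaluators:
        # Only strings and numbers are accepted (lists etc. are not)
        if not isinstance(weight, (str, float, int)):
            raise TypeError(
                f"weight for {name!r} must be a string or float, "
                f"got {type(weight).__name__}"
            )
        try:
            value = float(weight)
        except ValueError:
            raise TypeError(
                f"weight for {name!r} is not a valid decimal number: {weight!r}"
            )
        if name in seen and seen[name] != value:
            raise TypeError(
                f"conflicting weights for {name!r}: {seen[name]} vs {value}"
            )
        seen[name] = value
    return seen

# Consistent weights pass (string "0.3" and float 0.3 agree):
validate_weights([("pii", "0.3"), ("exact_match", 0.3), ("dummy", "0.4")])

# Conflicting weights for the same evaluator raise TypeError:
# validate_weights([("judge", 0.7), ("judge", "0.3")])
```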
