
Evaluators

Understanding and using evaluators in Patronus

Note: For comprehensive API documentation and detailed examples, please refer to the Python SDK Documentation.

What are Evaluators?

In Patronus, evaluators assess the quality of AI outputs. They help you:

  • Measure model performance across various metrics
  • Detect potential issues like hallucinations or safety concerns
  • Compare different models or prompt strategies
  • Monitor production AI systems for quality

Each evaluation produces a result that typically includes:

  • A score (usually 0-1)
  • A binary pass/fail result
  • An optional explanation
  • Additional metadata
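
For example, a minimal function-based evaluator that fills in each of these fields might look like the sketch below. The score, pass_, and tags fields appear in the full examples later on this page; the explanation and metadata field names are assumptions based on the description above.

from patronus import evaluator, EvaluationResult

@evaluator()
def exact_match(task_output: str, gold_answer: str) -> EvaluationResult:
    # Simple case-insensitive comparison against the expected answer
    matched = task_output.strip().lower() == gold_answer.strip().lower()
    return EvaluationResult(
        score=1.0 if matched else 0.0,  # score in the 0-1 range
        pass_=matched,                  # binary pass/fail result
        explanation="Exact match" if matched else "Output differs from the gold answer",  # field name assumed
        metadata={"comparison": "case-insensitive"},  # field name assumed
    )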

Types of Evaluators

Patronus supports three main types of evaluators:

1. Function-Based Evaluators

Simple Python functions decorated with @evaluator().

2. Class-Based Evaluators

More sophisticated evaluators created by extending base classes:

  • StructuredEvaluator for synchronous evaluation
  • AsyncStructuredEvaluator for asynchronous operations
  • Can maintain state and use external resources

Structured evaluators follow the same interface as Patronus Evaluators and work out of the box with experiments.
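
For asynchronous work, such as calling an external API, the same pattern applies with AsyncStructuredEvaluator. The sketch below is illustrative: it assumes AsyncStructuredEvaluator is importable from the patronus package alongside StructuredEvaluator and exposes an async evaluate method with the same keyword-argument interface, and the moderation check is a stand-in for a real external call.

import asyncio

from patronus import AsyncStructuredEvaluator, EvaluationResult  # import location assumed


class ExternalModerationCheck(AsyncStructuredEvaluator):
    async def evaluate(self, *, task_output: str, **kwargs) -> EvaluationResult:
        # Stand-in for an await on an external moderation or scoring service
        await asyncio.sleep(0.1)
        flagged = "forbidden" in task_output.lower()
        return EvaluationResult(score=0.0 if flagged else 1.0, pass_=not flagged)

An instance's evaluate coroutine can then be awaited inside your async application, or driven with asyncio.run() for a quick check.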

3. Patronus Evaluators

Powerful pre-built evaluators that run on Patronus infrastructure:

  • lynx for hallucination detection
  • judge for quality assessments
  • And many more specialized evaluators

Creating Custom Evaluators

The following example uses the transformers library from Hugging Face. Install it with pip install transformers torch before running this code.

Function-Based Example

import numpy as np
from transformers import BertTokenizer, BertModel
 
from patronus import evaluator, init, EvaluationResult
 
init()
 
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
 
 
@evaluator()
def bert_score(task_output: str, gold_answer: str, pass_threshold: float = 0.8) -> EvaluationResult:
 
    # Tokenize and get embeddings
    output_toks = tokenizer(task_output, return_tensors="pt", padding=True, truncation=True)
    gold_answer_toks = tokenizer(gold_answer, return_tensors="pt", padding=True, truncation=True)
 
    output_embeds = model(**output_toks).last_hidden_state.mean(dim=1).detach().numpy()
    gold_answer_embeds = model(**gold_answer_toks).last_hidden_state.mean(dim=1).detach().numpy()
 
    # Calculate cosine similarity
    score = np.dot(output_embeds, gold_answer_embeds.T) / (
        np.linalg.norm(output_embeds) * np.linalg.norm(gold_answer_embeds)
    )
 
    # Convert to scalar if necessary
    if hasattr(score, 'item'):
        score = score.item()
 
    return EvaluationResult(
        score=score,
        pass_=score >= pass_threshold,
        tags={"pass_threshold": str(pass_threshold)},
    )
 
# Calling the evaluator function will automatically record the result to the platform.
result = bert_score(task_output="The capital of France is Paris.", gold_answer="Paris", pass_threshold=0.6)
result.pretty_print()

Class-Based Example

Class-based evaluators provide a structured way to create complex evaluators with initialization steps and state management. They follow the standard Patronus evaluator interface while allowing more sophisticated implementations.

Key advantages of class-based evaluators:

  • Encapsulate initialization steps like loading models
  • Separate configuration from evaluation logic

import numpy as np
from transformers import BertTokenizer, BertModel
 
from patronus import StructuredEvaluator, EvaluationResult, init
 
init()
 
 
class BERTScore(StructuredEvaluator):
    def __init__(self, pass_threshold: float):
        self.pass_threshold = pass_threshold
        self.tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
        self.model = BertModel.from_pretrained("bert-base-uncased")
 
    def evaluate(self, *, task_output: str, gold_answer: str, **kwargs) -> EvaluationResult:
        output_toks = self.tokenizer(task_output, return_tensors="pt", padding=True, truncation=True)
        gold_answer_toks = self.tokenizer(gold_answer, return_tensors="pt", padding=True, truncation=True)
 
        output_embeds = self.model(**output_toks).last_hidden_state.mean(dim=1).detach().numpy()
        gold_answer_embeds = self.model(**gold_answer_toks).last_hidden_state.mean(dim=1).detach().numpy()
 
        score = np.dot(output_embeds, gold_answer_embeds.T) / (
            np.linalg.norm(output_embeds) * np.linalg.norm(gold_answer_embeds)
        )
 
        # np.dot on row vectors yields a 1x1 array; convert to a scalar
        # as in the function-based example above
        score = score.item()
 
        return EvaluationResult(
            score=score,
            pass_=score >= self.pass_threshold,
            tags={"pass_threshold": str(self.pass_threshold)},
        )
 
 
bert_scorer = BERTScore(pass_threshold=0.6)
 
# Calling the evaluator function will automatically record the result to the platform.
result = bert_scorer.evaluate(task_output="The capital of France is Paris.", gold_answer="Paris")
result.pretty_print()

Using Remote Evaluators

from patronus import init
from patronus.evals import RemoteEvaluator
 
init()
 
hallucination_checker = RemoteEvaluator("lynx", "patronus:hallucination")
 
result = hallucination_checker.evaluate(
    task_input="What's the largest animal?",
    task_output="The blue whale is the largest animal, weighing up to 200 tons.",
    task_context="The blue whale is the largest animal on Earth, weighing up to 173 tons."
)
 
result.pretty_print()
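
The returned result can be inspected the same way as results from local evaluators. Assuming it exposes the same pass_ and score fields as EvaluationResult above, a caller might branch on the outcome:

# Assumes the remote result carries pass_ and score like EvaluationResult above
if not result.pass_:
    print(f"Possible hallucination detected (score={result.score})")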

Integration with Tracing

When the Patronus SDK is initialized with init(), evaluators are automatically tracked in the Patronus platform:

from patronus import init, traced, evaluator
 
init()
 
@evaluator()
def sentiment_check(text: str) -> float:
    # Simplified sentiment scoring
    positive_words = ["good", "great", "excellent"]
    negative_words = ["bad", "poor", "terrible"]
    
    text_lower = text.lower()
    positive_count = sum(1 for word in positive_words if word in text_lower)
    negative_count = sum(1 for word in negative_words if word in text_lower)
    
    if positive_count + negative_count == 0:
        return 0.5  # Neutral
    
    return positive_count / (positive_count + negative_count)
 
@traced()
def generate_review_summary(review: str) -> dict:
    summary = f"Summary of: {review[:50]}..."
    sentiment = sentiment_check(review)
    
    return {
        "summary": summary,
        "sentiment": sentiment,
        "is_positive": sentiment > 0.6
    }
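
Calling the traced function records a trace span for generate_review_summary and, because the SDK was initialized with init(), the nested sentiment_check evaluation is tracked alongside it. A quick usage sketch (the review text is illustrative):

result = generate_review_summary("The onboarding flow was great and support was excellent.")
print(result)  # {'summary': 'Summary of: ...', 'sentiment': 1.0, 'is_positive': True}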

For more detailed examples and advanced configurations, please refer to the Python SDK Documentation.
