Evaluators

Understanding evaluators in Patronus AI

What are evaluators?

Evaluators are automated functions that score the quality, safety, and performance of your LLM outputs. Think of them as graders: they take your model's output and report how well it performed against specific criteria.

Evaluators can assess different aspects of LLM performance:

  • Accuracy: Does the output correctly answer the question?
  • Safety: Does the output contain hallucinations, harmful content, or security vulnerabilities?
  • Quality: Is the output coherent, relevant, and well-structured?
  • Task-specific metrics: Custom criteria tailored to your use case
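
As a minimal sketch of the idea in plain Python (not tied to any particular SDK), an evaluator can be as small as a function that compares an output against a reference answer and returns a score:

```python
# A toy "exact match" evaluator: returns 1.0 when the model's answer matches
# the reference answer (ignoring case and surrounding whitespace), else 0.0.
def exact_match(model_output: str, gold_answer: str) -> float:
    return 1.0 if model_output.strip().lower() == gold_answer.strip().lower() else 0.0

print(exact_match("Paris ", "paris"))  # 1.0
```

Real evaluators are richer than this (they return explanations and metadata, not just a number), but the contract is the same: data in, score out.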

Types of evaluators

Patronus supports several types of evaluators to fit different needs:

Patronus evaluators

Pre-built evaluators powered by Patronus's proprietary models, optimized for accuracy and reliability:

  • Lynx: Advanced hallucination detection
  • Glider: Rubric-based scoring with customizable criteria
  • RAG evaluators: Specialized metrics for retrieval-augmented generation
  • OWASP evaluators: Security vulnerability detection

These evaluators are ready to use out of the box and cover common evaluation needs.
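
For example, a Patronus evaluator such as Lynx is typically called through the Python SDK along these lines. This is a sketch: the class and parameter names shown (RemoteEvaluator, task_input, task_output, task_context) and the result fields reflect my reading of the Python SDK and may differ by version, so check the SDK reference for the exact interface.

```python
import patronus
from patronus.evals import RemoteEvaluator

patronus.init()  # assumes PATRONUS_API_KEY is set in the environment

# Lynx checks whether the output is grounded in the retrieved context.
check_hallucination = RemoteEvaluator("lynx", "patronus:hallucination")

result = check_hallucination.evaluate(
    task_input="What is the largest animal in the world?",
    task_output="The largest animal is the giant sandworm.",
    task_context=["The blue whale is the largest known animal."],
)
print(result.pass_, result.score)  # field names may vary by SDK version
```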

Judge evaluators

LLM-as-judge evaluators use language models to assess outputs based on custom criteria you define. These are ideal when you need subjective evaluation or domain-specific judgment.

When to use: Custom quality checks, subjective assessments, or when you need evaluation logic that requires reasoning.
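
A judge evaluator is then just a remote evaluator pointed at a criterion you define in the platform. The criterion name below ("my-conciseness-check") is hypothetical, and the call shape follows the same assumed SDK interface as the sketch above:

```python
import patronus
from patronus.evals import RemoteEvaluator

patronus.init()

# "judge" is the LLM-as-judge evaluator family; the criterion is one you
# define yourself in the Patronus platform (hypothetical name shown here).
is_concise = RemoteEvaluator("judge", "my-conciseness-check")

result = is_concise.evaluate(
    task_input="Summarize the meeting notes in one sentence.",
    task_output="The team agreed to ship the release on Friday.",
)
print(result.explanation)  # judge evaluators also return their reasoning
```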

Custom evaluators

Bring your own evaluation logic when you need something specific:

  • Function-based: Write Python functions for simple scoring logic
  • Class-based: Define evaluator classes for more complex logic
  • External: Integrate third-party evaluation tools

When to use: Custom non-LLM models, traditional code logic, or integrating existing evaluation code.
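
As a plain-Python sketch of the two in-process shapes (the SDK provides its own decorator and base classes for registering these, so treat the names here as illustration only):

```python
from dataclasses import dataclass

@dataclass
class EvalResult:
    passed: bool
    score: float
    explanation: str

# Function-based: simple scoring logic as a plain function.
def within_length_limit(output: str, max_chars: int = 280) -> EvalResult:
    ok = len(output) <= max_chars
    return EvalResult(ok, 1.0 if ok else 0.0,
                      f"{len(output)} characters (limit {max_chars})")

# Class-based: keeps configuration (here, a keyword blocklist) on the instance.
class NoBlockedTerms:
    def __init__(self, blocked: list[str]):
        self.blocked = [term.lower() for term in blocked]

    def evaluate(self, output: str) -> EvalResult:
        hits = [term for term in self.blocked if term in output.lower()]
        explanation = f"blocked terms found: {hits}" if hits else "no blocked terms"
        return EvalResult(not hits, 0.0 if hits else 1.0, explanation)
```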

Multimodal evaluators

Patronus evaluators support multimodal inputs including text, images, audio, and video. This lets you evaluate:

  • Vision-language models (VLMs)
  • Image generation quality
  • Video and image understanding tasks

See multimodal evaluations for details.

How evaluators work

Evaluators follow a simple process:

  1. Input: You provide data to evaluate (prompts, outputs, reference answers, context)
  2. Processing: The evaluator analyzes the input based on its criteria
  3. Output: Returns scores, explanations, and metadata about the evaluation

Most evaluators also return explanations: human-readable reasoning for why they gave a particular score. This helps you understand not just what the score is, but why the output received it.
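
Putting the three steps together with the assumed SDK interface from the earlier sketches (again, exact field names such as pass_, score, and explanation may differ; see the Python SDK reference):

```python
import patronus
from patronus.evals import RemoteEvaluator

patronus.init()
check_hallucination = RemoteEvaluator("lynx", "patronus:hallucination")

# 1. Input: the data to evaluate.
inputs = {
    "task_input": "Summarize the refund policy.",
    "task_output": "Refunds are available within 30 days of purchase.",
    "task_context": ["Our refund policy allows returns within 30 days of purchase."],
}

# 2. Processing: the evaluator applies its criteria to the input.
result = check_hallucination.evaluate(**inputs)

# 3. Output: scores, pass/fail, and a human-readable explanation.
print(result.score)
print(result.pass_)
print(result.explanation)
```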

Using evaluators

You can use evaluators in different contexts depending on your workflow:

  • Experiments: Run evaluators across entire datasets to compare model configurations (a sketch follows this list)
  • Real-time guardrails: Apply evaluators to production traffic as requests come in
  • Batch evaluations: Process large datasets asynchronously
  • Interactive testing: Test individual outputs in the UI for quick validation
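
For instance, the experiment workflow runs one or more evaluators over every row of a dataset. The run_experiment entry point and its parameters below are assumptions based on the Python SDK docs, and the criterion name is hypothetical; the other workflows reuse the same evaluators in different settings.

```python
import patronus
from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment

patronus.init()

# A tiny dataset of inputs and pre-generated model outputs to score offline.
dataset = [
    {"task_input": "What is the capital of France?",
     "task_output": "Paris is the capital of France."},
    {"task_input": "What is the capital of France?",
     "task_output": "Lyon is the capital of France."},
]

# Run the evaluator(s) across the whole dataset and compare results in the UI.
run_experiment(
    dataset=dataset,
    evaluators=[RemoteEvaluator("judge", "my-correctness-check")],  # hypothetical criterion
)
```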
