Concepts
Understanding evaluators in Patronus AI
What are evaluators?
Evaluators are automated functions that score the quality, safety, and performance of your LLM outputs. Think of them as automated graders: they take your model's output and tell you how well it performed against specific criteria.
Evaluators can assess different aspects of LLM performance:
- Accuracy: Does the output correctly answer the question?
- Safety: Does the output contain hallucinations, harmful content, or security vulnerabilities?
- Quality: Is the output coherent, relevant, and well-structured?
- Task-specific metrics: Custom criteria tailored to your use case
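As a mental model, here is a minimal "grader" in plain Python. It is illustrative only, not the Patronus SDK: the function takes a model output and a reference answer and returns a score, a pass/fail flag, and an explanation.

```python
# Illustrative only -- not the Patronus SDK. A trivial "grader" that scores an
# output against a reference answer by token overlap and applies a threshold.
def grade_output(output: str, reference: str, threshold: float = 0.5) -> dict:
    output_tokens = set(output.lower().split())
    reference_tokens = set(reference.lower().split())
    overlap = len(output_tokens & reference_tokens) / max(len(reference_tokens), 1)
    return {
        "score": round(overlap, 2),          # 0.0 - 1.0
        "passed": overlap >= threshold,      # pass/fail decision
        "explanation": f"{overlap:.0%} of reference tokens appear in the output",
    }

print(grade_output("Paris is the capital of France", "The capital of France is Paris"))
```

Real evaluators apply far more sophisticated criteria, but the shape is the same: data in, score and explanation out.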
Types of evaluators
Patronus supports several types of evaluators to fit different needs:
Patronus evaluators
Pre-built evaluators powered by Patronus's proprietary models, optimized for accuracy and reliability:
- Lynx: Advanced hallucination detection
- Glider: Rubric-based scoring with customizable criteria
- RAG evaluators: Specialized metrics for retrieval-augmented generation
- OWASP evaluators: Security vulnerability detection
These evaluators are ready to use out of the box and cover common evaluation needs.
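With the Patronus Python SDK, calling a hosted evaluator typically looks something like the sketch below. Treat the specific names as assumptions: `RemoteEvaluator`, the `"lynx"` family, and the `"patronus:hallucination"` criterion are illustrative; check the evaluator reference guide for the exact identifiers available to you.

```python
# Sketch under assumptions: the class name, evaluator family ("lynx"), and
# criterion ("patronus:hallucination") are illustrative -- confirm against the SDK docs.
import patronus
from patronus.evals import RemoteEvaluator

patronus.init()  # reads PATRONUS_API_KEY from the environment

lynx = RemoteEvaluator("lynx", "patronus:hallucination")
result = lynx.evaluate(
    task_input="What is the largest animal in the world?",
    task_context="The blue whale is the largest known animal.",
    task_output="The giant sandworm is the largest animal in the world.",
)
print(result.pass_, result.score, result.explanation)
```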
Judge evaluators
LLM-as-judge evaluators use language models to assess outputs based on custom criteria you define. These are ideal when you need subjective evaluation or domain-specific judgment.
When to use: Custom quality checks, subjective assessments, or when you need evaluation logic that requires reasoning.
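A hedged sketch of calling a judge evaluator, reusing the assumed `RemoteEvaluator` interface from above. The `"judge"` family and the criteria name are placeholders for a criterion you have defined in the Patronus platform.

```python
# Sketch under assumptions: "judge" and "my-brand-voice-check" are placeholders
# for a judge evaluator and a custom criterion defined in the Patronus platform.
import patronus
from patronus.evals import RemoteEvaluator

patronus.init()

judge = RemoteEvaluator("judge", "my-brand-voice-check")  # hypothetical criteria name
result = judge.evaluate(
    task_input="Write a support reply about a delayed shipment.",
    task_output="Hey, stuff happens. It'll get there when it gets there.",
)
print(result.pass_, result.explanation)
```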
Custom evaluators
Bring your own evaluation logic when you need something specific:
- Function-based: Write Python functions for simple scoring logic
- Class-based: Define evaluator classes for more complex logic
- External: Integrate third-party evaluation tools
When to use: scoring with custom non-LLM models, traditional code-based checks, or integrating existing evaluation code.
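A function-based custom evaluator can be as small as a decorated Python function. A minimal sketch, assuming the SDK exposes an `evaluator` decorator at this import path (verify the exact import and signature for your SDK version):

```python
# Sketch under assumptions: the `evaluator` decorator and its import path may
# differ in your SDK version -- check the custom evaluator docs.
from patronus import evaluator

@evaluator()
def exact_match(task_output: str, gold_answer: str) -> bool:
    # Simple code-based check: pass only if the output matches the expected answer.
    return task_output.strip().lower() == gold_answer.strip().lower()

# The decorated function can then be passed to experiments or called like any
# other evaluator in your workflow.
```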
Multimodal evaluators
Patronus evaluators support multimodal inputs including text, images, audio, and video. This lets you evaluate:
- Vision-language models (VLMs)
- Image generation quality
- Video and image understanding tasks
See multimodal evaluations for details.
How evaluators work
Evaluators follow a simple process:
- Input: You provide data to evaluate (prompts, outputs, reference answers, context)
- Processing: The evaluator analyzes the input based on its criteria
- Output: Returns scores, explanations, and metadata about the evaluation
Most evaluators also return explanations: human-readable reasoning for why they gave a particular score. This helps you understand not just what the score is, but why your output received it.
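Conceptually, the result of a single evaluation bundles a score, a pass/fail decision, an explanation, and metadata. The dataclass below is illustrative only (it is not the SDK's result type); it shows the kind of fields you can expect to read back.

```python
# Illustrative shape of an evaluation result -- not the SDK's actual class.
from dataclasses import dataclass, field

@dataclass
class EvaluationOutcome:
    score: float                                   # e.g. 0.0 - 1.0
    passed: bool                                   # threshold applied to the score
    explanation: str                               # human-readable reasoning for the score
    metadata: dict = field(default_factory=dict)   # evaluator name, latency, etc.

outcome = EvaluationOutcome(
    score=0.85,
    passed=True,
    explanation="The answer is supported by the retrieved context.",
    metadata={"evaluator": "lynx", "criteria": "hallucination"},
)
print(outcome.passed, outcome.explanation)
```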
Using evaluators
You can use evaluators in different contexts depending on your workflow:
- Experiments: Run evaluators across entire datasets to compare model configurations
- Real-time guardrails: Apply evaluators to production traffic as requests come in (see the sketch after this list)
- Batch evaluations: Process large datasets asynchronously
- Interactive testing: Test individual outputs in the UI for quick validation
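As an example of the guardrail pattern, the sketch below wraps a model call and only returns the answer if an evaluator passes it. Both `call_model` and `evaluate_safety` are hypothetical placeholders for your LLM call and whichever evaluator you choose (hosted, judge, or custom).

```python
# Guardrail sketch: `call_model` and `evaluate_safety` are hypothetical stand-ins
# for a real LLM call and a real evaluator.
def call_model(prompt: str) -> str:
    return "Model answer for: " + prompt  # stand-in for an actual LLM call

def evaluate_safety(output: str) -> bool:
    return "password" not in output.lower()  # stand-in for an actual safety evaluator

def answer_with_guardrail(prompt: str) -> str:
    output = call_model(prompt)
    if not evaluate_safety(output):
        # Block, retry, or fall back when the evaluator flags the output.
        return "Sorry, I can't help with that."
    return output

print(answer_with_guardrail("How do I reset my account?"))
```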
Next steps
- Learn about Patronus evaluators
- Browse the evaluator reference guide
- Create custom evaluators with Judge
- Explore multimodal evaluations
