What is Patronus AI?
Patronus AI is the leading tool to score and optimize Generative AI applications.
Patronus provides an end-to-end system to evaluate, monitor, and improve the performance of LLM systems, enabling developers to ship AI products safely and confidently.

Experimentation Framework: A/B test and optimize LLM system performance with experiments across different prompt, model, and data configurations.
Real-Time Monitoring: Monitor LLM and agent interactions in production and receive real-time alerts through tracing and logging.
Visualizations and Analytics: Visualize performance of your AI applications, compare outputs side-by-side, and obtain insights to improve system performance over time.
Powerful Evaluation Models: Automatically catch hallucinations and unsafe outputs with our suite of in-house evaluators, including Lynx and Glider, through our Evaluation API, or define your own evaluators in our SDK (see the sketch after this list).
Dataset Generation: Construct high-quality custom datasets with our proprietary dataset generation algorithms for RAG, Agents, and other architectures. Automatically expose weaknesses in your AI systems with our red-teaming algorithms.
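
As a rough illustration, the sketch below scores a single model output for hallucination by calling the Evaluation API over HTTP. The endpoint path, header name, evaluator identifier, and request fields are assumptions for illustration only; consult the API reference for the exact request schema.

```python
# Minimal sketch: score one LLM output against retrieved context.
# NOTE: the endpoint path, header, evaluator id, and field names below are
# assumptions for illustration -- check the Patronus API reference for the
# exact schema before wiring this into a real workflow.
import os
import requests

PATRONUS_API_KEY = os.environ["PATRONUS_API_KEY"]

response = requests.post(
    "https://api.patronus.ai/v1/evaluate",            # assumed endpoint
    headers={"X-API-KEY": PATRONUS_API_KEY},           # assumed auth header
    json={
        "evaluators": [{"evaluator": "lynx"}],         # assumed evaluator id
        "evaluated_model_input": "What is the warranty period?",
        "evaluated_model_output": "The warranty lasts ten years.",
        "evaluated_model_retrieved_context": ["Warranty period: 2 years."],
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # per-evaluator pass/fail result and score
```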
Getting started
Core LLM evaluation concepts
Learn the taxonomy of LLM evaluation
Run a Patronus evaluation
Integrate the Patronus API or SDK into existing workflows
Patronus Evaluators
Plug in turnkey metrics for RAG, Agents, NLP, OWASP, and more
Patronus experiments
Iterate on LLM system performance
Debug agent failures
Identify and fix failures in agentic systems
Production LLM monitoring
Trace and monitor your LLM applications
Read a guide
Start building
Debug agent failures
Identify and fix failures in agentic systems
Benchmark models
Compare model performance across your use case
Build evals with Percival Chat
Use AI assistance to create custom evaluators
Custom error taxonomy
Define domain-specific error categories
Evaluate RAG applications
Test retrieval quality and answer accuracy
Add guardrails
Implement safety checks in your application
LLM-as-judges
Configure custom criteria for your use case
Prompt management
Version and deploy prompts as code
Human-in-the-loop annotations
Augment evaluations with human feedback
Generate test datasets
Create datasets for your domain
