What is Patronus AI?
Patronus AI is a platform for scoring and optimizing generative AI applications.
Patronus provides an end-to-end system to evaluate, monitor, and improve the performance of LLM systems, enabling developers to ship AI products safely and confidently.

Experimentation Framework: A/B test and optimize LLM system performance with experiments on different prompt, model, and data configurations.
Real-Time Monitoring: Monitor LLM and agent interactions in production through tracing and logging, and receive real-time alerts.
Visualizations and Analytics: Visualize performance of your AI applications, compare outputs side-by-side, and obtain insights to improve system performance over time.
Powerful Evaluation Models: Automatically catch hallucinations and unsafe outputs with our suite of in-house evaluators, including Lynx and Glider, available through our Evaluation API, or define your own evaluator in our SDK.
Dataset Generation: Construct high-quality custom datasets with our proprietary dataset generation algorithms for RAG, Agents, and other architectures. Automatically expose weaknesses in your AI systems with our red-teaming algorithms.
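As a rough illustration of the experimentation idea above, the following sketch compares two prompt configurations on a tiny dataset using a toy evaluator. It is plain Python and does not use the Patronus SDK or API; every name in it (Example, fake_model, contains_expected, run_experiment) is a hypothetical stand-in, and fake_model replaces a real LLM call so the example runs offline.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Example:
    question: str
    context: str
    expected: str


def fake_model(prompt_template: str, example: Example) -> str:
    """Hypothetical stand-in for an LLM call.

    Returns the context only when the template includes a {context} slot,
    mimicking a model that answers correctly only when given context.
    """
    if "{context}" in prompt_template:
        return example.context
    return "I don't know."


def contains_expected(output: str, example: Example) -> float:
    """Toy evaluator: 1.0 if the expected answer appears in the output, else 0.0."""
    return 1.0 if example.expected.lower() in output.lower() else 0.0


def run_experiment(
    prompt_template: str,
    dataset: List[Example],
    model: Callable[[str, Example], str],
    evaluator: Callable[[str, Example], float],
) -> float:
    """Score one prompt configuration as the mean evaluator score over the dataset."""
    return mean(evaluator(model(prompt_template, ex), ex) for ex in dataset)


DATASET = [
    Example("What is the capital of France?",
            "Paris is the capital of France.", "Paris"),
    Example("Who wrote Hamlet?",
            "Hamlet was written by William Shakespeare.", "Shakespeare"),
]

# Two prompt configurations to compare (the "A" and "B" of the A/B test).
VARIANTS: Dict[str, str] = {
    "A (no context)": "Answer the question: {question}",
    "B (with context)": "Use this context: {context}\nAnswer the question: {question}",
}

results = {
    name: run_experiment(template, DATASET, fake_model, contains_expected)
    for name, template in VARIANTS.items()
}

for name, score in results.items():
    print(f"{name}: mean score = {score:.2f}")
```

In a real experiment, the stand-in model would be replaced by an actual LLM call and the toy evaluator by a stronger one, while the structure (dataset, configurations, evaluator, aggregate score) stays the same.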
Getting started
Core LLM evaluation concepts
Learn the taxonomy of LLM evaluation
Run a Patronus evaluation
Insert the Patronus API or SDK into existing workflows
Patronus Evaluators
Plug in turnkey metrics for RAG, Agents, NLP, OWASP, and more
Patronus experiments
Iterate on LLM system performance
Logging and alerting
View logs and receive alerts
Production LLM monitoring
Trace an evaluation, experiment, or other workflow
Read a guide
Start building
Evaluate agentic outputs
Evaluate tool selection and tool outputs
Evaluations with task chaining
Evaluate a multi-step workflow by chaining tasks
LLM-as-judges
Configure custom criteria for your use case
Test datasets for evaluations
Enterprise-grade dataset generation
Local evaluators
Bring your evaluators to Patronus's platform
Human-in-the-loop evaluations
Augment evaluations with annotations
Compare prompt and model performance
Data visualizations and diffs on model outputs
Hallucination detection
Lynx: Patronus's fine-tuned hallucination detection model
Likert-based scoring
Glider: Patronus's fine-tuned rubric-based scoring model
Benchmark results
Model performance benchmarks
