Evaluator Benchmark Results
We continuously benchmark the performance of Patronus evaluators to deliver state-of-the-art automated evaluations. Below are results from a variety of benchmarks we have run our evaluators against, including open-source and proprietary datasets. Feel free to reach out to our team for more information about our benchmarking and evaluation procedures.
Hallucination
HaluBench is a hallucination evaluation benchmark with 15k samples covering domains such as finance and medicine. We show that our in-house faithfulness evaluator, Lynx, outperforms GPT-4o and other closed- and open-source models on this benchmark; a sketch of querying Lynx locally follows the table below.
| Model | HaluEval | RAGTruth | FinanceBench | DROP | CovidQA | PubMedQA | Overall |
|---|---|---|---|---|---|---|---|
| GPT-4o | 87.9% | 84.3% | 85.3% | 84.3% | 95.0% | 82.1% | 86.5% |
| GPT-4-Turbo | 86.0% | 85.0% | 82.2% | 84.8% | 90.6% | 83.5% | 85.0% |
| GPT-3.5-Turbo | 62.2% | 50.7% | 60.9% | 57.2% | 56.7% | 62.8% | 58.7% |
| Claude-3-Sonnet | 84.5% | 79.1% | 69.7% | 84.3% | 95.0% | 82.9% | 78.8% |
| Claude-3-Haiku | 68.9% | 78.9% | 58.4% | 84.3% | 95.0% | 82.9% | 69.0% |
| RAGAS Faithfulness | 70.6% | 75.8% | 59.5% | 59.6% | 75.0% | 67.7% | 66.9% |
| Mistral-Instruct-7B | 78.3% | 77.7% | 56.3% | 56.3% | 71.7% | 77.9% | 69.4% |
| Llama-3-Instruct-8B | 83.1% | 80.0% | 55.0% | 58.2% | 75.3% | 70.7% | 70.4% |
| Llama-3-Instruct-70B | 87.0% | 83.8% | 72.7% | 85.0% | 85.0% | 82.6% | 80.1% |
| Lynx (8B) | 85.7% | 80.0% | 72.5% | 96.3% | 96.3% | 85.2% | 82.9% |
| Lynx (70B) | 88.4% | 80.2% | 81.4% | 97.5% | 97.5% | 90.4% | 87.4% |
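The Lynx checkpoints have been released as open weights, so a faithfulness judgment like the ones scored above can be reproduced locally. The sketch below uses Hugging Face `transformers`; the model id and the simplified PASS/FAIL prompt are assumptions for illustration, not the exact template or serving setup used to produce the numbers in this table.

```python
# Minimal sketch (not the official Patronus SDK): judge one RAG sample for
# faithfulness with an open-weights Lynx checkpoint via Hugging Face transformers.
# The model id and prompt below are assumptions for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

question = "What was the patient's temperature on admission?"
document = "On admission the patient was febrile at 38.9 C and tachycardic."
answer = "The patient's temperature on admission was 37.0 C."

# Simplified judge prompt: ask whether ANSWER is faithful to DOCUMENT and
# request a JSON verdict with reasoning and a PASS/FAIL score.
prompt = (
    "Given the following QUESTION, DOCUMENT and ANSWER, determine whether the "
    "ANSWER is faithful to the DOCUMENT. Reply with a JSON object containing "
    '"REASONING" and "SCORE" (PASS or FAIL).\n\n'
    f"QUESTION: {question}\nDOCUMENT: {document}\nANSWER: {answer}"
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```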
Long Context Faithfulness
| Model | CovidQA (Long Context split) | FinanceBench (Long Context split) |
|---|---|---|
| hallucination-large-2024-07-23 | 94.60% | 87.33% |
| RAGAS | 56% | NA* |
| Lynx (8B) | 96% | 76% |
| Llama-3.1-8B-Instruct | 85.40% | 65.33% |
| Llama-3.1-70B-Instruct-Turbo | 89.40% | 74.67% |
| GPT-4o-mini | 85% | 60.67% |
*RAGAS fails to extract statements from FinanceBench (Long Context split) because answers often contain only a single number, such as a net revenue figure. RAGAS relies on decomposing answers into statements, which it cannot do in these cases.
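To make the footnote concrete, here is a minimal sketch of running RAGAS faithfulness on a single-number answer. It assumes the older 0.1-style ragas API (`evaluate` plus the `faithfulness` metric, with an OpenAI key configured for the underlying judge LLM); newer ragas releases expose a different interface, and the sample data is hypothetical.

```python
# Sketch of the legacy ragas 0.1-style API (assumed). Faithfulness first
# decomposes the answer into statements, which can fail when the answer is
# just a single number, as described in the footnote above.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

sample = {
    "question": ["What was FY2022 net revenue?"],   # hypothetical sample
    "answer": ["$12.3 billion."],                   # single-number answer
    "contexts": [[
        "The company reported net revenue of $12.3 billion for fiscal year 2022."
    ]],
}

# Requires an OpenAI API key in the environment for the default judge LLM.
result = evaluate(Dataset.from_dict(sample), metrics=[faithfulness])
print(result)  # faithfulness may be NaN when no statements can be extracted
```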
Answer Relevance
| Dataset | answer_relevancy (ragas) | retrieval-answer-relevance |
|---|---|---|
| WikiQA (n=150)* | 33.3% | 86.7% |
| Factual Knowledge (n=100) | 89.0% | 100.0% |
*WikiQA is available on Hugging Face
Context Relevance
| Dataset | context_relevancy (ragas) | retrieval-context-relevance |
|---|---|---|
| Automotive QA (n=100)* | 54.5% | 83.5% |
*Domain-specific dataset constructed from real-world documents
Judge Evaluator
| Dataset | judge-large |
|---|---|
| World Knowledge QA (n=100)* | 90.0% |
| Patronus Custom Criteria (n=102) | 93.13% |
*Dataset constructed from customer-provided criteria and user queries
| Dataset | judge-small |
|---|---|
| owasp-llm01-prompt-injection (n=100) | 89% |
| enterprise-pii-outputs (n=100) | 88% |
| pii-outputs (n=150) | 90.67% |
| criminal-planning (n=150) | 88% |
Toxicity Evaluator
| Category | Patronus Toxicity | Perspective API | Llama-Guard-3-8B |
|---|---|---|---|
| Hate | 88.27% | 60.49% | 72.84% |
| Sexual | 96.62% | 63.71% | 87.34% |
| Violence | 82.98% | 55.32% | 71.27% |
| CSAM | 96.47% | 72.94% | 88.24% |
| Hate (Threatening) | 95.12% | 78.05% | 90.24% |
| Overall | 92.25% | 63.81% | 81.42% |
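For context on the Perspective API baseline, the sketch below shows a typical way to obtain a toxicity score for a single sample with Google's `commentanalyzer` client. The 0.5 decision threshold used to binarize the score is an assumption, not necessarily the one used in the benchmark above.

```python
# Minimal sketch of scoring one text with the Perspective API (the baseline
# column above). Thresholding at 0.5 to get a binary toxic/non-toxic label is
# an assumption for illustration.
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

request = {
    "comment": {"text": "You are a terrible person and everyone hates you."},
    "requestedAttributes": {"TOXICITY": {}},
}
response = client.comments().analyze(body=request).execute()
score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print("toxic" if score >= 0.5 else "non-toxic", score)
```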
