Evaluator Benchmark Results
We continuously benchmark the performance of Patronus evaluators to deliver state-of-the-art automated evaluations. Below are a variety of benchmarks we have run our evaluators against, including open-source and proprietary datasets. Feel free to reach out to our team for more information about benchmark and evaluation procedures.
Hallucination
HaluBench is a hallucination evaluation benchmark with 15k samples covering domains such as finance and medicine. The 70B version of our in-house faithfulness evaluator, Lynx, achieves the highest overall score on this benchmark, ahead of GPT-4o and the other closed- and open-source models tested.
Model | HaluEval | RAGTruth | FinanceBench | DROP | CovidQA | PubMedQA | Overall |
---|---|---|---|---|---|---|---|
GPT-4o | 87.9% | 84.3% | 85.3% | 84.3% | 95.0% | 82.1% | 86.5% |
GPT-4-Turbo | 86.0% | 85.0% | 82.2% | 84.8% | 90.6% | 83.5% | 85.0% |
GPT-3.5-Turbo | 62.2% | 50.7% | 60.9% | 57.2% | 56.7% | 62.8% | 58.7% |
Claude-3-Sonnet | 84.5% | 79.1% | 69.7% | 84.3% | 95.0% | 82.9% | 78.8% |
Claude-3-Haiku | 68.9% | 78.9% | 58.4% | 84.3% | 95.0% | 82.9% | 69.0% |
RAGAS Faithfulness | 70.6% | 75.8% | 59.5% | 59.6% | 75.0% | 67.7% | 66.9% |
Mistral-Instruct-7B | 78.3% | 77.7% | 56.3% | 56.3% | 71.7% | 77.9% | 69.4% |
Llama-3-Instruct-8B | 83.1% | 80.0% | 55.0% | 58.2% | 75.3% | 70.7% | 70.4% |
Llama-3-Instruct-70B | 87.0% | 83.8% | 72.7% | 85.0% | 85.0% | 82.6% | 80.1% |
Lynx (8B) | 85.7% | 80.0% | 72.5% | 96.3% | 96.3% | 85.2% | 82.9% |
Lynx (70B) | 88.4% | 80.2% | 81.4% | 97.5% | 97.5% | 90.4% | 87.4% |
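As a rough guide to reading the table, the sketch below shows one way per-dataset and overall scores could be computed. It assumes the percentages are accuracy against binary PASS/FAIL gold labels, which this page does not state explicitly. The function name, record layout, and toy data are illustrative assumptions, not the Patronus evaluation code. It reports both a sample-weighted (micro) and an unweighted (macro) overall figure, since the page does not say which one the Overall column uses.

```python
from collections import defaultdict


def accuracy_by_dataset(records):
    """Compute per-dataset accuracy plus micro and macro overall accuracy.

    `records` is an iterable of (dataset, gold_label, predicted_label)
    tuples. This layout is a hypothetical one chosen for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for dataset, gold, pred in records:
        total[dataset] += 1
        correct[dataset] += int(gold == pred)

    per_dataset = {d: correct[d] / total[d] for d in total}
    # Micro: every sample counts equally, so large datasets dominate.
    micro = sum(correct.values()) / sum(total.values())
    # Macro: every dataset counts equally, regardless of its size.
    macro = sum(per_dataset.values()) / len(per_dataset)
    return per_dataset, micro, macro


# Toy data (invented for illustration, not benchmark samples).
toy_records = [
    ("A", "PASS", "PASS"),
    ("A", "FAIL", "PASS"),
    ("A", "FAIL", "FAIL"),
    ("A", "PASS", "PASS"),
    ("B", "PASS", "FAIL"),
    ("B", "FAIL", "FAIL"),
]
per_dataset, micro, macro = accuracy_by_dataset(toy_records)
print(per_dataset, round(micro, 4), round(macro, 4))
```

The two overall figures diverge whenever datasets differ in size, which is worth keeping in mind when comparing an Overall column against a simple average of the per-dataset columns.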
Long Context Faithfulness
Model | CovidQA (Long Context split) | FinanceBench (Long Context split) |
---|---|---|
hallucination-large-2024-07-23 | 94.60% | 87.33% |
RAGAS | 56% | NA* |
Lynx (8B) | 96% | 76% |
Llama-3.1-8B-Instruct | 85.40% | 65.33% |
Llama-3.1-70B-Instruct-Turbo | 89.40% | 74.67% |
GPT-4o-mini | 85% | 60.67% |
*RAGAS fails to extract statements from the FinanceBench (Long Context split) because answers often consist of a single number, such as net revenue. RAGAS relies on decomposing answers into statements, which it cannot do in such cases.
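The failure mode described in the footnote can be illustrated with a toy example. This is not RAGAS code: `split_into_statements` is a hypothetical rule-based stand-in for an LLM-based statement extractor, and the answers are invented. It only shows why a score defined as supported statements divided by total statements becomes undefined when no statements can be extracted.

```python
import math
import re


def split_into_statements(answer: str) -> list[str]:
    """Toy stand-in for a statement extractor (an assumption, not RAGAS).

    Splits on sentence-ending punctuation followed by whitespace and
    keeps only chunks with at least three words.
    """
    chunks = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [c for c in chunks if len(c.split()) >= 3]


def decomposition_faithfulness(answer: str, is_supported) -> float:
    """Fraction of extracted statements that `is_supported` accepts.

    Returns NaN when nothing can be extracted, because the ratio
    supported / total is undefined for zero statements.
    """
    statements = split_into_statements(answer)
    if not statements:
        return math.nan
    supported = sum(1 for s in statements if is_supported(s))
    return supported / len(statements)


# Invented answers: one full-sentence answer, one bare number.
always_supported = lambda statement: True
print(decomposition_faithfulness(
    "Net revenue was $1.0 billion in the period. It grew compared to last year.",
    always_supported,
))
print(decomposition_faithfulness("$1.0 billion", always_supported))
```

An evaluator that judges the answer as a whole against the context does not need this intermediate decomposition step, so a single-number answer does not leave it without a score.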
Answer Relevance
Dataset | answer_relevancy (ragas) | retrieval-answer-relevance |
---|---|---|
WikiQA (n=150)* | 33.3% | 86.7% |
Factual Knowledge (n=100) | 89.0% | 100.0% |
*WikiQA is available on Hugging Face
Context Relevance
Dataset | context_relevancy (ragas) | retrieval-context-relevance |
---|---|---|
Automotive QA (n=100)* | 54.5% | 83.5% |
*Domain-specific dataset constructed from real-world documents
Judge Evaluator
Dataset | judge-large |
---|---|
World Knowledge QA (n=100)** | 90.0% |
Patronus Custom Criteria (n=102) | 93.13% |
*We use the following prompt for the analysis:
**Dataset constructed from customer-provided criteria and user queries
Dataset | judge-small |
---|---|
owasp-llm01-prompt-injection (n=100) | 89% |
enterprise-pii-outputs (n=100) | 88% |
pii-outputs (n=150) | 90.67% |
criminal-planning (n=150) | 88% |
Toxicity Evaluator
Category | Patronus Toxicity | Perspective API | Llama-Guard-3-8B |
---|---|---|---|
Hate | 88.27% | 60.49% | 72.84% |
Sexual | 96.62% | 63.71% | 87.34% |
Violence | 82.98% | 55.32% | 71.27% |
CSAM | 96.47% | 72.94% | 88.24% |
Hate (Threatening) | 95.12% | 78.05% | 90.24% |
Overall | 92.25% | 63.81% | 81.42% |