Evaluator Benchmark Results
We continuously benchmark the performance of Patronus evaluators to deliver state-of-the-art automated evaluations. Below are results from a variety of benchmarks we have run our evaluators against, including open-source and proprietary datasets. Feel free to reach out to our team for more information about our benchmarking and evaluation procedures.
Hallucination
HaluBench is a hallucination evaluation benchmark with 15k samples covering domains such as finance and medicine. We show that our in-house faithfulness evaluator, Lynx, outperforms GPT-4o and other closed- and open-source models on this benchmark; a sketch of querying Lynx locally follows the table below.
| Model | HaluEval | RAGTruth | FinanceBench | DROP | CovidQA | PubMedQA | Overall |
|---|---|---|---|---|---|---|---|
| GPT-4o | 87.9% | 84.3% | 85.3% | 84.3% | 95.0% | 82.1% | 86.5% |
| GPT-4-Turbo | 86.0% | 85.0% | 82.2% | 84.8% | 90.6% | 83.5% | 85.0% |
| GPT-3.5-Turbo | 62.2% | 50.7% | 60.9% | 57.2% | 56.7% | 62.8% | 58.7% |
| Claude-3-Sonnet | 84.5% | 79.1% | 69.7% | 84.3% | 95.0% | 82.9% | 78.8% |
| Claude-3-Haiku | 68.9% | 78.9% | 58.4% | 84.3% | 95.0% | 82.9% | 69.0% |
| RAGAS Faithfulness | 70.6% | 75.8% | 59.5% | 59.6% | 75.0% | 67.7% | 66.9% |
| Mistral-Instruct-7B | 78.3% | 77.7% | 56.3% | 56.3% | 71.7% | 77.9% | 69.4% |
| Llama-3-Instruct-8B | 83.1% | 80.0% | 55.0% | 58.2% | 75.3% | 70.7% | 70.4% |
| Llama-3-Instruct-70B | 87.0% | 83.8% | 72.7% | 85.0% | 85.0% | 82.6% | 80.1% |
| Lynx (8B) | 85.7% | 80.0% | 72.5% | 96.3% | 96.3% | 85.2% | 82.9% |
| Lynx (70B) | 88.4% | 80.2% | 81.4% | 97.5% | 97.5% | 90.4% | 87.4% |
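The Lynx checkpoints have been released as open weights, so a faithfulness judgment like the ones scored above can be reproduced locally. The sketch below uses Hugging Face `transformers`; the model id and the simplified PASS/FAIL prompt are assumptions for illustration, not the exact template or serving setup used to produce the numbers in this table.

```python
# Minimal sketch (not the official Patronus SDK): judge one RAG sample for
# faithfulness with an open-weights Lynx checkpoint via Hugging Face transformers.
# The model id and prompt below are assumptions for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct"  # assumed repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

question = "What was the patient's temperature on admission?"
document = "On admission the patient was febrile at 38.9 C and tachycardic."
answer = "The patient's temperature on admission was 37.0 C."

# Simplified judge prompt: ask whether ANSWER is faithful to DOCUMENT and
# request a JSON verdict with reasoning and a PASS/FAIL score.
prompt = (
    "Given the following QUESTION, DOCUMENT and ANSWER, determine whether the "
    "ANSWER is faithful to the DOCUMENT. Reply with a JSON object containing "
    '"REASONING" and "SCORE" (PASS or FAIL).\n\n'
    f"QUESTION: {question}\nDOCUMENT: {document}\nANSWER: {answer}"
)

inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```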
Long Context Faithfulness
| Model | CovidQA (Long Context split) | FinanceBench (Long Context split) |
|---|---|---|
| hallucination-large-2024-07-23 | 94.60% | 87.33% |
| RAGAS | 56% | NA* |
| Lynx (8B) | 96% | 76% |
| Llama-3.1-8B-Instruct | 85.40% | 65.33% |
| Llama-3.1-70B-Instruct-Turbo | 89.40% | 74.67% |
| GPT-4o-mini | 85% | 60.67% |
*RAGAS fails to extract statements from FinanceBench (Long Context split) because answers often contain only a single number, such as a net revenue figure. RAGAS relies on decomposing answers into statements, which it cannot do in these cases.
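To make the footnote concrete, here is a minimal sketch of running RAGAS faithfulness on a single-number answer. It assumes the older 0.1-style ragas API (`evaluate` plus the `faithfulness` metric, with an OpenAI key configured for the underlying judge LLM); newer ragas releases expose a different interface, and the sample data is hypothetical.

```python
# Sketch of the legacy ragas 0.1-style API (assumed). Faithfulness first
# decomposes the answer into statements, which can fail when the answer is
# just a single number, as described in the footnote above.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

sample = {
    "question": ["What was FY2022 net revenue?"],   # hypothetical sample
    "answer": ["$12.3 billion."],                   # single-number answer
    "contexts": [[
        "The company reported net revenue of $12.3 billion for fiscal year 2022."
    ]],
}

# Requires an OpenAI API key in the environment for the default judge LLM.
result = evaluate(Dataset.from_dict(sample), metrics=[faithfulness])
print(result)  # faithfulness may be NaN when no statements can be extracted
```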
Answer Relevance
| Dataset | answer_relevancy (ragas) | retrieval-answer-relevance |
|---|---|---|
| WikiQA (n=150)* | 33.3% | 86.7% |
| Factual Knowledge (n=100) | 89.0% | 100.0% |
*WikiQA is available on Hugging Face
Context Relevance
| Dataset | context_relevancy (ragas) | retrieval-context-relevance |
|---|---|---|
| Automotive QA (n=100)* | 54.5% | 83.5% |
*Domain-specific dataset constructed from real-world documents
Judge Evaluator
| Dataset | judge-large |
|---|---|
| World Knowledge QA (n=100)* | 90.0% |
| Patronus Custom Criteria (n=102) | 93.13% |
*Dataset constructed from customer-provided criteria and user queries
| Dataset | judge-small |
|---|---|
| owasp-llm01-prompt-injection (n=100) | 89% |
| enterprise-pii-outputs (n=100) | 88% |
| pii-outputs (n=150) | 90.67% |
| criminal-planning (n=150) | 88% |
Toxicity Evaluator
| Category | Patronus Toxicity | Perspective API | Llama-Guard-3-8B |
|---|---|---|---|
| Hate | 88.27% | 60.49% | 72.84% |
| Sexual | 96.62% | 63.71% | 87.34% |
| Violence | 82.98% | 55.32% | 71.27% |
| CSAM | 96.47% | 72.94% | 88.24% |
| Hate (Threatening) | 95.12% | 78.05% | 90.24% |
| Overall | 92.25% | 63.81% | 81.42% |
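For context on the Perspective API baseline, the sketch below shows a typical way to obtain a toxicity score for a single sample with Google's `commentanalyzer` client. The 0.5 decision threshold used to binarize the score is an assumption, not necessarily the one used in the benchmark above.

```python
# Minimal sketch of scoring one text with the Perspective API (the baseline
# column above). Thresholding at 0.5 to get a binary toxic/non-toxic label is
# an assumption for illustration.
from googleapiclient import discovery

API_KEY = "YOUR_PERSPECTIVE_API_KEY"  # placeholder

client = discovery.build(
    "commentanalyzer",
    "v1alpha1",
    developerKey=API_KEY,
    discoveryServiceUrl="https://commentanalyzer.googleapis.com/$discovery/rest?version=v1alpha1",
    static_discovery=False,
)

request = {
    "comment": {"text": "You are a terrible person and everyone hates you."},
    "requestedAttributes": {"TOXICITY": {}},
}
response = client.comments().analyze(body=request).execute()
score = response["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print("toxic" if score >= 0.5 else "non-toxic", score)
```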
