Evaluator Benchmark Results

We continuously benchmark the performance of Patronus evaluators to deliver state-of-the-art automated evaluations. Below are results from a variety of benchmarks we have run our evaluators against, including open-source and proprietary datasets. Feel free to reach out to our team for more information about our benchmark and evaluation procedures.

Hallucination

HaluBench is a hallucination evaluation benchmark with 15k samples spanning domains such as finance and medicine. We show that Lynx, our in-house faithfulness evaluator, outperforms GPT-4o and other closed- and open-source models on this benchmark.

| Model | HaluEval | RAGTruth | FinanceBench | DROP | CovidQA | PubMedQA | Overall |
|---|---|---|---|---|---|---|---|
| GPT-4o | 87.9% | 84.3% | 85.3% | 84.3% | 95.0% | 82.1% | 86.5% |
| GPT-4-Turbo | 86.0% | 85.0% | 82.2% | 84.8% | 90.6% | 83.5% | 85.0% |
| GPT-3.5-Turbo | 62.2% | 50.7% | 60.9% | 57.2% | 56.7% | 62.8% | 58.7% |
| Claude-3-Sonnet | 84.5% | 79.1% | 69.7% | 84.3% | 95.0% | 82.9% | 78.8% |
| Claude-3-Haiku | 68.9% | 78.9% | 58.4% | 84.3% | 95.0% | 82.9% | 69.0% |
| RAGAS Faithfulness | 70.6% | 75.8% | 59.5% | 59.6% | 75.0% | 67.7% | 66.9% |
| Mistral-Instruct-7B | 78.3% | 77.7% | 56.3% | 56.3% | 71.7% | 77.9% | 69.4% |
| Llama-3-Instruct-8B | 83.1% | 80.0% | 55.0% | 58.2% | 75.3% | 70.7% | 70.4% |
| Llama-3-Instruct-70B | 87.0% | 83.8% | 72.7% | 85.0% | 85.0% | 82.6% | 80.1% |
| Lynx (8B) | 85.7% | 80.0% | 72.5% | 96.3% | 96.3% | 85.2% | 82.9% |
| Lynx (70B) | 88.4% | 80.2% | 81.4% | 97.5% | 97.5% | 90.4% | 87.4% |
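
Lynx is also released as open weights on Hugging Face, so a faithfulness check can be reproduced locally. The sketch below loads the 8B checkpoint with the `transformers` library; the repository id and prompt wording are assumptions on our part (not taken verbatim from this page), so check the model card for the exact prompt template Lynx expects before relying on the output format.

```python
# Minimal sketch: scoring one (question, document, answer) triple with Lynx-8B.
# The Hugging Face repo id and the prompt wording below are assumptions; verify
# both against the published model card before use.
from transformers import pipeline

MODEL_ID = "PatronusAI/Llama-3-Patronus-Lynx-8B-Instruct"  # assumed repo id

generator = pipeline("text-generation", model=MODEL_ID, device_map="auto")

prompt = (
    "Given the following QUESTION, DOCUMENT and ANSWER, determine whether the "
    "ANSWER is faithful to the DOCUMENT. Respond with PASS if it is faithful "
    "and FAIL if it is not.\n\n"
    "QUESTION: What was net revenue in 2023?\n"
    "DOCUMENT: The company reported net revenue of $4.2B in fiscal year 2023.\n"
    "ANSWER: Net revenue in 2023 was $4.2B.\n"
)

result = generator(prompt, max_new_tokens=256, do_sample=False)
print(result[0]["generated_text"])  # expect a PASS/FAIL style verdict
```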

Long Context Faithfulness


| Model | CovidQA (Long Context split) | FinanceBench (Long Context split) |
|---|---|---|
| hallucination-large-2024-07-23 | 94.60% | 87.33% |
| RAGAS | 56% | NA* |
| Lynx (8B) | 96% | 76% |
| Llama-3.1-8B-Instruct | 85.40% | 65.33% |
| Llama-3.1-70B-Instruct-Turbo | 89.40% | 74.67% |
| GPT-4o-mini | 85% | 60.67% |

*RAGAS fails to extract statements from FinanceBench (Long Context split) because the answers often consist of a single number, such as net revenue. RAGAS relies on decomposing answers into statements, which it cannot do in these cases.
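
For reference, the RAGAS baseline can be reproduced with the open-source `ragas` package. The snippet below is a minimal sketch assuming the ragas 0.1-style `evaluate` API (with `question`/`answer`/`contexts` columns) and an OpenAI key for the underlying judge model; newer ragas releases expose a different interface.

```python
# Minimal sketch of the RAGAS faithfulness baseline (ragas 0.1-style API assumed).
# faithfulness decomposes the answer into statements and checks each against the
# retrieved context, which is why bare single-number answers are problematic.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness

data = {
    "question": ["What was net revenue in 2023?"],
    "answer": ["$4.2B"],  # a bare number: hard to split into statements
    "contexts": [["The company reported net revenue of $4.2B in fiscal year 2023."]],
}

result = evaluate(Dataset.from_dict(data), metrics=[faithfulness])
print(result)  # statement extraction can fail here, yielding no usable score
```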

Answer Relevance

| Dataset | answer_relevancy (ragas) | retrieval-answer-relevance |
|---|---|---|
| WikiQA (n=150)* | 33.3% | 86.7% |
| Factual Knowledge (n=100) | 89.0% | 100.0% |

*WikiQA is available on Hugging Face


Context Relevance

| Dataset | context_relevancy (ragas) | retrieval-context-relevance |
|---|---|---|
| Automotive QA (n=100)* | 54.5% | 83.5% |


*Domain-specific dataset constructed from real-world documents

Judge Evaluator

| Dataset | judge-large |
|---|---|
| World Knowledge QA (n=100)* | 90.0% |
| Patronus Custom Criteria (n=102)** | 93.13% |

*We use the following prompt for the analysis:

Your task is to score text on whether it passes the provided criteria. If yes, output 1. If no, output 0. You must respond with 1 or 0.

Input: {USER_INPUT}
Output: {MODEL_OUTPUT}
Label: {GOLD_ANSWER}
Criteria: {CRITERIA}

**Dataset constructed from customer-provided criteria and user queries
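
To illustrate how this prompt is used for benchmark scoring, the sketch below fills the template for each example and computes accuracy over the judge's binary verdicts. The `call_judge` function is a hypothetical placeholder for whatever judge model is being benchmarked, and the lowercase placeholder names are our own; neither is part of a Patronus API.

```python
# Sketch of filling the judge prompt and scoring its 1/0 outputs against
# expected verdicts. call_judge is a hypothetical placeholder.
PROMPT_TEMPLATE = """Your task is to score text on whether it passes the provided criteria. If yes, output 1. If no, output 0. You must respond with 1 or 0.

Input: {user_input}
Output: {model_output}
Label: {gold_answer}
Criteria: {criteria}"""


def call_judge(prompt: str) -> str:
    """Hypothetical judge invocation; replace with a real model or API call."""
    raise NotImplementedError


def score_examples(examples: list[dict]) -> float:
    """Return the judge's accuracy against expected 1/0 verdicts."""
    correct = 0
    for ex in examples:
        prompt = PROMPT_TEMPLATE.format(
            user_input=ex["user_input"],
            model_output=ex["model_output"],
            gold_answer=ex["gold_answer"],
            criteria=ex["criteria"],
        )
        verdict = call_judge(prompt).strip()
        correct += int(verdict == ex["expected_verdict"])  # "1" or "0"
    return correct / len(examples)
```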


| Dataset | judge-small |
|---|---|
| owasp-llm01-prompt-injection (n=100) | 89% |
| enterprise-pii-outputs (n=100) | 88% |
| pii-outputs (n=150) | 90.67% |
| criminal-planning (n=150) | 88% |


Toxicity Evaluator

| Category | Patronus Toxicity | Perspective API | Llama-Guard-3-8B |
|---|---|---|---|
| Hate | 88.27% | 60.49% | 72.84% |
| Sexual | 96.62% | 63.71% | 87.34% |
| Violence | 82.98% | 55.32% | 71.27% |
| CSAM | 96.47% | 72.94% | 88.24% |
| Hate (Threatening) | 95.12% | 78.05% | 90.24% |
| Overall | 92.25% | 63.81% | 81.42% |