Research and Differentiators

Evaluator Benchmark Results

We continuously benchmark the performance of Patronus evaluators to deliver state-of-the-art automated evaluations. Below are results from a variety of benchmarks we have run our evaluators against, including open-source and proprietary datasets. Feel free to reach out to our team for more information about our benchmarking and evaluation procedures.

Hallucination

HaluBench is a hallucination evaluation benchmark with 15k samples covering domains such as finance and medicine. Our in-house faithfulness evaluator, Lynx, outperforms GPT-4o and other closed- and open-source models on this benchmark.

| Model | HaluEval | RAGTruth | FinanceBench | DROP | CovidQA | PubMedQA | Overall |
|---|---|---|---|---|---|---|---|
| GPT-4o | 87.9% | 84.3% | 85.3% | 84.3% | 95.0% | 82.1% | 86.5% |
| GPT-4-Turbo | 86.0% | 85.0% | 82.2% | 84.8% | 90.6% | 83.5% | 85.0% |
| GPT-3.5-Turbo | 62.2% | 50.7% | 60.9% | 57.2% | 56.7% | 62.8% | 58.7% |
| Claude-3-Sonnet | 84.5% | 79.1% | 69.7% | 84.3% | 95.0% | 82.9% | 78.8% |
| Claude-3-Haiku | 68.9% | 78.9% | 58.4% | 84.3% | 95.0% | 82.9% | 69.0% |
| RAGAS Faithfulness | 70.6% | 75.8% | 59.5% | 59.6% | 75.0% | 67.7% | 66.9% |
| Mistral-Instruct-7B | 78.3% | 77.7% | 56.3% | 56.3% | 71.7% | 77.9% | 69.4% |
| Llama-3-Instruct-8B | 83.1% | 80.0% | 55.0% | 58.2% | 75.3% | 70.7% | 70.4% |
| Llama-3-Instruct-70B | 87.0% | 83.8% | 72.7% | 85.0% | 85.0% | 82.6% | 80.1% |
| Lynx (8B) | 85.7% | 80.0% | 72.5% | 96.3% | 96.3% | 85.2% | 82.9% |
| Lynx (70B) | 88.4% | 80.2% | 81.4% | 97.5% | 97.5% | 90.4% | 87.4% |
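
For reference, here is a minimal sketch of how accuracy figures like these can be tallied from binary pass/fail judgments. The sample field names (`dataset`, `question`, `context`, `answer`, `label`) and the pooled "Overall" computation are illustrative assumptions, not HaluBench's actual schema or our exact procedure.

```python
# Sketch: per-dataset and overall accuracy for a binary faithfulness
# judge on a HaluBench-style benchmark. Field names are illustrative.
from collections import defaultdict
from typing import Callable, Dict, Iterable

def accuracy_report(samples: Iterable[dict],
                    judge: Callable[[str, str, str], str]) -> Dict[str, float]:
    """judge(question, context, answer) returns "PASS" or "FAIL";
    each sample carries a gold "label" with the same values."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for s in samples:
        verdict = judge(s["question"], s["context"], s["answer"])
        for key in (s["dataset"], "Overall"):  # "Overall" pools all samples
            totals[key] += 1
            hits[key] += verdict == s["label"]
    return {k: hits[k] / totals[k] for k in totals}
```

Running such a report once per model over the benchmark samples would produce one row of a table shaped like the one above.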

Long Context Faithfulness

| Model | CovidQA (Long Context split) | FinanceBench (Long Context split) |
|---|---|---|
| hallucination-large-2024-07-23 | 94.60% | 87.33% |
| RAGAS | 56% | NA* |
| Lynx (8B) | 96% | 76% |
| Llama-3.1-8B-Instruct | 85.40% | 65.33% |
| Llama-3.1-70B-Instruct-Turbo | 89.40% | 74.67% |
| GPT-4o-mini | 85% | 60.67% |

*RAGAS fails to extract statements from FinanceBench (Long Context split) because answers often contain only a single number, such as a net revenue figure. RAGAS relies on decomposing answers into statements, which fails in such cases.
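
To make that failure mode concrete, below is a rough sketch of statement-based faithfulness scoring in the RAGAS style. The `decompose` and `is_supported` callables are hypothetical stand-ins for the LLM calls the ragas library makes internally; this is not its actual API.

```python
# Sketch of RAGAS-style statement-based faithfulness scoring.
# decompose() and is_supported() are hypothetical stand-ins for the
# LLM calls ragas makes internally; this is not the ragas API.
from typing import Callable, List, Optional

def faithfulness(answer: str, context: str,
                 decompose: Callable[[str], List[str]],
                 is_supported: Callable[[str, str], bool]) -> Optional[float]:
    statements = decompose(answer)
    if not statements:
        # A bare numeric answer (e.g. a single net-revenue figure)
        # yields no statements, so no score can be computed --
        # hence the NA* entry in the table above.
        return None
    supported = sum(is_supported(s, context) for s in statements)
    return supported / len(statements)
```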

Answer Relevance

| Dataset | answer_relevancy (ragas) | retrieval-answer-relevance |
|---|---|---|
| WikiQA (n=150)* | 33.3% | 86.7% |
| Factual Knowledge (n=100) | 89.0% | 100.0% |

*WikiQA is available on Hugging Face

Context Relevance

| Dataset | context_relevancy (ragas) | retrieval-context-relevance |
|---|---|---|
| Automotive QA (n=100)* | 54.5% | 83.5% |

*Domain-specific dataset constructed from real-world documents

Judge Evaluator

| Dataset | judge-large |
|---|---|
| World Knowledge QA (n=100)* | 90.0% |
| Patronus Custom Criteria (n=102)** | 93.13% |

*We use the following prompt for the analysis:

```
Your task is to score text on whether it passes the provided criteria. If yes, output 1. If no, output 0. You must respond with 1 or 0.

Input: {USER_INPUT}
Output: {MODEL_OUTPUT}
Label: {GOLD_ANSWER}
Criteria: {CRITERIA}
```

**Dataset constructed from customer-provided criteria and user queries
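
As a minimal sketch, the prompt above can be filled per sample as follows. `score_sample` and `send_to_judge` are hypothetical names introduced for illustration; `send_to_judge` stands in for whatever client call runs the judge model.

```python
# Sketch: filling the analysis prompt above for one benchmark sample.
# PROMPT mirrors the template verbatim; send_to_judge() is a hypothetical
# placeholder for the call that runs the judge model.
PROMPT = """Your task is to score text on whether it passes the provided criteria. If yes, output 1. If no, output 0. You must respond with 1 or 0.

Input: {USER_INPUT}
Output: {MODEL_OUTPUT}
Label: {GOLD_ANSWER}
Criteria: {CRITERIA}"""

def score_sample(user_input: str, model_output: str, gold_answer: str,
                 criteria: str, send_to_judge) -> int:
    filled = PROMPT.format(USER_INPUT=user_input, MODEL_OUTPUT=model_output,
                           GOLD_ANSWER=gold_answer, CRITERIA=criteria)
    return int(send_to_judge(filled).strip() == "1")  # 1 = pass, 0 = fail
```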

| Dataset | judge-small |
|---|---|
| owasp-llm01-prompt-injection (n=100) | 89% |
| enterprise-pii-outputs (n=100) | 88% |
| pii-outputs (n=150) | 90.67% |
| criminal-planning (n=150) | 88% |

Toxicity Evaluator

| Category | Patronus Toxicity | Perspective API | Llama-Guard-3-8B |
|---|---|---|---|
| Hate | 88.27% | 60.49% | 72.84% |
| Sexual | 96.62% | 63.71% | 87.34% |
| Violence | 82.98% | 55.32% | 71.27% |
| CSAM | 96.47% | 72.94% | 88.24% |
| Hate (Threatening) | 95.12% | 78.05% | 90.24% |
| Overall | 92.25% | 63.81% | 81.42% |
