Benchmarking

We take the performance of our evaluators very seriously and continuously benchmark them to uphold our claim of providing state-of-the-art evaluators. Below are a variety of benchmarks we have run our evaluators against. Some are open-source, while others are proprietary datasets of ours. We provide as much detail as we can about the composition and setup of each benchmark, and we are always happy to discuss them further.
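As a rough guide to reading the tables, we assume each reported percentage is agreement with gold labels: every benchmark example carries a pass/fail label, and the number is the share of examples on which the evaluator's verdict matches that label. Below is a minimal sketch of that scoring loop; the field names and helper are illustrative, not our production harness.

```python
from typing import Callable

def benchmark_accuracy(examples: list[dict], evaluator: Callable[[dict], int]) -> float:
    """Share of labeled examples where the evaluator's 0/1 verdict matches the gold label."""
    correct = sum(evaluator(example) == example["label"] for example in examples)
    return correct / len(examples)

# e.g. under this reading, 86.7% on a 150-example dataset corresponds to
# agreement with the gold label on roughly 130 of 150 examples
```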

Answer Relevance

| Dataset | answer_relevancy (ragas) | retrieval-answer-relevance-v2 |
| --- | --- | --- |
| WikiQA (n=150)* | 33.3% | 86.7% |
| Factual Knowledge (n=100) | 89.0% | 100.0% |

*WikiQA is available on Hugging Face

Context Relevance

| Dataset | context_relevancy (ragas) | retrieval-context-relevance-v1 |
| --- | --- | --- |
| Automotive QA (n=100)* | 54.5% | 83.5% |

*Domain-specific dataset constructed from real-world automotive documents

Answer Hallucination (Faithfulness)

| Dataset | GPT-4 with Rigorous Prompting* | retrieval-hallucination-v2 |
| --- | --- | --- |
| WikiQA -- ungrounded (n=50) | 70.0% | 90.0% |
| WikiQA -- grounded (n=50) | 70.0% | 86.0% |
| Hard Reasoning Scenarios (n=14) | 64.0% | 100.0% |

*Some customers come to us with a GPT-4 evaluator already set up, only to find that its performance is subpar. We routinely improve their evaluation performance by at least 20% by providing them with the best evaluators on the market.

| Dataset | Braintrust (ragas) | LangChain StringEvaluator (GPT-4) | Galileo ChainPoll (GPT-3.5)* | retrieval-hallucination-v2 |
| --- | --- | --- | --- | --- |
| HaluEval (n=100)** | 71.0% | 47.0% | 72.0% | 87.0% |
| Factual Knowledge (n=100) | 63.0% | 77.0% | 59.0% | 93.0% |

*Replication of ChainPoll using chain-of-thought prompting, based on Galileo's published work; we did not have access to their API (see the sketch below)

**HaluEval is available on Hugging Face.
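For reference, the sketch below shows roughly how our ChainPoll replication works: sample several chain-of-thought judgments and use the fraction that flag a hallucination as the score. The prompt wording, helper name, and use of the OpenAI chat API are our assumptions based on Galileo's published description, not their implementation.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

CHAINPOLL_PROMPT = """Does the answer contain information that is not supported by the context?
Think step by step, then end with a final line of exactly "VERDICT: YES" or "VERDICT: NO".

Context: {context}
Question: {question}
Answer: {answer}"""

def chainpoll_score(context: str, question: str, answer: str, polls: int = 5) -> float:
    """Fraction of sampled chain-of-thought judgments that flag a hallucination."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": CHAINPOLL_PROMPT.format(
            context=context, question=question, answer=answer)}],
        temperature=1.0,  # keep sampling diversity across the polled completions
        n=polls,          # draw several independent completions in one request
    )
    votes = ["VERDICT: YES" in (choice.message.content or "").upper() for choice in response.choices]
    return sum(votes) / len(votes)
```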

HaluEval is a hallucination evaluation benchmark with 15k samples covering various domains such as finance and medicine. We show below that our in-house faithfulness evaluator, LYNX, outperforms GPT-4o and other closed- and open-source models on our benchmark.

| Model | HaluEval | RAGTruth | FinanceBench | DROP | CovidQA | PubMedQA | Overall |
| --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4o | 87.9% | 84.3% | 85.3% | 84.3% | 95.0% | 82.1% | 86.5% |
| GPT-4-Turbo | 86.0% | 85.0% | 82.2% | 84.8% | 90.6% | 83.5% | 85.0% |
| GPT-3.5-Turbo | 62.2% | 50.7% | 60.9% | 57.2% | 56.7% | 62.8% | 58.7% |
| Claude-3-Sonnet | 84.5% | 79.1% | 69.7% | 84.3% | 95.0% | 82.9% | 78.8% |
| Claude-3-Haiku | 68.9% | 78.9% | 58.4% | 84.3% | 95.0% | 82.9% | 69.0% |
| RAGAS Faithfulness | 70.6% | 75.8% | 59.5% | 59.6% | 75.0% | 67.7% | 66.9% |
| Mistral-Instruct-7B | 78.3% | 77.7% | 56.3% | 56.3% | 71.7% | 77.9% | 69.4% |
| Llama-3-Instruct-8B | 83.1% | 80.0% | 55.0% | 58.2% | 75.3% | 70.7% | 70.4% |
| Llama-3-Instruct-70B | 87.0% | 83.8% | 72.7% | 85.0% | 85.0% | 82.6% | 80.1% |
| LYNX (8B) | 85.7% | 80.0% | 72.5% | 96.3% | 96.3% | 85.2% | 82.9% |
| LYNX (70B) | 88.4% | 80.2% | 81.4% | 97.5% | 97.5% | 90.4% | 87.4% |

Custom Criteria

| Dataset | GPT-4 w/ Vanilla Prompting* | retrieval-hallucination-v2 |
| --- | --- | --- |
| World Knowledge QA (n=100)** | 70.0% | 90.0% |

*We use the following prompt for the analysis:

Your task is to score text on whether it passes the provided criteria. If yes, output 1. If no, output 0. You must respond with 1 or 0.

Input: {USER_INPUT}
Output: {MODEL_OUTPUT}
Label: {GOLD_ANSWER}
Criteria: {CRITERIA}

**Dataset constructed from customer-provided criteria and user queries
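For completeness, here is a minimal sketch of the vanilla-prompting baseline, wired up against the OpenAI chat API. Only the prompt text comes from the template above; the helper name, parsing, and call parameters are illustrative assumptions.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

VANILLA_PROMPT = """Your task is to score text on whether it passes the provided criteria. If yes, output 1. If no, output 0. You must respond with 1 or 0.

Input: {user_input}
Output: {model_output}
Label: {gold_answer}
Criteria: {criteria}"""

def vanilla_gpt4_judge(user_input: str, model_output: str, gold_answer: str, criteria: str) -> int:
    """Return 1 if GPT-4 judges the output as passing the criteria, otherwise 0."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": VANILLA_PROMPT.format(
            user_input=user_input, model_output=model_output,
            gold_answer=gold_answer, criteria=criteria)}],
        temperature=0,  # deterministic-ish scoring for a binary verdict
    )
    verdict = (response.choices[0].message.content or "").strip()
    return 1 if verdict.startswith("1") else 0
```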