Benchmarking

We take the performance of our evaluators very seriously and continuously benchmark performance to uphold our claim of providing state-of-the-art evaluators. Below are a variety of benchmarks we have run our evaluators against. Some of these are open-source while others are our proprietary datasets. We provide as much detail about the composition and setup of the benchmarks as we can and are always happy to discuss these in more detail.

Answer Relevance

Dataset	answer_relevancy (ragas)	retrieval-answer-relevance-v2
WikiQA (n=150)*	33.3%	86.7%
Factual Knowledge (n=100)	89.0%	100.0%

*WikiQA is available on Hugging Face

Context Relevance

Dataset	context_relevancy (ragas)	retrieval-context-relevance-v1
Automotive QA (n=100)*	54.5%	83.5%

*Domain specific dataset constructed from real world automotive documents

Answer Hallucination (Faithfulness)

Dataset	GPT-4 with Rigorous Prompting*	retrieval-hallucination-v2
WikiQA -- ungrounded (n=50)	70.0%	90.0%
WikiQA -- grounded (n=50)	70.0%	86.0%
Hard Reasoning Scenarios (n=14)	64.0%	100.0%

*Some customers who come to us have a GPT-4 evaluator setup before realizing that performance is subpar. We routinely improve their evaluation performance by at least 20% by providing them with the best evaluators on the market.

	Braintrust (ragas)	LangChainStringEvaluator (GPT-4)	Galileo ChainPoll (GPT-3.5)*	retrieval-hallucination-v2
HaluEval (n=100)**	71.0%	47.0%	72.0%	87.0%
Factual Knowledge (n=100)	63.0%	77.0%	59.0%	93.0%

*Replication of ChainPoll with a Chain of Thought prompting based on Galileo's published work without access to their API

**HaluEval is available on Hugging Face.

HaluEval is a hallucination evaluation benchmark with 15k samples. It covers various domains such as finance and medicine. We show that our in-house, faithfulness evaluator, LYNX outperforms GPT-4o and other closed and open-sourced models on our benchmark.

Model	HaluEval	RAGTruth	FinanceBench	DROP	CovidQA	PubMedQA	Overall
GPT-4o	87.9%	84.3%	85.3%	84.3%	95.0%	82.1%	86.5%
GPT-4-Turbo	86.0%	85.0%	82.2%	84.8%	90.6%	83.5%	85.0%
GPT-3.5-Turbo	62.2%	50.7%	60.9%	57.2%	56.7%	62.8%	58.7%
Claude-3-Sonnet	84.5%	79.1%	69.7%	84.3%	95.0%	82.9%	78.8%
Claude-3-Haiku	68.9%	78.9%	58.4%	84.3%	95.0%	82.9%	69.0%
RAGAS Faithfulness	70.6%	75.8%	59.5%	59.6%	75.0%	67.7%	66.9%
Mistral-Instruct-7B	78.3%	77.7%	56.3%	56.3%	71.7%	77.9%	69.4%
Llama-3-Instruct-8B	83.1%	80.0%	55.0%	58.2%	75.3%	70.7%	70.4%
Llama-3-Instruct-70B	87.0%	83.8%	72.7%	85.0%	85.0%	82.6%	80.1%
LYNX (8B)	85.7%	80.0%	72.5%	96.3%	96.3%	85.2%	82.9%
LYNX (70B)	88.4%	80.2%	81.4%	97.5%	97.5%	90.4%	87.4%

Custom Criteria

Dataset	GPT-4 w/ Vanilla Prompting*	retrieval-hallucination-v2
World Knowledge QA (n=100)**	70.0%	90.0%

*We use the following prompt for the analysis:

Your task is to score text on whether it passes the provided criteria. If yes, output 1. If no, output 0. You must respond with 1 or 0.

Input: {USER_INPUT}
Output: {MODEL_OUTPUT}
Label: {GOLD_ANSWER}
Criteria: {CRITERIA}

**Dataset constructed from customer provided criteria and user queries