Hallucination Detection (Lynx)
Lynx is a state-of-the-art hallucination detection model capable of advanced reasoning on challenging real-world hallucination scenarios. Lynx is novel for several reasons:
- Trained on challenging perturbed data.
- Trained on multiple real-world domains, such as finance and medicine.
- Outperforms GPT-4o, Claude-3-Sonnet, and other open-source models.
The Hallucination Problem
Deploying RAG applications to production can be risky because of hallucinations, which occur when LLM outputs are not faithful to the retrieved contexts. Detecting hallucinations requires nuanced and complex reasoning, and LLMs hallucinate more often in specialized domains such as finance and medicine. Real-time, automated hallucination detection is essential for making RAG systems reliable, but it is challenging to achieve:
- Using GPT-4 and proprietary LLMs as judges to catch hallucinations is unreliable, costly and hard to scale.
- Human evaluation is expensive and not feasible in real-time AI workflows.
- Consequences of misinformation include incorrect medical diagnoses, poor financial advice, embarrassing outputs, and more.
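To make the failure mode concrete, the snippet below sketches a hypothetical unfaithful RAG output: the generated answer contradicts the retrieved context, so an automated detector should flag it. The example and its field names are illustrative, not taken from any Patronus dataset.

```python
# Hypothetical example of an unfaithful RAG output: the answer contradicts the
# retrieved context, so a hallucination detector should flag it as a failure.
example = {
    "question": "What dosage of aspirin was the patient prescribed?",
    "retrieved_context": "Discharge notes: the patient was prescribed 81 mg of aspirin daily.",
    "generated_answer": "The patient was prescribed 500 mg of aspirin daily.",
    "is_faithful": False,  # the 500 mg claim is not supported by the context
}
```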
Lynx
Lynx is the largest open-source model, and the first, to outperform GPT-4o, Claude-3-Sonnet, and Llama-3-70B on hallucination detection tasks. Lynx is available in 8B and 70B variants. The 8B-parameter variant has only a 3% performance gap compared to GPT-4o.
To use Lynx, you can make the following request:
```bash
curl --location 'https://api.patronus.ai/v1/evaluate' \
--header 'X-API-KEY: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
  "app": "default",
  "evaluators": [
    {
      "evaluator": "retrieval-hallucination-lynx",
      "explain_strategy": "always"
    }
  ],
  "evaluated_model_input": "Who are you?",
  "evaluated_model_output": "My name is Barry.",
  "evaluated_model_retrieved_context": ["I am John."],
  "tags": {"experiment": "question_answering"}
}'
```
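If you prefer to call the API from Python, the sketch below mirrors the curl request above using the `requests` library; the endpoint, headers, and payload fields are copied from that example, and nothing else about the API is assumed.

```python
import requests

# Mirrors the curl example above: evaluate one input/output/context triple with Lynx.
response = requests.post(
    "https://api.patronus.ai/v1/evaluate",
    headers={
        "X-API-KEY": "YOUR_API_KEY",
        "Content-Type": "application/json",
    },
    json={
        "app": "default",
        "evaluators": [
            {"evaluator": "retrieval-hallucination-lynx", "explain_strategy": "always"}
        ],
        "evaluated_model_input": "Who are you?",
        "evaluated_model_output": "My name is Barry.",
        "evaluated_model_retrieved_context": ["I am John."],
        "tags": {"experiment": "question_answering"},
    },
)
print(response.status_code)
print(response.json())
```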
HaluBench
To evaluate Lynx, we present HaluBench, a comprehensive hallucination evaluation benchmark consisting of 15k samples sourced from various real-world domains, with a high proportion of hard-to-detect, ambiguous hallucinations. Our experimental results show that Lynx outperforms GPT-4o, Claude-3-Sonnet, and other closed- and open-source LLM-as-a-judge models on HaluBench. We have released Lynx, HaluBench, and our evaluation code for public access.
Model | HaluEval | RAGTruth | FinanceBench | DROP | CovidQA | PubmedQA | Overall |
---|---|---|---|---|---|---|---|
GPT-4o | 87.9% | 84.3% | 85.3% | 84.3% | 95.0% | 82.1% | 86.5% |
GPT-4-Turbo | 86.0% | 85.0% | 82.2% | 84.8% | 90.6% | 83.5% | 85.0% |
GPT-3.5-Turbo | 62.2% | 50.7% | 60.9% | 57.2% | 56.7% | 62.8% | 58.7% |
Claude-3.5-Sonnet | 84.5% | 79.1% | 69.3% | 69.7% | 70.8% | 84.8% | 83.7% |
RAGAS Faithfulness | 70.6% | 75.8% | 59.5% | 59.6% | 75.0% | 67.7% | 66.9% |
Mistral-Instruct-7B | 78.3% | 77.7% | 56.3% | 56.3% | 71.7% | 77.9% | 69.4% |
Llama-3-Instruct-8B | 83.1% | 80.0% | 55.0% | 58.2% | 75.2% | 70.7% | 70.4% |
Llama-3-Instruct-70B | 87.0% | 83.8% | 72.7% | 69.4% | 85.0% | 82.6% | 80.1% |
Lynx (8B) | 87.3% | 79.9% | 75.6% | 77.5% | 96.9% | 88.9% | 84.3% |
Lynx (70B) | 88.4% | 80.2% | 81.4% | 86.4% | 97.5% | 90.4% | 87.4% |
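Since HaluBench is publicly released, you can inspect it yourself. The sketch below loads it with the Hugging Face `datasets` library; the repository id `PatronusAI/HaluBench` is an assumption about where the benchmark is hosted, so adjust it to the actual release location.

```python
from datasets import load_dataset

# Assumed repository id for the public HaluBench release; update if it is hosted elsewhere.
halubench = load_dataset("PatronusAI/HaluBench")

print(halubench)  # available splits and sample counts

first_split = list(halubench.keys())[0]
print(halubench[first_split][0])  # inspect one labeled example
```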