Evaluation API
Lynx 2.0 Guide
Lynx v2.0 is an 8B state-of-the-art hallucination detection model for RAG 🚀
Lynx 2.0 was trained on long-context data from real-world domains such as finance and medicine.
- Lynx v2.0 (8B) outperforms Claude-3.5-Sonnet as a judge on HaluBench by 2.2 percentage points in average accuracy
- Lynx v2.0 (8B) shows 3.4 percentage points higher average accuracy than Lynx v1.1 on HaluBench
- First hallucination guardrail trained on long-context financial data
- Detects 8 types of common hallucinations, including Coreference Errors, Calculation Errors, CoT hallucinations, and more
Hallucination Taxonomy
Lynx 2.0 detects 8 types of hallucinations:
Hallucination Type | Definition |
---|---|
Predicate Error | The predicate in the model output is inconsistent with the retrieved context. |
Entity Error | The subject/object of a model output is inconsistent with the retrieved context. |
Circumstance Error | The time, duration, or location of an event in the model output is inconsistent with the retrieved context. |
Coreference Error | A pronoun or reference in the model output has a wrong or nonexistent antecedent. |
Calculation Errors | The calculation to arrive at a numerical answer is incorrect. |
Chain of Thought Hallucinations | The chain of thought reasoning in a model output is unfaithful to the retrieved context. |
Partially grounded answers | Part of the answer is grounded in the retrieved context but the other part of the answer is not supported by the retrieved context. |
Unanswerable Questions | The question is not answerable using the retrieved context. |
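When tagging or aggregating evaluation results by category, it can help to have the taxonomy in code. The following is a minimal sketch, not part of the Patronus SDK: the `HallucinationType` enum and the `parse_label` helper are hypothetical names introduced here for illustration, and the string values simply mirror the table above.

```python
from enum import Enum


class HallucinationType(Enum):
    """Hypothetical labels mirroring the eight categories in the table above."""

    PREDICATE_ERROR = "Predicate Error"
    ENTITY_ERROR = "Entity Error"
    CIRCUMSTANCE_ERROR = "Circumstance Error"
    COREFERENCE_ERROR = "Coreference Error"
    CALCULATION_ERROR = "Calculation Errors"
    COT_HALLUCINATION = "Chain of Thought Hallucinations"
    PARTIALLY_GROUNDED = "Partially grounded answers"
    UNANSWERABLE = "Unanswerable Questions"


def parse_label(text: str) -> HallucinationType:
    """Map a free-text label to a category, ignoring case and outer whitespace."""
    wanted = text.strip().lower()
    for member in HallucinationType:
        if member.value.lower() == wanted:
            return member
    raise ValueError(f"Unknown hallucination type: {text!r}")


if __name__ == "__main__":
    print(parse_label("  entity error ").name)
```

A fixed enum like this keeps category names consistent across annotation files and reports; whether Lynx itself emits these labels is not stated here, so treat the mapping as an assumption.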
Benchmark Performance
We extend HaluBench with three additional datasets that capture the hallucination types listed above. QuALITY, a long-context dataset, measures the model's long-context performance. BUMP and SQuAD capture additional types of hallucinations.
Model | BUMP | CovidQA | DROP | PubmedQA | QuALITY | RAGTruth | FinanceBench | SQuAD | Average accuracy |
---|---|---|---|---|---|---|---|---|---|
meta-llama/Llama-3.2-3B-Instruct | 32.40% | 44.70% | 47.40% | 64.60% | 36.60% | 46.22% | 47.90% | 60.20% | 47.50% |
meta-llama/Llama-3.1-8B-Instruct | 64.20% | 83.00% | 65.30% | 80.50% | 54.60% | 76.67% | 59.70% | 86.00% | 71.26% |
GPT-4o mini | 73.00% | 87.20% | 80.30% | 84.20% | 59.60% | 81.88% | 81.60% | 81.80% | 78.71% |
Claude-3.5-Sonnet | 77.20% | 88.17% | 81.82% | 73.26% | 62.33% | 82.77% | 82.40% | 95.00% | 80.37% |
Lynx v1.1 (8B) | 75.00% | 96.90% | 77.80% | 88.90% | 61.00% | 80.11% | 76.70% | 76.80% | 79.15% |
Lynx v2.0 (8B) | 77.50% | 96.00% | 76.90% | 85.30% | 68.40% | 85.67% | 72.10% | 98.60% | 82.56% |
Lynx v1.0 (70B) | 71.00% | 97.50% | 86.40% | 90.40% | 63.20% | 80.22% | 81.40% | 87.60% | 82.22% |
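The Average accuracy column is consistent with an unweighted mean of the eight per-dataset scores. The sketch below recomputes it for three rows and derives the margins quoted at the top of this guide. The `scores` dictionary and `mean` helper are illustrative names; the numbers are copied from the table in its column order.

```python
# Per-dataset accuracy (%) in the table's column order:
# BUMP, CovidQA, DROP, PubmedQA, QuALITY, RAGTruth, FinanceBench, SQuAD
scores = {
    "Claude-3.5-Sonnet": [77.20, 88.17, 81.82, 73.26, 62.33, 82.77, 82.40, 95.00],
    "Lynx v1.1 (8B)": [75.00, 96.90, 77.80, 88.90, 61.00, 80.11, 76.70, 76.80],
    "Lynx v2.0 (8B)": [77.50, 96.00, 76.90, 85.30, 68.40, 85.67, 72.10, 98.60],
}


def mean(values):
    """Unweighted arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


averages = {model: mean(vals) for model, vals in scores.items()}

# Margins of Lynx v2.0 over each baseline, in percentage points.
margin = {
    model: averages["Lynx v2.0 (8B)"] - avg
    for model, avg in averages.items()
    if model != "Lynx v2.0 (8B)"
}

for model, avg in averages.items():
    print(f"{model}: {avg:.2f}%")
for model, diff in margin.items():
    print(f"Lynx v2.0 (8B) vs {model}: +{diff:.1f} points")
```

Note that the averages weight each dataset equally regardless of its size, and that Lynx v2.0 leads on average while trailing individual baselines on some datasets (for example DROP and FinanceBench).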
How to Use Lynx 2.0
Python SDK
Install the Patronus SDK: