Description

Lynx

📢 Lynx v2.0 is an 8B State-of-the-Art RAG hallucination detection model 🚀

Lynx 2.0 was trained on long-context data from real-world domains like finance and medicine. Highlights:

  • Outperforms Claude-3.5-Sonnet as a judge on HaluBench by 2.2%
  • 3.4% higher accuracy than Lynx v1.1 on HaluBench
  • First hallucination guardrail trained on long context financial data
  • Detects 8 types of common hallucinations, including Coreference Errors, Calculation Errors, CoT hallucinations, and more

Lynx is a State-of-the-Art hallucination detection model capable of advanced reasoning on challenging real-world hallucination scenarios. Lynx is novel for several reasons:

  • Trained on challenging perturbed data.
  • Trained on multiple real-world domains, such as finance and medicine.
  • Outperforms GPT-4o, Claude-3-Sonnet, and other closed- and open-source LLM-as-a-judge models.

The Hallucination Problem

Deploying RAG applications to production can be risky due to hallucinations, which occur when LLM outputs are not faithful to the retrieved contexts. Detecting hallucinations requires nuanced and complex reasoning, and LLMs hallucinate more in specialized domains such as finance and medicine. Real-time, automated hallucination detection is essential for making RAG systems reliable, but it is challenging to achieve:

  • Using GPT-4 and other proprietary LLMs as judges to catch hallucinations is unreliable, costly, and hard to scale.
  • Human evaluation is expensive and not feasible in real-time AI workflows.
  • The consequences of misinformation include incorrect medical diagnoses, poor financial advice, embarrassing outputs, and more.

Lynx

Lynx is the largest open-source hallucination detection model and the first to outperform GPT-4o, Claude-3-Sonnet, and Llama-3-70B on hallucination tasks. Lynx is available in 8B and 70B variants.


To query Lynx for hallucination detection, simply run the following cURL request:

curl --location 'https://api.patronus.ai/v1/evaluate' \
--header 'X-API-KEY: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
  "app": "default",
  "evaluators": [
    {
      "evaluator": "lynx-small"
    }
  ],
  "evaluated_model_input": "Who are you?",
  "evaluated_model_output": "My name is Barry.",
  "evaluated_model_retrieved_context": ["I am John."],
  "tags": {"experiment": "question_answering"}
}'

In the request above, the retrieved context states "I am John" while the model output claims to be Barry, so the output is not faithful to the context and should be flagged as a hallucination. To query the 70B model, use lynx-large in the evaluator field, as in the example below.
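For reference, here is the same request pointed at the 70B evaluator. This is a minimal sketch assuming the rest of the request body stays identical; replace YOUR_API_KEY and the example inputs with your own.

curl --location 'https://api.patronus.ai/v1/evaluate' \
--header 'X-API-KEY: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
  "app": "default",
  "evaluators": [
    {
      "evaluator": "lynx-large"
    }
  ],
  "evaluated_model_input": "Who are you?",
  "evaluated_model_output": "My name is Barry.",
  "evaluated_model_retrieved_context": ["I am John."],
  "tags": {"experiment": "question_answering"}
}'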

Evaluator     Context Length
lynx-small    128,000 tokens
lynx-large    8,000 tokens
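Because evaluated_model_retrieved_context is an array, longer retrievals can be passed as multiple chunks, which is where the 128,000-token window of lynx-small is useful. The sketch below assumes the same request shape as above and uses illustrative financial-QA values (the company name, figures, and tag are hypothetical):

curl --location 'https://api.patronus.ai/v1/evaluate' \
--header 'X-API-KEY: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
  "app": "default",
  "evaluators": [
    {
      "evaluator": "lynx-small"
    }
  ],
  "evaluated_model_input": "What was the net income of Acme Corp in 2023?",
  "evaluated_model_output": "Acme Corp reported net income of $2.4 million in 2023.",
  "evaluated_model_retrieved_context": [
    "Acme Corp reported revenue of $10.2 million for fiscal year 2023.",
    "Operating expenses for the year were $7.1 million.",
    "Net income for 2023 was $2.4 million."
  ],
  "tags": {"experiment": "financial_qa"}
}'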

HaluBench

To evaluate Lynx, we present HaluBench, a comprehensive hallucination evaluation benchmark consisting of 15k samples sourced from various real-world domains, with a high proportion of hard-to-detect, ambiguous hallucinations. Our experiment results show that Lynx outperforms GPT-4o, Claude-3-Sonnet, and other closed- and open-source LLM-as-a-judge models on HaluBench. We have released Lynx, HaluBench, and our evaluation code for public access.

Model                 HaluEval  RAGTruth  FinanceBench  DROP   CovidQA  PubmedQA  Overall
GPT-4o                87.9%     84.3%     85.3%         84.3%  95.0%    82.1%     86.5%
GPT-4-Turbo           86.0%     85.0%     82.2%         84.8%  90.6%    83.5%     85.0%
GPT-3.5-Turbo         62.2%     50.7%     60.9%         57.2%  56.7%    62.8%     58.7%
Claude-3.5-Sonnet     84.5%     79.1%     69.3%         69.7%  70.8%    84.8%     83.7%
RAGAS Faithfulness    70.6%     75.8%     59.5%         59.6%  75.0%    67.7%     66.9%
Mistral-Instruct-7B   78.3%     77.7%     56.3%         56.3%  71.7%    77.9%     69.4%
Llama-3-Instruct-8B   83.1%     80.0%     55.0%         58.2%  75.2%    70.7%     70.4%
Llama-3-Instruct-70B  87.0%     83.8%     72.7%         69.4%  85.0%    82.6%     80.1%
Lynx (8B)             89.0%     85.67%    72.1%         76.9%  96.0%    85.3%     84.3%
Lynx (70B)            88.4%     80.2%     81.4%         86.4%  97.5%    90.4%     87.4%
