Description

Lynx

📢 Lynx v2.0 is an 8B State-of-the-Art RAG hallucination detection model 🚀

Lynx 2.0 was trained on long-context data from real-world domains like finance and medicine. Highlights:

  • Outperforms Claude-3.5-Sonnet as a judge on HaluBench by 2.2%
  • 3.4% higher accuracy than Lynx v1.1 on HaluBench
  • First hallucination guardrail trained on long context financial data
  • Detects 8 types of common hallucinations, including Coreference Errors, Calculation Errors, CoT hallucinations, and more

Lynx is a State-of-the-Art hallucination detection model capable of advanced reasoning on challenging real-world hallucination scenarios. Lynx is novel for several reasons:

  • Trained on challenging perturbed data.
  • Trained on multiple real-world domains, such as finance and medicine.
  • Outperforms GPT-4o, Claude-3-Sonnet, and other closed- and open-source LLM-as-a-judge models.

The Hallucination Problem

Deploying RAG applications to production can be risky due to hallucinations, which occur when LLM outputs are not faithful to the retrieved contexts. Detecting hallucinations requires nuanced and complex reasoning, and LLMs hallucinate more in specialized domains such as finance and medicine. Real-time, automated hallucination detection is essential for making RAG systems reliable, but it is challenging to achieve:

  • Using GPT-4 and other proprietary LLMs as judges to catch hallucinations is unreliable, costly, and hard to scale.
  • Human evaluation is expensive and not feasible in real-time AI workflows.
  • The consequences of misinformation include incorrect medical diagnoses, poor financial advice, embarrassing outputs, and more.

Lynx

Lynx is the largest open-source hallucination detection model and the first to outperform GPT-4o, Claude-3-Sonnet, and Llama-3-70B on hallucination tasks. Lynx is available in 8B and 70B variants.


To query Lynx for hallucination detection, simply run the following cURL request:

curl --location 'https://api.patronus.ai/v1/evaluate' \
--header 'X-API-KEY: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
  "app": "default",
  "evaluators": [
    {
      "evaluator": "lynx-small"
    }
  ],
  "evaluated_model_input": "Who are you?",
  "evaluated_model_output": "My name is Barry.",
  "evaluated_model_retrieved_context": ["I am John."],
  "tags": {"experiment": "question_answering"}
}'

In the request above, the retrieved context states "I am John" while the model output claims to be Barry, so the output is not faithful to the context and should be flagged as a hallucination. To query the 70B model, use lynx-large in the evaluator field, as in the example below.
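For reference, here is the same request pointed at the 70B evaluator. This is a minimal sketch assuming the rest of the request body stays identical; replace YOUR_API_KEY and the example inputs with your own.

curl --location 'https://api.patronus.ai/v1/evaluate' \
--header 'X-API-KEY: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
  "app": "default",
  "evaluators": [
    {
      "evaluator": "lynx-large"
    }
  ],
  "evaluated_model_input": "Who are you?",
  "evaluated_model_output": "My name is Barry.",
  "evaluated_model_retrieved_context": ["I am John."],
  "tags": {"experiment": "question_answering"}
}'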

Evaluator     Context Length
lynx-small    128,000 tokens
lynx-large    8,000 tokens
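Because evaluated_model_retrieved_context is an array, longer retrievals can be passed as multiple chunks, which is where the 128,000-token window of lynx-small is useful. The sketch below assumes the same request shape as above and uses illustrative financial-QA values (the company name, figures, and tag are hypothetical):

curl --location 'https://api.patronus.ai/v1/evaluate' \
--header 'X-API-KEY: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
  "app": "default",
  "evaluators": [
    {
      "evaluator": "lynx-small"
    }
  ],
  "evaluated_model_input": "What was the net income of Acme Corp in 2023?",
  "evaluated_model_output": "Acme Corp reported net income of $2.4 million in 2023.",
  "evaluated_model_retrieved_context": [
    "Acme Corp reported revenue of $10.2 million for fiscal year 2023.",
    "Operating expenses for the year were $7.1 million.",
    "Net income for 2023 was $2.4 million."
  ],
  "tags": {"experiment": "financial_qa"}
}'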

HaluBench

To evaluate Lynx, we present HaluBench, a comprehensive hallucination evaluation benchmark consisting of 15k samples sourced from various real-world domains, with a high proportion of hard-to-detect, ambiguous hallucinations. Our experiment results show that Lynx outperforms GPT-4o, Claude-3-Sonnet, and other closed- and open-source LLM-as-a-judge models on HaluBench. We have released Lynx, HaluBench, and our evaluation code for public access.

Model                 HaluEval  RAGTruth  FinanceBench  DROP   CovidQA  PubmedQA  Overall
GPT-4o                87.9%     84.3%     85.3%         84.3%  95.0%    82.1%     86.5%
GPT-4-Turbo           86.0%     85.0%     82.2%         84.8%  90.6%    83.5%     85.0%
GPT-3.5-Turbo         62.2%     50.7%     60.9%         57.2%  56.7%    62.8%     58.7%
Claude-3.5-Sonnet     84.5%     79.1%     69.3%         69.7%  70.8%    84.8%     83.7%
RAGAS Faithfulness    70.6%     75.8%     59.5%         59.6%  75.0%    67.7%     66.9%
Mistral-Instruct-7B   78.3%     77.7%     56.3%         56.3%  71.7%    77.9%     69.4%
Llama-3-Instruct-8B   83.1%     80.0%     55.0%         58.2%  75.2%    70.7%     70.4%
Llama-3-Instruct-70B  87.0%     83.8%     72.7%         69.4%  85.0%    82.6%     80.1%
Lynx (8B)             89.0%     85.67%    72.1%         76.9%  96.0%    85.3%     84.3%
Lynx (70B)            88.4%     80.2%     81.4%         86.4%  97.5%    90.4%     87.4%
