Lynx v2.0 is an 8B-parameter, state-of-the-art hallucination detection model for RAG 🚀

Lynx 2.0 was trained on long-context data from real-world domains such as finance and medicine.

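Lynx v2.0 (8B) outperforms Claude-3.5-Sonnet as a judge on the extended HaluBench by 2.2 percentage points in average accuracy
Lynx v2.0 (8B) shows 3.4 percentage points higher average accuracy than Lynx v1.1 (8B) on the extended HaluBench
First hallucination guardrail trained on long-context financial data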
Detects 8 types of common hallucinations, including Coreference Errors, Calculation Errors, CoT hallucinations, and more

Hallucination Taxonomy

Lynx 2.0 detects eight types of hallucinations.

Hallucination Type	Definition
Predicate Error	The predicate in the model output is inconsistent with the retrieved context.
Entity Error	The subject/object of a model output is inconsistent with the retrieved context.
Circumstance Error	The time, duration, or location of an event in the model output is inconsistent with the retrieved context.
Coreference Error	A pronoun or reference in the model output has a wrong or nonexistent antecedent.
Calculation Errors	The calculation used to arrive at a numerical answer is incorrect.
Chain of Thought Hallucinations	The chain-of-thought reasoning in a model output is unfaithful to the retrieved context.
Partially Grounded Answers	Part of the answer is grounded in the retrieved context, but the rest is not supported by it.
Unanswerable Questions	The question is not answerable using the retrieved context.
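
For error analysis on your own evaluation sets, the taxonomy above can be encoded as a small enum. This is only an illustrative sketch: the `HallucinationType` class, its member names, and the sample labels are hypothetical and are not part of the Patronus SDK or of Lynx's output format.

```python
from collections import Counter
from enum import Enum


class HallucinationType(Enum):
    """The eight hallucination types from the taxonomy table (names are illustrative)."""

    PREDICATE_ERROR = "predicate_error"
    ENTITY_ERROR = "entity_error"
    CIRCUMSTANCE_ERROR = "circumstance_error"
    COREFERENCE_ERROR = "coreference_error"
    CALCULATION_ERROR = "calculation_error"
    CHAIN_OF_THOUGHT = "chain_of_thought"
    PARTIALLY_GROUNDED = "partially_grounded"
    UNANSWERABLE = "unanswerable"


# Hypothetical labels that a reviewer assigned to three failed examples.
labels = [
    HallucinationType.ENTITY_ERROR,
    HallucinationType.CALCULATION_ERROR,
    HallucinationType.ENTITY_ERROR,
]

# Tally how often each hallucination type occurs, most frequent first.
counts = Counter(labels)
for kind, n in counts.most_common():
    print(f"{kind.value}: {n}")
```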

Benchmark Performance

We extend HaluBench with three additional datasets that cover the hallucination types listed above. QuALITY, a long-context dataset, measures the model's long-context performance, while BUMP and SQuAD cover additional hallucination types.

Model	BUMP	CovidQA	DROP	PubmedQA	QuALITY	RAGTruth	FinanceBench	SQuAD	Average accuracy
meta-llama/Llama-3.2-3B-Instruct	32.40%	44.70%	47.40%	64.60%	36.60%	46.22%	47.90%	60.20%	47.50%
meta-llama/Llama-3.1-8B-Instruct	64.20%	83.00%	65.30%	80.50%	54.60%	76.67%	59.70%	86.00%	71.26%
GPT-4o mini	73.00%	87.20%	80.30%	84.20%	59.60%	81.88%	81.60%	81.80%	78.71%
Claude-3.5-Sonnet	77.20%	88.17%	81.82%	73.26%	62.33%	82.77%	82.40%	95.00%	80.37%
Lynx v1.1 (8B)	75.00%	96.90%	77.80%	88.90%	61.00%	80.11%	76.70%	76.80%	79.15%
Lynx v2.0 (8B)	77.50%	96.00%	76.90%	85.30%	68.40%	85.67%	72.10%	98.60%	82.56%
Lynx v1.0 (70B)	71.00%	97.50%	86.40%	90.40%	63.20%	80.22%	81.40%	87.60%	82.22%
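
The headline gaps can be checked directly from the table. The sketch below assumes that the "Average accuracy" column is the unweighted mean of the eight dataset columns, which matches the three rows used here to two decimal places. The per-dataset numbers are copied from the table; the variable names are illustrative.

```python
# Per-dataset accuracies (%) copied from the table above, in column order:
# BUMP, CovidQA, DROP, PubmedQA, QuALITY, RAGTruth, FinanceBench, SQuAD
scores = {
    "Claude-3.5-Sonnet": [77.20, 88.17, 81.82, 73.26, 62.33, 82.77, 82.40, 95.00],
    "Lynx v1.1 (8B)": [75.00, 96.90, 77.80, 88.90, 61.00, 80.11, 76.70, 76.80],
    "Lynx v2.0 (8B)": [77.50, 96.00, 76.90, 85.30, 68.40, 85.67, 72.10, 98.60],
}


def average(values):
    """Unweighted mean across datasets (assumed to be how the table averages)."""
    return sum(values) / len(values)


averages = {model: average(vals) for model, vals in scores.items()}
for model, avg in averages.items():
    print(f"{model}: {avg:.2f}%")

# Differences in average accuracy, in percentage points.
gap_vs_claude = averages["Lynx v2.0 (8B)"] - averages["Claude-3.5-Sonnet"]
gap_vs_v11 = averages["Lynx v2.0 (8B)"] - averages["Lynx v1.1 (8B)"]
print(f"Lynx v2.0 vs Claude-3.5-Sonnet: +{gap_vs_claude:.1f} points")
print(f"Lynx v2.0 vs Lynx v1.1: +{gap_vs_v11:.1f} points")
```

Note that these are differences in percentage points of average accuracy, not relative improvements.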

How to use Lynx 2.0

Python SDK

Install the Patronus SDK:

pip install patronus

Query Lynx via the SDK:

from patronus import Client

# Authenticate with your Patronus API key.
client = Client(api_key="<PROVIDE YOUR API KEY>")

# Ask Lynx whether the model output is supported by the retrieved context.
result = client.evaluate(
    evaluator="lynx-small",
    criteria="patronus:hallucination",
    # The user question sent to the model under evaluation.
    evaluated_model_input="What is the car insurance policy?",
    # The answer produced by the model under evaluation.
    evaluated_model_output="To even qualify for our car insurance policy, you need to have a valid driver's license that expires later than 2028.",
    # The retrieved passage the answer should be grounded in.
    evaluated_model_retrieved_context="To qualify for our car insurance policy, you need a way to show competence in driving which can be accomplished through a valid driver's license. You must have multiple years of experience and cannot be graduating from driving school before or on 2028.",
)

print(result)

cURL Request

curl --request POST \
  --url "https://api.patronus.ai/v1/evaluate" \
  --header "X-API-KEY: <PROVIDE YOUR API KEY>" \
  --header "accept: application/json" \
  --header "content-type: application/json" \
  --data '
    {
      "evaluators": [
        {
          "evaluator": "lynx-small",
          "criteria": "patronus:hallucination"
        }
      ],
      "evaluated_model_input": "What is the car insurance policy?",
      "evaluated_model_output": "To even qualify for our car insurance policy, you need to have a valid driver'\''s license that expires later than 2028.",
      "evaluated_model_retrieved_context": "To qualify for our car insurance policy, you need a way to show competence in driving which can be accomplished through a valid driver'\''s license. You must have multiple years of experience and cannot be graduating from driving school before or on 2028."
    }'
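
The same HTTP request can also be assembled in Python with only the standard library, which lets `json.dumps` handle quoting and escaping of characters such as apostrophes. This is a minimal sketch rather than an official client: the URL, headers, and field names are taken from the cURL request above, while the `build_request` helper is a hypothetical name introduced here. The response format is not described in this document, so the sketch stops at building the request.

```python
import json
import urllib.request

API_URL = "https://api.patronus.ai/v1/evaluate"


def build_request(api_key, question, answer, context):
    """Build (but do not send) a POST request mirroring the cURL example."""
    payload = {
        "evaluators": [
            {"evaluator": "lynx-small", "criteria": "patronus:hallucination"}
        ],
        "evaluated_model_input": question,
        "evaluated_model_output": answer,
        "evaluated_model_retrieved_context": context,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),  # JSON body as bytes
        headers={
            "X-API-KEY": api_key,
            "accept": "application/json",
            "content-type": "application/json",
        },
        method="POST",
    )


request = build_request(
    api_key="<PROVIDE YOUR API KEY>",
    question="What is the car insurance policy?",
    answer="To even qualify for our car insurance policy, you need to have a valid driver's license that expires later than 2028.",
    context="To qualify for our car insurance policy, you need a way to show competence in driving which can be accomplished through a valid driver's license. You must have multiple years of experience and cannot be graduating from driving school before or on 2028.",
)

# Sending needs a valid API key and network access, so it is left commented out.
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```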