RAG Metrics
RAG Evaluation
Retrieval systems are complicated to test, evaluate, and debug because there are multiple possible points of failure. At Patronus, we use the following framework to disentangle that complexity and pinpoint where issues originate.
Retrieval Framework
Entities
- User Input: User question or statement.
- Retrieved Context: Context retrieved by the RAG system given the user input.
- Model Output: Answer generated by the model given the user input and retrieved context.
- Gold Answer: Reference answer to the user input.
Metrics
- Answer Relevance: Measures whether the model answer is relevant to the user input.
- Context Relevance: Measures whether the retrieved context is relevant to the user input.
- Answer Hallucination (Faithfulness): Measures whether the generated model output is faithful to the retrieved context.
The following metrics require access to a gold/reference answer:
- Answer Correctness: Measures whether the model output aligns with the gold answer.
- Context Correctness: Measures whether the entities/facts in the retrieved context agree with those in the gold answer.
- Context Sufficiency / Context Recall: Measures whether the retrieved context is sufficient to answer the user input according to the gold answer.
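To make the dependencies concrete, here is a small illustrative sketch (not part of the Patronus API; the metric and entity names are just descriptive labels) of which entities each metric compares. It also makes it easy to check whether a dataset record supports a given metric.

```python
# Illustrative only: descriptive labels, not Patronus API names.
METRIC_INPUTS = {
    # Reference-free metrics
    "answer_relevance":     ("user_input", "model_output"),
    "context_relevance":    ("user_input", "retrieved_context"),
    "answer_hallucination": ("retrieved_context", "model_output"),
    # Metrics that require a gold/reference answer
    "answer_correctness":   ("model_output", "gold_answer"),
    "context_correctness":  ("retrieved_context", "gold_answer"),
    "context_sufficiency":  ("user_input", "retrieved_context", "gold_answer"),
}

def supported_metrics(record: dict) -> list[str]:
    """Return the metrics that a record has enough fields to compute."""
    return [
        metric
        for metric, needed in METRIC_INPUTS.items()
        if all(record.get(field) is not None for field in needed)
    ]
```

For example, a record without a gold answer supports only the first three metrics.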
Patronus Evaluators
Patronus can currently detect the following problems:
- Answer Relevance: `retrieval-answer-relevance`
- Context Relevance: `retrieval-context-relevance`
- Context Sufficiency: `retrieval-context-sufficiency`
- Answer Hallucination (Faithfulness): `retrieval-hallucination`
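As a rough sketch, a single record can be scored against several of these evaluators in one request. The endpoint URL, authentication header, `evaluators` list format, and the `evaluated_model_input` / `evaluated_model_gold_answer` field names below are assumptions rather than confirmed API details; check the API reference for the exact request shape.

```python
import os

import requests

# Rough sketch only: the endpoint URL, header name, "evaluators" format, and
# some field names are assumptions about the API, not confirmed details.
response = requests.post(
    "https://api.patronus.ai/v1/evaluate",  # assumed endpoint
    headers={"X-API-KEY": os.environ["PATRONUS_API_KEY"]},  # assumed auth header
    json={
        "evaluators": [
            {"evaluator": "retrieval-answer-relevance"},
            {"evaluator": "retrieval-context-relevance"},
            {"evaluator": "retrieval-context-sufficiency"},
            {"evaluator": "retrieval-hallucination"},
        ],
        "evaluated_model_input": "How many vacation days do new employees get?",  # assumed field name
        "evaluated_model_retrieved_context": [
            "New employees accrue 15 paid vacation days per year.",
            "Unused vacation days roll over for up to one year.",
        ],
        "evaluated_model_output": "New employees get 15 paid vacation days per year.",
        "evaluated_model_gold_answer": "15 paid vacation days per year.",  # assumed field name
    },
)
print(response.json())
```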
Hallucination Examples
We define a hallucination as a model output that is not faithful to the retrieved context. Here are a few examples from the retrieval_hallucination evaluator family to show what this means in practice:
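The original example payload is not reproduced on this page, so the sketch below is an illustrative reconstruction consistent with the reason given next; the question wording, the context sentence, and the population figures are made up for this example.

```python
# Illustrative reconstruction: the exact wording and figures are made up.
example = {
    "evaluated_model_input": "How much did the population of New York grow from 2022 to 2023?",
    "evaluated_model_retrieved_context": [
        "New Jersey's population grew by roughly 30,000 residents between 2022 and 2023.",
    ],
    "evaluated_model_output": "The population of New York grew by 1 million people from 2022 to 2023.",
}
```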
- Expected response: FAIL
- Reason: The answer states that the population growth of New York from 2022 to 2023 was 1 million people. However, the retrieved context provides information about the population of _New Jersey_. Hence, the output does not remain faithful to the retrieved context and is hallucinated!
This hallucination is easy to catch because the retrieved context doesn't mention New York while the output does. But slightly amending the evaluated_model_output in the example results in a hallucination that is much harder to catch:
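The amended output is likewise not shown on this page; one hypothetical way to make the mismatch subtler is to drop the explicit mention of New York, so there is no obvious entity clash even though the claimed figure is still unsupported by the context.

```python
# Hypothetical amendment, continuing the sketch above: no state is named, so
# the entity mismatch disappears, but the 1 million figure remains unsupported.
harder_example = {
    **example,  # same input and retrieved context as before
    "evaluated_model_output": "The population grew by 1 million people from 2022 to 2023.",
}
```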
- Expected response: FAIL
Here is an example where the output is faithful to the context:
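The values below are again illustrative (the figures are made up), with the retrieved context passed as two separate items.

```python
# Illustrative faithful example: every claim in the output is supported by the
# retrieved context, which is passed as two separate items.
faithful_example = {
    "evaluated_model_input": "How much did the population of New Jersey grow from 2022 to 2023?",
    "evaluated_model_retrieved_context": [
        "Census estimates put New Jersey's population at about 9.3 million in 2022.",
        "By mid-2023, the state's population had grown by roughly 30,000 residents.",
    ],
    "evaluated_model_output": "New Jersey's population grew by roughly 30,000 people from 2022 to 2023.",
}
```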
- Expected response: PASS
Note that in the evaluated_model_retrieved_context array, you can pass different pieces of context as separate items, as seen above.