Evaluator Reference Guide
Overview
Patronus supports a suite of high-quality evaluators available through the evaluation API and Python SDK. To use any of these evaluators, specify the evaluator name in the "evaluator" field of your evaluation request.
All evaluators return a binary PASS/FAIL result. When provided, raw scores are continuous and linearized on a 0-1 scale, where FAIL=0 and PASS=1.
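For example, a minimal request to the evaluation API might look like the sketch below. The endpoint URL, auth header name, and exact payload shape are assumptions for illustration only; the field names come from the table that follows. Consult the API reference for the authoritative schema.

```python
import os
import requests

# Illustrative sketch only: endpoint URL, auth header, and payload shape are assumptions.
API_URL = "https://api.patronus.ai/v1/evaluate"  # hypothetical endpoint

payload = {
    "evaluator": "toxicity",  # any evaluator name from the table below
    "evaluated_model_output": "You are a wonderful person and I hope you have a great day.",
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"X-API-KEY": os.environ["PATRONUS_API_KEY"]},  # header name assumed
    timeout=30,
)
response.raise_for_status()
print(response.json())  # expected to include a PASS/FAIL result and, for toxicity, a raw 0-1 score
```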
Evaluator | Definition | Required Fields | Context Window (tokens) | Raw Scores Provided |
---|---|---|---|---|
phi | Checks for protected health information (PHI), defined broadly as any information about an individual's health status or provision of healthcare. | evaluated_model_output | - | No |
pii | Checks for personally identifiable information (PII). PII is information that, in conjunction with other data, can identify an individual. | evaluated_model_output | - | No |
toxicity | Checks output for abusive and hateful messages. | evaluated_model_output | 1024 | Yes |
hallucination | Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. | evaluated_model_input, evaluated_model_output, evaluated_model_retrieved_context | 128k | Yes |
hallucination-small | Same as hallucination. | Same as hallucination | 128k | Yes |
lynx | Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. Uses Patronus Lynx to power the evaluation. See the research paper here. | evaluated_model_input, evaluated_model_output, evaluated_model_retrieved_context | 8k | Yes |
lynx-small | Same as lynx. | Same as lynx | 128k | Yes |
answer-relevance | Checks whether the answer is on-topic to the input question. Does not measure correctness. | evaluated_model_input, evaluated_model_output | 128k | Yes |
answer-relevance-small | Same as answer-relevance. | Same as answer-relevance | 128k | Yes |
context-relevance | Checks whether the retrieved context is on-topic to the input. | evaluated_model_input, evaluated_model_retrieved_context | 128k | Yes |
context-relevance-small | Same as context-relevance. | Same as context-relevance | 128k | Yes |
context-sufficiency | Checks whether the retrieved context is sufficient to generate an output similar in meaning to the label. The label should be the correct evaluation result. | evaluated_model_input, evaluated_model_retrieved_context, evaluated_model_output, evaluated_model_gold_answer | 128k | Yes |
context-sufficiency-small | Same as context-sufficiency. | Same as context-sufficiency | 128k | Yes |
nlp | Computes common NLP metrics on the output and label fields to measure semantic overlap and similarity. Currently supports bleu and rouge metrics. | evaluated_model_output, evaluated_model_gold_answer | - | Yes |
judge | Checks against user-defined criteria definitions, such as "MODEL OUTPUT should be free from brackets." LLM-based and uses active learning to improve the criteria definition based on user feedback. | No required fields | 128k | Yes |
judge-small | Same as judge. | No required fields | 128k | Yes |
system | Patronus-created evaluation metrics. | evaluated_model_input, evaluated_model_retrieved_context, evaluated_model_output, evaluated_model_gold_answer | 128k | Yes |
Evaluator families group together evaluators that perform the same function. Evaluators in the same family share evaluator profiles and accept the same set of inputs; however, performance and cost may differ within a family. Each evaluator family is described below.
Glider
Glider is a 3B parameter evaluator model that can be used to set up any custom evaluation. It performs evaluations based on pass criteria and score rubrics. To learn more about how to use GLIDER, check out our detailed documentation page.
Required Input Fields
No required fields
Optional Input Fields
evaluated_model_input
evaluated_model_output
evaluated_model_gold_answer
evaluated_model_retrieved_context
Aliases
Alias | Target |
---|---|
glider | glider-2024-12-11 |
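As a rough sketch of how a GLIDER evaluation might be assembled: the `pass_criteria` and `rubric` field names below are hypothetical placeholders for however your integration supplies the pass criteria and score rubric; only the `evaluated_model_*` field names are taken from this page. The payload would be posted to the evaluation endpoint as in the Overview sketch above.

```python
# Hypothetical GLIDER request payload. "pass_criteria" and "rubric" are placeholder
# field names; the real parameter names may differ -- see the GLIDER documentation page.
glider_payload = {
    "evaluator": "glider",
    "evaluated_model_input": "Summarize the refund policy in one sentence.",
    "evaluated_model_output": "Refunds are available within 30 days of purchase.",
    "pass_criteria": "The summary must be a single sentence and must not add facts.",
    "rubric": "1: off-topic or invented facts. 3: mostly faithful. 5: fully faithful, one sentence.",
}
```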
Judge
Judge evaluators perform evaluations based on pass criteria defined in natural language, such as "The MODEL OUTPUT should be free from brackets". Judge evaluators also support active learning, which means that you can improve their performance by annotating historical evaluations with thumbs up or thumbs down. To learn more about Judge Evaluators, visit their documentation page.
Required Input Fields
No required fields
Optional Input Fields
evaluated_model_input
evaluated_model_output
evaluated_model_gold_answer
evaluated_model_retrieved_context
Aliases
Alias | Target |
---|---|
judge | judge-large-2024-08-08 |
judge-large | judge-large-2024-08-08 |
judge-small | judge-small-2024-08-08 |
Evaluators
Evaluator ID | Description |
---|---|
judge-large-2024-08-08 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness. |
judge-small-2024-08-08 | The fastest and cheapest evaluator in the family. |
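Below is a sketch of a Judge evaluation with a natural-language pass criterion. How the criterion is attached to the request (here a `criteria` field) is an assumption; in practice Judge criteria are typically configured ahead of time and then refined through thumbs up / thumbs down annotations. The payload would be posted to the evaluation endpoint as in the Overview sketch.

```python
# Hypothetical Judge request payload. The "criteria" field name is an assumption;
# the evaluated_model_* fields are all optional for Judge, per the field list above.
judge_payload = {
    "evaluator": "judge",
    "criteria": "The MODEL OUTPUT should be free from brackets.",
    "evaluated_model_output": "The meeting is scheduled for [DATE] at the main office.",
}
# This output contains brackets, so a FAIL would be expected for this criterion.
```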
PHI (Protected Health Information)
Checks for protected health information (PHI), defined broadly as any information about an individual's health status or provision of healthcare.
Required Input Fields
evaluated_model_output
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
phi | phi-2024-05-31 |
Evaluators
Evaluator ID | Description |
---|---|
phi-2024-05-31 | PHI detection in model outputs |
NLP
Computes common NLP metrics on the output and label fields to measure semantic overlap and similarity. Currently supports the bleu and rouge metrics.
Required Input Fields
evaluated_model_output
evaluated_model_gold_answer
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
nlp | metrics-2024-05-16 |
Evaluators
Evaluator ID | Description |
---|---|
nlp-2024-05-16 | Computes NLP metrics like bleu and rouge |
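To build intuition for what the nlp evaluator reports, the snippet below computes BLEU and ROUGE locally with the nltk and rouge_score packages. This is only an illustration of the metrics themselves, not the evaluator's implementation; the hosted evaluator takes evaluated_model_output and evaluated_model_gold_answer and returns its scores directly.

```python
# pip install nltk rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer

gold_answer = "A cat was sitting on the mat"
model_output = "The cat sat on the mat"

# BLEU: n-gram precision overlap between the output and the gold answer.
bleu = sentence_bleu(
    [gold_answer.split()],
    model_output.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE: recall-oriented n-gram and longest-common-subsequence overlap.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
rouge = scorer.score(gold_answer, model_output)

print(f"BLEU: {bleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}")
print(f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```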
Exact Match
Checks that your model output is the same as the provided gold answer. Useful for checking boolean or multiple-choice model outputs.
Required Input Fields
evaluated_model_output
evaluated_model_gold_answer
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
exact-match | exact-match-2024-05-31 |
Evaluators
Evaluator ID | Description |
---|---|
exact-match-2024-05-31 | Checks that model output and gold answer are the same |
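Conceptually, this is plain string equality between the output and the gold answer, which is why it suits boolean or multiple-choice answers. The snippet below is a rough local equivalent; whether the hosted evaluator applies normalization such as trimming or case-folding is not specified here.

```python
def exact_match(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    """Rough local equivalent of the exact-match check (normalization behavior assumed)."""
    return evaluated_model_output.strip() == evaluated_model_gold_answer.strip()

print(exact_match("B", "B"))     # True  -> PASS
print(exact_match("True", "B"))  # False -> FAIL
```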
PII (Personally Identifiable Information)
Checks for personally identifiable information (PII). PII is information that, in conjunction with other data, can identify an individual.
Required Input Fields
evaluated_model_output
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
pii | pii-2024-05-31 |
Evaluators
Evaluator ID | Description |
---|---|
pii-2024-05-31 | PII detection in model outputs |
Answer Relevance
Checks whether the model output is on-topic to the input question. Does not measure correctness.
Required Input Fields
evaluated_model_input
evaluated_model_output
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
answer-relevance | answer-relevance-large-2024-07-23 |
answer-relevance-large | answer-relevance-large-2024-07-23 |
answer-relevance-small | answer-relevance-small-2024-07-23 |
Evaluators
Evaluator ID | Description |
---|---|
answer-relevance-large-2024-07-23 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness. |
answer-relevance-small-2024-07-23 | The fastest and cheapest evaluator in the family. |
Context Sufficiency
Checks whether the retrieved context is sufficient to generate an output similar in meaning to the label. The label should be the correct evaluation result.
Required Input Fields
evaluated_model_input
evaluated_model_gold_answer
evaluated_model_retrieved_context
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
context-sufficiency | context-sufficiency-large-2024-07-23 |
context-sufficiency-large | context-sufficiency-large-2024-07-23 |
context-sufficiency-small | context-sufficiency-small-2024-07-23 |
Evaluators
Evaluator ID | Description |
---|---|
context-sufficiency-large-2024-07-23 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness. |
context-sufficiency-small-2024-07-23 | The fastest and cheapest evaluator in the family. |
Hallucination
Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context.
Required Input Fields
evaluated_model_input
evaluated_model_output
evaluated_model_retrieved_context
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
hallucination | hallucination-large-2024-07-23 |
hallucination-large | hallucination-large-2024-07-23 |
hallucination-small | hallucination-small-2024-07-23 |
Evaluators
Evaluator ID | Description |
---|---|
hallucination-large-2024-07-23 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness. |
hallucination-small-2024-07-23 | The fastest and cheapest evaluator in the family. |
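The sketch below shows the field pattern for hallucination checks: the question, the model's answer, and the retrieved context the answer should be grounded in. The same three fields also drive lynx, and similar patterns apply to the other retrieval evaluators. As in the Overview sketch, the endpoint URL, header name, and payload shape are assumptions for illustration.

```python
import os
import requests

# Hypothetical request sketch; the field names are from this page, everything else is assumed.
payload = {
    "evaluator": "hallucination",
    "evaluated_model_input": "When was the company founded?",
    "evaluated_model_output": "The company was founded in 1998 in Berlin.",
    # Passed here as a list of passages; a single string may also be acceptable.
    "evaluated_model_retrieved_context": [
        "The company was founded in 2004 and is headquartered in Munich.",
    ],
}

response = requests.post(
    "https://api.patronus.ai/v1/evaluate",  # hypothetical endpoint
    json=payload,
    headers={"X-API-KEY": os.environ["PATRONUS_API_KEY"]},  # header name assumed
    timeout=60,
)
print(response.json())  # the answer contradicts the context, so a FAIL would be expected
```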
Lynx
Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. Uses Patronus Lynx to power the evaluation. See the research paper here.
Required Input Fields
evaluated_model_input
evaluated_model_output
evaluated_model_retrieved_context
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
lynx | lynx-large-2024-07-23 |
lynx-large | lynx-large-2024-07-23 |
lynx-small | lynx-small-2024-07-23 |
Evaluators
Evaluator ID | Description |
---|---|
lynx-large-2024-07-23 | The most sophisticated evaluator in the family, using a large, 70B parameter model to achieve advanced reasoning and high correctness. |
lynx-small-2024-07-23 | The cheapest evaluator in the family, using an 8B parameter model to generate reliable and quick evaluations. |
Context Relevance
Checks whether the retrieved context is on-topic or relevant to the input question.
Required Input Fields
evaluated_model_input
evaluated_model_retrieved_context
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
context-relevance | context-relevance-large-2024-07-23 |
context-relevance-large | context-relevance-large-2024-07-23 |
context-relevance-small | context-relevance-small-2024-07-23 |
Evaluators
Evaluator ID | Description |
---|---|
context-relevance-large-2024-07-23 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness. |
context-relevance-small-2024-07-23 | The fastest and cheapest evaluator in the family. |
Toxicity
Checks output for abusive and hateful messages.
Required Input Fields
evaluated_model_output
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
toxicity | toxicity-2024-05-16 |
Evaluators
Evaluator ID | Description |
---|---|
toxicity-2024-05-16 | Toxicity detection in model outputs |