Evaluator Reference Guide
Overview
Patronus supports a suite of high-quality evaluators available in the evaluation API and Python SDK. To use any of these evaluators, simply specify the evaluator name in the "evaluator" field of your evaluation request.
All evaluators return a binary PASS/FAIL result. When provided, raw scores are continuous and linearized on a 0-1 scale, where **FAIL = 0** and **PASS = 1**.
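For example, a call to the evaluation API with the toxicity evaluator might look like the sketch below. The endpoint URL, auth header, and request/response shape are assumptions here; only the evaluator name and the evaluated_model_* field names are taken from the table that follows.

```python
# Minimal sketch of an evaluation API call with the "toxicity" evaluator.
# The endpoint URL and auth header name are assumptions; the evaluator name
# and evaluated_model_* field names come from the table below.
import os
import requests

response = requests.post(
    "https://api.patronus.ai/v1/evaluate",  # assumed endpoint
    headers={"X-API-KEY": os.environ["PATRONUS_API_KEY"]},  # assumed auth header
    json={
        "evaluators": [{"evaluator": "toxicity"}],
        "evaluated_model_output": "Thanks for reaching out! Happy to help.",
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())  # expect a PASS/FAIL result and, where provided, a 0-1 raw score
```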
Evaluator | Definition | Required Fields | Context Window (tokens) | Raw Scores Provided |
---|---|---|---|---|
phi | Checks for protected health information (PHI), defined broadly as any information about an individual's health status or provision of healthcare. | evaluated_model_output | - | No |
pii | Checks for personally identifiable information (PII). PII is information that, in conjunction with other data, can identify an individual. | evaluated_model_output | - | No |
toxicity | Checks output for abusive and hateful messages. | evaluated_model_output | 1024 | Yes |
hallucination | Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. | evaluated_model_input, evaluated_model_output, evaluated_model_retrieved_context | 128k | Yes |
hallucination-small | Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. | evaluated_model_input, evaluated_model_output, evaluated_model_retrieved_context | 128k | Yes |
lynx | Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. Uses Patronus Lynx to power the evaluation. See the research paper here. | evaluated_model_input, evaluated_model_output, evaluated_model_retrieved_context | 8k | Yes |
lynx-small | Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. Uses Patronus Lynx to power the evaluation. See the research paper here. | evaluated_model_input, evaluated_model_output, evaluated_model_retrieved_context | 8k | Yes |
answer-relevance | Checks whether the answer is on-topic to the input question. Does not measure correctness. | evaluated_model_input, evaluated_model_output | 128k | Yes |
answer-relevance-small | Checks whether the answer is on-topic to the input question. Does not measure correctness. | evaluated_model_input, evaluated_model_output | 128k | Yes |
context-relevance | Checks whether the retrieved context is on-topic to the input. | evaluated_model_input, evaluated_model_retrieved_context | 128k | Yes |
context-relevance-small | Checks whether the retrieved context is on-topic to the input. | evaluated_model_input, evaluated_model_retrieved_context | 128k | Yes |
context-sufficiency | Checks whether the retrieved context is sufficient to generate an output similar in meaning to the label. The label should be the correct evaluation result. | evaluated_model_input, evaluated_model_retrieved_context, evaluated_model_output, evaluated_model_gold_answer | 128k | Yes |
context-sufficiency-small | Checks whether the retrieved context is sufficient to generate an output similar in meaning to the label. The label should be the correct evaluation result. | evaluated_model_input, evaluated_model_retrieved_context, evaluated_model_output, evaluated_model_gold_answer | 128k | Yes |
nlp | Computes common NLP metrics on the output and label fields to measure semantic overlap and similarity. Currently supports bleu and rouge metrics. | evaluated_model_output, evaluated_model_gold_answer | - | Yes |
judge | Checks against user-defined criteria definitions, such as "MODEL OUTPUT should be free from brackets." LLM-based and uses active learning to improve the criteria definition based on user feedback. | No required fields | 128k | Yes |
judge-small | Checks against user-defined criteria definitions, such as "MODEL OUTPUT should be free from brackets." LLM-based and uses active learning to improve the criteria definition based on user feedback. | No required fields | 128k | Yes |
system | Patronus-created evaluation metrics. | evaluated_model_input, evaluated_model_retrieved_context, evaluated_model_output, evaluated_model_gold_answer | 128k | Yes |
Evaluator families group together evaluators that perform the same function. Evaluators in the same family share evaluator profiles and accept the same set of inputs. Importantly, the performance and cost of evaluators within a family may differ. Below, we describe each evaluator family.
Glider
Glider is a 3B parameter evaluator model that can be used to set up any custom evaluation. It performs the evaluation based on pass criteria and score rubrics. To learn more about how to use GLIDER, check out our detailed documentation page.
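As a rough illustration, a GLIDER request body might look like the sketch below. The pass_criteria and rubric keys are illustrative placeholders rather than the documented schema, and the transport is the same as in the overview sketch; see the GLIDER documentation page for the actual request fields.

```python
# Hypothetical GLIDER request body. The pass_criteria and rubric keys are
# illustrative only; consult the GLIDER documentation for the real schema.
payload = {
    "evaluators": [{"evaluator": "glider"}],
    "evaluated_model_input": "Summarize our refund policy in one sentence.",
    "evaluated_model_output": "Refunds are available within 30 days of purchase.",
    "pass_criteria": "The summary must mention the 30-day refund window.",  # assumed key
    "rubric": "1: window not mentioned ... 5: window stated exactly.",      # assumed key
}
```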
Required Input Fields
No required fields
Optional Input Fields
evaluated_model_input
evaluated_model_output
evaluated_model_gold_answer
evaluated_model_retrieved_context
Aliases
Alias | Target |
---|---|
glider | glider-2024-12-11 |
Judge
Judge evaluators perform evaluations based on pass criteria defined in natural language, such as "The MODEL OUTPUT should be free from brackets". Judge evaluators also support active learning, which means that you can improve their performance by annotating historical evaluations with thumbs up or thumbs down. To learn more about Judge Evaluators, visit their documentation page.
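As an illustration, a judge request body might look like the following sketch. How the user-defined criterion is attached to the evaluator entry is an assumption, not the documented schema; the transport is the same as in the overview sketch.

```python
# Hypothetical judge request body. The "criteria" key below is an assumption;
# only the evaluator name and evaluated_model_* fields are taken from this guide.
payload = {
    "evaluators": [{"evaluator": "judge", "criteria": "no-brackets"}],  # assumed "criteria" key
    "evaluated_model_output": "The answer is 42 [citation needed].",    # contains brackets -> expect FAIL
}
```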
Required Input Fields
No required fields
Optional Input Fields
evaluated_model_input
evaluated_model_output
evaluated_model_gold_answer
evaluated_model_retrieved_context
Aliases
Alias | Target |
---|---|
judge | judge-large-2024-08-08 |
judge-large | judge-large-2024-08-08 |
judge-small | judge-small-2024-08-08 |
Evaluators
Evaluator ID | Description |
---|---|
judge-large-2024-08-08 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness. |
judge-small-2024-08-08 | The fastest and cheapest evaluator in the family. |
PHI (Protected Health Information)
Checks for protected health information (PHI), defined broadly as any information about an individual's health status or provision of healthcare.
Required Input Fields
evaluated_model_output
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
phi | phi-2024-05-31 |
Evaluators
Evaluator ID | Description |
---|---|
phi-2024-05-31 | PHI detection in model outputs |
NLP
Computes common NLP metrics on the output and label fields to measure semantic overlap and similarity. Currently supports the bleu and rouge metrics.
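For example, an nlp request body might look like this sketch (overall request shape assumed as in the overview example):

```python
# Sketch: the nlp evaluator scores overlap between the output and the gold answer.
payload = {
    "evaluators": [{"evaluator": "nlp"}],
    "evaluated_model_output": "The cat sat on the mat.",
    "evaluated_model_gold_answer": "A cat was sitting on the mat.",  # high bleu/rouge overlap
}
```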
Required Input Fields
evaluated_model_output
evaluated_model_gold_answer
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
nlp | metrics-2024-05-16 |
Evaluators
Evaluator ID | Description |
---|---|
nlp-2024-05-16 | Computes NLP metrics like bleu and rouge |
Exact Match
Checks that the model output is the same as the provided gold answer. Useful for checking boolean or multiple-choice model outputs.
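For example, an exact-match request body might look like this sketch (overall request shape assumed as in the overview example):

```python
# Sketch: exact-match passes only when the output and gold answer are the same.
payload = {
    "evaluators": [{"evaluator": "exact-match"}],
    "evaluated_model_output": "B",
    "evaluated_model_gold_answer": "B",  # identical strings -> PASS
}
```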
Required Input Fields
evaluated_model_output
evaluated_model_gold_answer
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
exact-match | exact-match-2024-05-31 |
Evaluators
Evaluator ID | Description |
---|---|
exact-match-2024-05-31 | Checks that model output and gold answer are the same |
PII (Personally Identifiable Information)
Checks for personally identifiable information (PII). PII is information that, in conjunction with other data, can identify an individual.
Required Input Fields
evaluated_model_output
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
pii | pii-2024-05-31 |
Evaluators
Evaluator ID | Description |
---|---|
pii-2024-05-31 | PII detection in model outputs |
Answer Relevance
Checks whether the model output is on-topic to the input question. Does not measure correctness.
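For example, an answer-relevance request body might look like this sketch (overall request shape assumed as in the overview example):

```python
# Sketch: answer-relevance checks topicality of the output, not its correctness.
payload = {
    "evaluators": [{"evaluator": "answer-relevance"}],
    "evaluated_model_input": "What are your support hours?",
    "evaluated_model_output": "Our support team is available 9am-5pm on weekdays.",  # on-topic -> expect PASS
}
```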
Required Input Fields
evaluated_model_input
evaluated_model_output
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
answer-relevance | answer-relevance-large-2024-07-23 |
answer-relevance-large | answer-relevance-large-2024-07-23 |
answer-relevance-small | answer-relevance-small-2024-07-23 |
Evaluators
Evaluator ID | Description |
---|---|
answer-relevance-large-2024-07-23 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness. |
answer-relevance-small-2024-07-23 | The fastest and cheapest evaluator in the family. |
Context Sufficiency
Checks whether the retrieved context is sufficient to generate an output similar in meaning to the label. The label should be the correct evaluation result.
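For example, a context-sufficiency request body might look like this sketch, using the required fields listed below (overall request shape assumed as in the overview example):

```python
# Sketch: context-sufficiency checks whether the retrieved context is enough
# to produce an answer matching the gold answer.
payload = {
    "evaluators": [{"evaluator": "context-sufficiency"}],
    "evaluated_model_input": "What is the capital of Australia?",
    "evaluated_model_retrieved_context": "Canberra has been Australia's capital since 1913.",
    "evaluated_model_gold_answer": "Canberra",  # context supports this -> expect PASS
}
```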
Required Input Fields
evaluated_model_input
evaluated_model_gold_answer
evaluated_model_retrieved_context
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
context-sufficiency | context-sufficiency-large-2024-07-23 |
context-sufficiency-large | context-sufficiency-large-2024-07-23 |
context-sufficiency-small | context-sufficiency-small-2024-07-23 |
Evaluators
Evaluator ID | Description |
---|---|
context-sufficiency-large-2024-07-23 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness. |
context-sufficiency-small-2024-07-23 | The fastest and cheapest evaluator in the family. |
Hallucination
Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context.
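For example, a hallucination request body might look like this sketch, using the required fields listed below (overall request shape assumed as in the overview example):

```python
# Sketch: hallucination checks that the output is grounded in the retrieved context.
payload = {
    "evaluators": [{"evaluator": "hallucination"}],
    "evaluated_model_input": "When was Acme Corp founded?",
    "evaluated_model_retrieved_context": "Acme Corp was founded in 1999 in Ohio.",
    "evaluated_model_output": "Acme Corp was founded in 2005.",  # contradicts context -> expect FAIL
}
```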
Required Input Fields
evaluated_model_input
evaluated_model_output
evaluated_model_retrieved_context
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
hallucination | hallucination-large-2024-07-23 |
hallucination-large | hallucination-large-2024-07-23 |
hallucination-small | hallucination-small-2024-07-23 |
Evaluators
Evaluator ID | Description |
---|---|
hallucination-large-2024-07-23 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness. |
hallucination-small-2024-07-23 | The fastest and cheapest evaluator in the family. |
Lynx
Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. Uses Patronus Lynx to power the evaluation. See the research paper here.
Required Input Fields
evaluated_model_input
evaluated_model_output
evaluated_model_retrieved_context
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
lynx | lynx-large-2024-07-23 |
lynx-large | lynx-large-2024-07-23 |
lynx-small | lynx-small-2024-07-23 |
Evaluators
Evaluator ID | Description |
---|---|
lynx-large-2024-07-23 | The most sophisticated evaluator in the family, using a large, 70B parameter model to achieve advanced reasoning and high correctness. |
lynx-small-2024-07-23 | The cheapest evaluator in the family, using an 8B parameter model to generate reliable and quick evaluations. |
Context Relevance
Checks whether the retrieved context is on-topic or relevant to the input question.
Required Input Fields
evaluated_model_input
evaluated_model_retrieved_context
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
context-relevance | context-relevance-large-2024-07-23 |
context-relevance-large | context-relevance-large-2024-07-23 |
context-relevance-small | context-relevance-small-2024-07-23 |
Evaluators
Evaluator ID | Description |
---|---|
context-relevance-large-2024-07-23 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness. |
context-relevance-small-2024-07-23 | The fastest and cheapest evaluator in the family. |
Toxicity
Checks output for abusive and hateful messages.
Required Input Fields
evaluated_model_output
Optional Input Fields
None
Aliases
Alias | Target |
---|---|
toxicity | toxicity-2024-05-16 |
Evaluators
Evaluator ID | Description |
---|---|
toxicity-2024-05-16 | Toxicity detection in model outputs |