
Evaluator Reference Guide

Overview

Patronus supports a suite of high-quality evaluators available through the evaluation API and Python SDK.

All evaluators produce a binary pass-or-fail outcome. When raw scores are available, they are normalized onto a 0–1 scale, where 0 indicates a fail result and 1 indicates a pass result.

Evaluator families group evaluators that perform the same function. Evaluators in the same family share evaluator profiles and accept the same set of inputs; note, however, that performance and cost may differ between evaluators within a family.

Each family is described below, organized into the evaluator categories available on the platform.
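To make the reference concrete, the sketch below shows one way to call an evaluator over HTTP with the input fields used throughout this guide. The endpoint path, header name, request body, and response shape are assumptions for illustration, not a verbatim API contract; see the Python SDK and API reference for the exact interface.

```python
# Minimal sketch of an evaluation request, assuming a POST endpoint that accepts
# an evaluator name plus the evaluated_model_* fields described in this guide.
import os

import requests

API_URL = "https://api.patronus.ai/v1/evaluate"  # assumed endpoint path

payload = {
    # "judge" is an alias that resolves to the newest judge-large version.
    "evaluators": [{"evaluator": "judge", "criteria": "patronus:is-concise"}],
    "evaluated_model_input": "What is the capital of France?",
    "evaluated_model_output": "The capital of France is Paris.",
}

response = requests.post(
    API_URL,
    json=payload,
    headers={"X-API-KEY": os.environ["PATRONUS_API_KEY"]},  # assumed header name
    timeout=30,
)
response.raise_for_status()

# Assumed response shape: one result per requested evaluator, each carrying a
# binary pass/fail outcome and, where available, a score normalized onto 0-1.
for result in response.json().get("results", []):
    print(result)
```

Later examples in this guide show only the payload; each would be posted the same way.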

Evaluator families

GLIDER

GLIDER is a 3B-parameter evaluator model that can be used to set up any custom evaluation. It performs evaluations based on pass criteria and score rubrics, and has a context window of 8K tokens. To learn more about how to use GLIDER, check out our detailed documentation page.

Required Input Fields

No required fields

Optional Input Fields

  • evaluated_model_input
  • evaluated_model_output
  • evaluated_model_gold_answer
  • evaluated_model_retrieved_context

Aliases

Alias | Target
glider | glider-2024-12-11
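For illustration, a GLIDER request might look like the payload below, posted as in the overview sketch. The criteria name `my-glider-rubric` is a hypothetical placeholder for a pass-criteria-and-rubric configuration you would define beforehand; all four input fields are optional, so include whichever you have.

```python
# Hypothetical GLIDER payload (POST it as in the overview sketch). The criteria
# name below is a placeholder for a pass criteria + score rubric defined by you.
payload = {
    "evaluators": [{"evaluator": "glider", "criteria": "my-glider-rubric"}],
    "evaluated_model_input": "Summarize the return policy in one sentence.",
    "evaluated_model_output": "Items can be returned within 30 days with a receipt.",
    # Optional grounding context; keep total inputs within GLIDER's 8K-token window.
    "evaluated_model_retrieved_context": "Returns are accepted within 30 days of purchase with proof of payment.",
}
```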

Judge

Judge evaluators perform evaluations based on pass criteria defined in natural language, such as "The MODEL OUTPUT should be free from brackets". Judge evaluators also support active learning, which means that you can improve their performance by annotating historical evaluations with thumbs up or thumbs down. To learn more about Judge Evaluators, visit their documentation page.

Required Input Fields

No required fields

Optional Input Fields

  • evaluated_model_input
  • evaluated_model_output
  • evaluated_model_gold_answer
  • evaluated_model_retrieved_context

Aliases

Alias | Target
judge | judge-large-2024-08-08
judge-small | judge-small-2024-08-08
judge-large | judge-large-2024-08-08

Evaluators

Evaluator ID | Description
judge-small-2024-08-08 | The fastest and cheapest evaluator in the family
judge-large-2024-08-08 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness
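As an example, the bracket criterion quoted above could be wired up as in the hypothetical payload below (posted as in the overview sketch). The criteria name `my-app:no-brackets` is a placeholder for a criterion you would create first.

```python
# Hypothetical judge payload using a custom natural-language pass criterion,
# e.g. "The MODEL OUTPUT should be free from brackets".
payload = {
    # "judge-small" is an alias that resolves to judge-small-2024-08-08.
    "evaluators": [{"evaluator": "judge-small", "criteria": "my-app:no-brackets"}],
    "evaluated_model_output": "The answer is 42.",
}
```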

Pre-built Judge Criteria

Patronus provides pre-built judge evaluator criteria developed by our research team and continually assessed for performance on real-world benchmarks.

Criteria | Description | Required Fields | Collection
answer-refusal | Checks that the model output refuses to answer the user input. Useful for handling prompt injections and off-topic user inputs. | evaluated_model_output | OWASP
fuzzy-match | Verifies that the model output is semantically similar to the provided gold answer. Useful when exact matches aren't expected but the meaning must align with the gold answer. | evaluated_model_output, evaluated_model_gold_answer | Output Validation
is-concise | Ensures the model output is clear and concise, especially valuable for chatbot use cases. | evaluated_model_output | Chatbot Behavior
is-helpful | Checks if the model maintains a helpful tone of voice, ideal for chatbot use cases. | evaluated_model_output | Chatbot Behavior
is-polite | Validates that the model output maintains politeness during conversations. | evaluated_model_output | Chatbot Behavior
no-apologies | Ensures the model output avoids unnecessary apologies. Useful for delivering clear and direct communication. | evaluated_model_output | Chatbot Behavior
no-openai-reference | Verifies that the model output does not reference being created by OpenAI. | evaluated_model_output | Chatbot Behavior, Harmful Outputs
is-code | Ensures the model output is valid code. Ideal for validating outputs from AI coding assistants. | evaluated_model_output | Output Format
is-csv | Confirms that the model output is a valid CSV document. Useful for parsing and ensuring expected CSV format. | evaluated_model_output | Output Format
is-json | Confirms that the model output is valid JSON. Useful for parsing and ensuring expected JSON format. | evaluated_model_output | Output Format
no-age-bias | Checks that the model is not biased based on ages mentioned in the user input. Ensures consistent outputs regardless of user age. | evaluated_model_output, evaluated_model_gold_answer | Harmful Outputs
no-gender-bias | Validates that the model output avoids gender stereotypes. Reduces risks of sexist or gendered outputs. | evaluated_model_output | Harmful Outputs
no-racial-bias | Validates that the model output avoids racial stereotypes. Reduces risks of producing racist outputs. | evaluated_model_output | Harmful Outputs
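A pre-built criterion is referenced by name in the same way as a custom one. The sketch below assumes the `patronus:` prefix used by the NLP criteria later in this guide also applies to pre-built judge criteria; treat the exact naming as an assumption.

```python
# Sketch: fuzzy-match needs both the model output and a gold answer.
payload = {
    "evaluators": [{"evaluator": "judge", "criteria": "patronus:fuzzy-match"}],
    "evaluated_model_output": "Paris is France's capital city.",
    "evaluated_model_gold_answer": "The capital of France is Paris.",
}
```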

Retrieval systems

Evaluator | Definition | Required Fields | Max Input Tokens | Raw Scores Provided
answer-relevance-small | Checks whether the answer is on-topic to the input question. Does not measure correctness. | evaluated_model_input, evaluated_model_output | 124k | Yes
answer-relevance-large | Checks whether the answer is on-topic to the input question. Does not measure correctness. | evaluated_model_input, evaluated_model_output | 124k | Yes
context-relevance-small | Checks whether the retrieved context is on-topic to the input. | evaluated_model_input, evaluated_model_retrieved_context | 124k | Yes
context-relevance-large | Checks whether the retrieved context is on-topic to the input. | evaluated_model_input, evaluated_model_retrieved_context | 124k | Yes
context-sufficiency-small | Checks whether the retrieved context is sufficient to generate an output similar in meaning to the label, where the label is the expected correct output. | evaluated_model_input, evaluated_model_retrieved_context, evaluated_model_output, evaluated_model_gold_answer | 124k | Yes
context-sufficiency-large | Checks whether the retrieved context is sufficient to generate an output similar in meaning to the label, where the label is the expected correct output. | evaluated_model_input, evaluated_model_retrieved_context, evaluated_model_output, evaluated_model_gold_answer | 124k | Yes
hallucination-small | Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. | evaluated_model_input, evaluated_model_output, evaluated_model_retrieved_context | 124k | Yes
hallucination-large | Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. | evaluated_model_input, evaluated_model_output, evaluated_model_retrieved_context | 124k | Yes
lynx-small | Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. Uses Patronus Lynx to power the evaluation; see the Lynx research paper. | evaluated_model_input, evaluated_model_output, evaluated_model_retrieved_context | 124k | Yes

Answer Relevance

Checks whether the model output is relevant to the input question. Does not evaluate factual correctness. Useful for prompt engineering on retrieval systems when trying to improve model output relevance to the user query.

Required Input Fields

  • evaluated_model_input
  • evaluated_model_output

Optional Input Fields

None

Evaluators

Evaluator ID | Description
answer-relevance-small-2024-07-23 | The fastest and cheapest evaluator in the family
answer-relevance-large-2024-07-23 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness

Aliases

Alias | Target
answer-relevance | answer-relevance-large-2024-07-23
answer-relevance-small | answer-relevance-small-2024-07-23
answer-relevance-large | answer-relevance-large-2024-07-23

Context Relevance

Checks whether the retrieved context is on-topic or relevant to the input question. Useful when checking the retriever performance of a retrieval system.

Required Input Fields

  • evaluated_model_input
  • evaluated_model_retrieved_context

Optional Input Fields

None

Evaluators

Evaluator ID | Description
context-relevance-small-2024-07-23 | The fastest and cheapest evaluator in the family
context-relevance-large-2024-07-23 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness

Aliases

Alias | Target
context-relevance | context-relevance-large-2024-07-23
context-relevance-small | context-relevance-small-2024-07-23
context-relevance-large | context-relevance-large-2024-07-23

Context Sufficiency

Checks whether the retrieved context is sufficient to generate an output similar to the gold label, where the gold label is the expected correct answer. Useful when checking the retriever performance of a retrieval system.

Required Input Fields

  • evaluated_model_input
  • evaluated_model_gold_answer
  • evaluated_model_retrieved_context

Optional Input Fields

None

Evaluators

Evaluator ID | Description
context-sufficiency-small-2024-07-23 | The fastest and cheapest evaluator in the family
context-sufficiency-large-2024-07-23 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness

Aliases

Alias | Target
context-sufficiency | context-sufficiency-large-2024-07-23
context-sufficiency-small | context-sufficiency-small-2024-07-23
context-sufficiency-large | context-sufficiency-large-2024-07-23
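A context-sufficiency request supplies the question, the gold answer, and the retrieved context, as in the sketch below (posted as in the overview example; the example values are hypothetical).

```python
# Sketch of a context-sufficiency payload: does the retrieved context contain
# enough information to produce the gold answer?
payload = {
    # Alias resolves to context-sufficiency-large-2024-07-23.
    "evaluators": [{"evaluator": "context-sufficiency"}],
    "evaluated_model_input": "When was the Eiffel Tower completed?",
    "evaluated_model_gold_answer": "It was completed in 1889.",
    "evaluated_model_retrieved_context": "The Eiffel Tower was completed in 1889 for the Exposition Universelle.",
}
```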

Hallucination (non-Lynx)

Checks whether the LLM response is hallucinating information that is not present in the context.

Required Input Fields

  • evaluated_model_input
  • evaluated_model_output
  • evaluated_model_retrieved_context

Optional Input Fields

None

Evaluators

Evaluator ID | Description
hallucination-small-2024-07-23 | The fastest and cheapest evaluator in the family
hallucination-large-2024-07-23 | The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness

Aliases

Alias | Target
hallucination | hallucination-large-2024-07-23
hallucination-small | hallucination-small-2024-07-23
hallucination-large | hallucination-large-2024-07-23

Hallucination (Lynx)

Checks whether the LLM response is hallucinating information that is not present in the context. Fine-tuned for hallucination detection in domain-specific contexts. See the Lynx research paper.

Required Input Fields

  • evaluated_model_input
  • evaluated_model_output
  • evaluated_model_retrieved_context

Optional Input Fields

None

Evaluators

Evaluator ID | Description
lynx-small-2024-07-23 | The best evaluator in the family, using an 8B-parameter model to generate reliable and quick evaluations

Aliases

Alias | Target
lynx | lynx-small-2024-07-23
lynx-small | lynx-small-2024-07-23
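A hallucination check with Lynx supplies all three required fields, as in the hypothetical payload below; a fail result indicates the output is not grounded in the retrieved context.

```python
# Sketch of a Lynx hallucination payload (POST it as in the overview sketch).
payload = {
    "evaluators": [{"evaluator": "lynx"}],  # alias resolves to lynx-small-2024-07-23
    "evaluated_model_input": "What side effects does the leaflet mention?",
    "evaluated_model_output": "It mentions drowsiness and nausea.",
    "evaluated_model_retrieved_context": "Possible side effects include drowsiness and nausea.",
}
```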

OWASP

Evaluator | Definition | Required Fields | Max Input Tokens | Raw Scores Provided
answer-refusal | Judge evaluator that checks whether "MODEL OUTPUT" refuses to answer the user input. Useful to check whether your model output is correctly handling prompt injections and off-topic user inputs. | evaluated_model_output | 124k | Yes
prompt-injection | Judge evaluator that checks whether "MODEL INPUT" contains prompt injections. | evaluated_model_input | 124k | Yes
toxicity | Checks output for abusive and hateful messages. | evaluated_model_output | 512 | Yes
pii | Checks for personally identifiable information (PII). PII is information that, in conjunction with other data, can identify an individual. | evaluated_model_output | 16k | No
no-openai-reference | Judge evaluator that checks whether "MODEL OUTPUT" contains a reference to OpenAI. | evaluated_model_output | 124k | Yes

NLP

Computes common NLP metrics on the output and gold answer fields to measure semantic overlap and similarity. Currently supports BLEU and ROUGE.

Required Input Fields

  • evaluated_model_output
  • evaluated_model_gold_answer

Optional Input Fields

None

Evaluators

Evaluator ID | Description
nlp-2024-05-16 | Computes common NLP metrics

Aliases

Alias | Target
nlp | nlp-2024-05-16

Criteria

Criteria Name | Description
patronus:bleu | Computes the BLEU score, a precision-focused metric.
patronus:rouge | Computes the ROUGE score, a recall-focused metric.
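Both metrics can be requested in one call by listing the evaluator twice with different criteria, as in the sketch below (the batching behavior and example values are assumptions for illustration).

```python
# Sketch: select the metric via the criteria name; both compare the model
# output against the gold answer and return a score normalized onto 0-1.
payload = {
    "evaluators": [
        {"evaluator": "nlp", "criteria": "patronus:bleu"},   # precision-focused
        {"evaluator": "nlp", "criteria": "patronus:rouge"},  # recall-focused
    ],
    "evaluated_model_output": "The cat sat on the mat.",
    "evaluated_model_gold_answer": "A cat was sitting on the mat.",
}
```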

Harmful outputs

Evaluator | Definition | Required Fields | Max Input Tokens | Raw Scores Provided
no-age-bias | Judge evaluator that checks whether "MODEL OUTPUT" is biased by ages mentioned in the user input. Useful to ensure your model outputs are consistent, even as user age fluctuates. | evaluated_model_output | 124k | Yes
no-racial-bias | Judge evaluator that checks whether "MODEL OUTPUT" contains racial stereotypes. Use to mitigate PR risk from racist model outputs. | evaluated_model_output | 124k | Yes
no-gender-bias | Judge evaluator that checks whether "MODEL OUTPUT" contains gender stereotypes. Use to mitigate PR risk from sexist or gendered model outputs. | evaluated_model_output | 124k | Yes
toxicity | Checks output for abusive and hateful messages. | evaluated_model_output | 512 | Yes

Toxicity

Checks output for abusive and hateful messages.

Required Input Fields

  • evaluated_model_output

Optional Input Fields

None

Evaluators

Evaluator ID | Description
toxicity-2024-10-27 | Detects toxicity using an internal Patronus model
toxicity-2024-07-23 | Detects toxicity using the Perspective API

Aliases

Alias | Target
toxicity | toxicity-2024-10-27
toxicity-perspective-api | toxicity-2024-07-23
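Because the `toxicity` alias floats to the newest version, pinning a dated evaluator ID keeps results comparable over time, e.g. staying on the Perspective-API-backed version as sketched below.

```python
# Sketch: pin a dated evaluator ID instead of the floating "toxicity" alias,
# which currently resolves to toxicity-2024-10-27.
payload = {
    "evaluators": [{"evaluator": "toxicity-2024-07-23"}],  # Perspective API version
    "evaluated_model_output": "Thanks for the question! Happy to help.",
}
```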

Chatbot behavior

Evaluator | Definition | Required Fields | Max Input Tokens | Raw Scores Provided
is-concise | Judge evaluator that checks whether "MODEL OUTPUT" is clear and concise. Especially valuable for chatbot use cases. | evaluated_model_output | 124k | Yes
no-apologies | Judge evaluator that checks that "MODEL OUTPUT" does not contain apologies. Useful if you want your model to communicate difficult messages clearly, uncluttered by apologies. | evaluated_model_output | 124k | Yes
is-polite | Judge evaluator that checks whether "MODEL OUTPUT" is polite in conversation. Very useful for chatbot use cases. | evaluated_model_output | 124k | Yes
is-helpful | Judge evaluator that checks whether "MODEL OUTPUT" maintains a helpful tone of voice. Very useful for chatbot use cases. | evaluated_model_output | 124k | Yes
no-openai-reference | Judge evaluator that checks that "MODEL OUTPUT" does not refer to being created by OpenAI. | evaluated_model_output | 124k | Yes

Output format

Evaluator | Definition | Required Fields | Max Input Tokens | Raw Scores Provided
is-code | Judge evaluator that checks whether "MODEL OUTPUT" is valid code. Use this to check that your code copilot or AI coding assistant is producing expected outputs. | evaluated_model_output | 124k | Yes
is-csv | Judge evaluator that checks whether "MODEL OUTPUT" is a valid CSV document. Useful if you're parsing your model outputs and want to ensure they are valid CSV. | evaluated_model_output | 124k | Yes
is-json | Judge evaluator that checks whether "MODEL OUTPUT" is valid JSON. Useful if you're parsing your model outputs and want to ensure they are valid JSON. | evaluated_model_output | 124k | Yes

Data leakage

Evaluator | Definition | Required Fields | Max Input Tokens | Raw Scores Provided
pii | Checks for personally identifiable information (PII). PII is information that, in conjunction with other data, can identify an individual. | evaluated_model_output | 16k | No
phi | Checks for protected health information (PHI), defined broadly as any information about an individual's health status or provision of healthcare. | evaluated_model_output | 16k | No

PII (Personally Identifiable Information)

Checks for personally identifiable information (PII). PII is information that, in conjunction with other data, can identify an individual.

Required Input Fields

  • evaluated_model_output

Optional Input Fields

None

Evaluators

Evaluator ID | Description
pii-2024-05-31 | PII detection in model outputs

Aliases

Alias | Target
pii | pii-2024-05-31

PHI (Protected Health Information)

Checks for protected health information (PHI), defined broadly as any information about an individual's health status or provision of healthcare.

Required Input Fields

  • evaluated_model_output

Optional Input Fields

None

Evaluators

Evaluator ID | Description
phi-2024-05-31 | PHI detection in model outputs

Aliases

Alias | Target
phi | phi-2024-05-31

Output validation

Evaluator | Definition | Required Fields | Max Input Tokens | Raw Scores Provided
exact-match | Checks if two strings are exactly identical. | evaluated_model_output, evaluated_model_gold_answer | 32k | No
fuzzy-match | Judge evaluator that checks that your model output is semantically similar to the provided gold answer. | evaluated_model_output, evaluated_model_gold_answer | 124k | Yes

Exact Match

Checks that your model output is the exact same string as the provided gold answer. Useful for checking boolean or multiple choice model outputs.

Required Input Fields

  • evaluated_model_output
  • evaluated_model_gold_answer

Optional Input Fields

None

Evaluators

Evaluator ID | Description
exact-match-2024-05-31 | Checks that model output and gold answer are the same

Aliases

Alias | Target
exact-match | exact-match-2024-05-31
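Since exact-match is a deterministic string comparison with no raw score, it suits boolean or multiple-choice outputs where the gold answer is canonical, as in the final sketch below (posted as in the overview example).

```python
# Sketch: exact-match passes only if the two strings are identical.
payload = {
    "evaluators": [{"evaluator": "exact-match"}],  # alias resolves to exact-match-2024-05-31
    "evaluated_model_output": "B",
    "evaluated_model_gold_answer": "B",
}
```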