Evaluator Families and Aliases

Evaluator families group evaluators that perform the same function. Evaluators in the same family share evaluator profiles and accept the same set of inputs, but they may differ in performance and cost.

Custom Evaluator Family

Custom evaluators perform evaluations based on pass criteria defined in natural language, such as "The MODEL OUTPUT should be free from brackets". Custom evaluators also support active learning, which means that you can improve their performance by annotating historical evaluations with thumbs up or thumbs down. To learn more about Custom Evaluators, visit their documentation page.

Required Input Fields

  • evaluated_model_input
  • evaluated_model_output

Optional Input Fields

  • evaluated_model_gold_answer
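
As a rough illustration of how these field requirements could be checked before submitting an evaluation, the sketch below validates a payload against the Custom family's required and optional fields. The field names come from the lists above; the helper name `check_custom_payload` and the dictionary payload shape are hypothetical and not part of any documented API.

```python
# Field names are taken from the lists above; everything else is illustrative.
REQUIRED_FIELDS = ("evaluated_model_input", "evaluated_model_output")
OPTIONAL_FIELDS = ("evaluated_model_gold_answer",)


def check_custom_payload(payload: dict) -> dict:
    """Report missing required fields and unrecognized fields (hypothetical helper)."""
    known = set(REQUIRED_FIELDS) | set(OPTIONAL_FIELDS)
    # A required field counts as missing when it is absent or empty.
    missing = [name for name in REQUIRED_FIELDS if not payload.get(name)]
    unknown = sorted(set(payload) - known)
    return {"missing": missing, "unknown": unknown}


report = check_custom_payload({
    "evaluated_model_input": "List three primary colors.",
    "evaluated_model_gold_answer": "red, yellow, blue",
})
print(report)
```

Here the payload omits evaluated_model_output, so the report lists it as missing; the optional gold answer is accepted without complaint.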

Aliases

Alias         Target
custom        custom-large-2024-08-08
custom-large  custom-large-2024-08-08
custom-small  custom-small-2024-08-08
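
One way to picture the alias table is as a plain lookup from alias to dated evaluator ID. The sketch below assumes an alias simply resolves to the target listed for it; the helper name `resolve_evaluator` is hypothetical, and the actual resolution happens on the service side.

```python
# Alias table copied from above; the helper name is hypothetical.
CUSTOM_ALIASES = {
    "custom": "custom-large-2024-08-08",
    "custom-large": "custom-large-2024-08-08",
    "custom-small": "custom-small-2024-08-08",
}


def resolve_evaluator(name: str) -> str:
    """Return the dated evaluator ID for an alias, or the name unchanged."""
    return CUSTOM_ALIASES.get(name, name)


print(resolve_evaluator("custom"))
print(resolve_evaluator("custom-small-2024-07-23"))
```

A name that is already a dated evaluator ID passes through unchanged. If aliases are retargeted when newer evaluators are released (an assumption; the tables here do not say so), passing the dated ID directly would keep results comparable over time.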

Evaluators

Evaluator ID             Description
custom-large-2024-08-08  The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness.
custom-small-2024-08-08  The fastest and cheapest evaluator in the family.
custom-large-2024-07-23
custom-small-2024-07-23
custom-large-2024-05-16
custom-small-2024-05-16

PHI (Protected Health Information) Evaluator Family

Checks for protected health information (PHI), defined broadly as any information about an individual's health status or provision of healthcare.

Required Input Fields

  • evaluated_model_output

Optional Input Fields

None

Aliases

Alias  Target
phi    phi-2024-05-31

Evaluators

Evaluator ID    Description
phi-2024-05-31  PHI detection in model outputs

Metrics Evaluator Family

Computes common NLP metrics on the model output and the label (gold answer) to measure overlap and similarity between them. Currently supports the bleu and rouge metrics.

Required Input Fields

  • evaluated_model_output
  • evaluated_model_gold_answer

Optional Input Fields

None

Aliases

Alias    Target
metrics  metrics-2024-05-16

Evaluators

Evaluator ID        Description
metrics-2024-05-16  Computes NLP metrics like bleu and rouge
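
To give a feel for what overlap metrics of this kind measure, here is a deliberately simplified unigram-overlap sketch. It is not the evaluator's implementation: real BLEU combines clipped n-gram precisions with a brevity penalty, and ROUGE has several variants. The function name `unigram_overlap` and the whitespace tokenization are assumptions made for illustration.

```python
from collections import Counter


def unigram_overlap(output: str, gold_answer: str) -> dict:
    """Unigram precision, recall, and F1 between an output and a gold answer."""
    out_counts = Counter(output.lower().split())
    gold_counts = Counter(gold_answer.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((out_counts & gold_counts).values())
    precision = overlap / max(sum(out_counts.values()), 1)
    recall = overlap / max(sum(gold_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


scores = unigram_overlap("the cat sat on the mat", "the cat is on the mat")
print(scores)
```

In this example five of the six tokens on each side overlap, so precision, recall, and F1 all come out to 5/6. Note that the score reflects shared wording only, not whether the two sentences mean the same thing.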

Exact Match Evaluator Family

Checks that the model output is the same as the provided gold answer. Useful for checking boolean or multiple-choice model outputs.

Required Input Fields

  • evaluated_model_output
  • evaluated_model_gold_answer

Optional Input Fields

None

Aliases

Alias        Target
exact-match  exact-match-2024-05-31

Evaluators

Evaluator ID            Description
exact-match-2024-05-31  Checks that model output and gold answer are the same
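
The check described in this family can be pictured as a strict string comparison. The sketch below assumes no trimming or case folding is applied; whether the actual evaluator normalizes its inputs is not stated here, and the function name `exact_match` is hypothetical.

```python
def exact_match(evaluated_model_output: str, evaluated_model_gold_answer: str) -> bool:
    """Strict string equality; no trimming or case folding is applied."""
    return evaluated_model_output == evaluated_model_gold_answer


# (model output, gold answer) pairs for multiple-choice and boolean questions.
pairs = [("B", "B"), ("A", "C"), ("true", "True")]
results = [exact_match(output, gold) for output, gold in pairs]
print(results)
```

Under this strict reading, "true" and "True" do not match, so it may be worth normalizing outputs and gold answers to a common format before evaluation.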

PII (Personally Identifiable Information) Evaluator Family

Checks for personally identifiable information (PII). PII is information that, in conjunction with other data, can identify an individual.

Required Input Fields

  • evaluated_model_output

Optional Input Fields

None

Aliases

Alias  Target
pii    pii-2024-05-31

Evaluators

Evaluator ID    Description
pii-2024-05-31  PII detection in model outputs

Retrieval Answer Relevance Evaluator Family

Checks whether the model output is on topic for the input question. Does not measure correctness.

Required Input Fields

  • evaluated_model_input
  • evaluated_model_output

Optional Input Fields

None

Aliases

Alias                             Target
retrieval-answer-relevance        retrieval-answer-relevance-large-2024-07-23
retrieval-answer-relevance-large  retrieval-answer-relevance-large-2024-07-23
retrieval-answer-relevance-small  retrieval-answer-relevance-small-2024-07-23

Evaluators

Evaluator ID                                 Description
retrieval-answer-relevance-large-2024-07-23  The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness.
retrieval-answer-relevance-small-2024-07-23  The fastest and cheapest evaluator in the family.
retrieval-answer-relevance-large-2024-05-31
retrieval-answer-relevance-small-2024-05-31

Retrieval Context Sufficiency Evaluator Family

Checks whether the retrieved context is sufficient to generate an output similar in meaning to the label (the gold answer). The label should be the correct, expected answer.

Required Input Fields

  • evaluated_model_input
  • evaluated_model_output
  • evaluated_model_gold_answer
  • evaluated_model_retrieved_context

Optional Input Fields

None

Aliases

Alias                                Target
retrieval-context-sufficiency        retrieval-context-sufficiency-large-2024-07-23
retrieval-context-sufficiency-large  retrieval-context-sufficiency-large-2024-07-23
retrieval-context-sufficiency-small  retrieval-context-sufficiency-small-2024-07-23

Evaluators

Evaluator ID                                    Description
retrieval-context-sufficiency-large-2024-07-23  The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness.
retrieval-context-sufficiency-small-2024-07-23  The fastest and cheapest evaluator in the family.
retrieval-context-sufficiency-large-2024-05-31
retrieval-context-sufficiency-small-2024-05-31

Retrieval Hallucination Evaluator Family

Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context.

Required Input Fields

  • evaluated_model_input
  • evaluated_model_output
  • evaluated_model_retrieved_context

Optional Input Fields

None

Aliases

Alias                          Target
retrieval-hallucination        retrieval-hallucination-large-2024-07-23
retrieval-hallucination-large  retrieval-hallucination-large-2024-07-23
retrieval-hallucination-small  retrieval-hallucination-small-2024-07-23

Evaluators

Evaluator ID                              Description
retrieval-hallucination-large-2024-07-23  The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness.
retrieval-hallucination-small-2024-07-23  The fastest and cheapest evaluator in the family.
retrieval-hallucination-large-2024-05-31
retrieval-hallucination-small-2024-05-31

Lynx Retrieval Hallucination Evaluator Family

Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. Uses Patronus Lynx to power the evaluation. See the Lynx research paper for details.

Required Input Fields

  • evaluated_model_input
  • evaluated_model_output
  • evaluated_model_retrieved_context

Optional Input Fields

None

Aliases

Alias                               Target
retrieval-hallucination-lynx        retrieval-hallucination-lynx-large-2024-07-23
retrieval-hallucination-lynx-large  retrieval-hallucination-lynx-large-2024-07-23
retrieval-hallucination-lynx-small  retrieval-hallucination-lynx-small-2024-07-23

Evaluators

Evaluator ID                                   Description
retrieval-hallucination-lynx-large-2024-07-23  The most sophisticated evaluator in the family, using a large, 70B parameter model to achieve advanced reasoning and high correctness.
retrieval-hallucination-lynx-small-2024-07-23  The cheapest evaluator in the family, using an 8B parameter model to generate reliable and quick evaluations.

Retrieval Context Relevance Evaluator Family

Checks whether the retrieved context is on-topic or relevant to the input question.

Required Input Fields

  • evaluated_model_input
  • evaluated_model_retrieved_context

Optional Input Fields

None

Aliases

Alias                              Target
retrieval-context-relevance        retrieval-context-relevance-large-2024-07-23
retrieval-context-relevance-large  retrieval-context-relevance-large-2024-07-23
retrieval-context-relevance-small  retrieval-context-relevance-small-2024-07-23

Evaluators

Evaluator ID                                  Description
retrieval-context-relevance-large-2024-07-23  The most sophisticated evaluator in the family, using advanced reasoning to achieve high correctness.
retrieval-context-relevance-small-2024-07-23  The fastest and cheapest evaluator in the family.
retrieval-context-relevance-large-2024-05-31
retrieval-context-relevance-small-2024-05-31

Toxicity Evaluator Family

Checks output for abusive and hateful messages.

Required Input Fields

  • evaluated_model_output

Optional Input Fields

None

Aliases

Alias     Target
toxicity  toxicity-2024-05-16

Evaluators

Evaluator ID         Description
toxicity-2024-05-16  Detects toxicity in model outputs