Using Evaluators

Estimated Time: 7 mins

How to use Evaluators

Evaluators are at the heart of the Patronus Evaluation API. In this guide you'll learn about the range of Patronus evaluators and how to use them to score AI outputs against a broad set of criteria.

You can pass a list of evaluators in each evaluation API request; the evaluator name goes in the "evaluator" field. For example, run the following to query Lynx, our hallucination-detection evaluator:

curl --location 'https://api.patronus.ai/v1/evaluate' \
--header 'X-API-KEY: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
  "evaluators": [
    {
      "evaluator": "retrieval-hallucination-lynx",
      "explain_strategy": "always"
    }
  ],
  "evaluated_model_input": "Who are you?",
  "evaluated_model_output": "My name is Barry.",
  "evaluated_model_retrieved_context": ["I am John."],
  "tags": {"experiment": "quick_start_tutorial"}
}'
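The "explain_strategy" field is optional; setting it to "always", as here, asks the evaluator to return a natural-language explanation alongside every score.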

What Evaluators are supported?

Patronus supports a suite of high-quality evaluators in our evaluation API. To use any of them, put the evaluator family name in the "evaluator" field of the code snippet above.

| Evaluator Family | Definition | Required Fields | Score Type |
| --- | --- | --- | --- |
| phi | Checks for protected health information (PHI), defined broadly as any information about an individual's health status or provision of healthcare. | evaluated_model_output | Binary |
| pii | Checks for personally identifiable information (PII): information that, in conjunction with other data, can identify an individual. | evaluated_model_output | Binary |
| toxicity | Checks the output for abusive and hateful messages. | evaluated_model_output | Continuous |
| retrieval-hallucination-lynx | Checks whether the LLM response is hallucinatory, i.e. the output is not grounded in the provided context. | evaluated_model_input, evaluated_model_output, evaluated_model_retrieved_context | Binary |
| retrieval-answer-relevance | Checks whether the answer is on-topic to the input question. Does not measure correctness. | evaluated_model_input, evaluated_model_output | Binary |
| retrieval-context-relevance | Checks whether the retrieved context is on-topic to the input. | evaluated_model_input, evaluated_model_retrieved_context | Binary |
| retrieval-context-sufficiency | Checks whether the retrieved context is sufficient to generate an output similar in meaning to the label, where the label is the correct (gold) answer. | evaluated_model_input, evaluated_model_retrieved_context, evaluated_model_output, evaluated_model_gold_answer | Binary |
| metrics | Computes common NLP metrics on the output and label fields to measure semantic overlap and similarity. Currently supports BLEU and ROUGE. | evaluated_model_output, evaluated_model_gold_answer | Continuous |
| custom | Checks against custom criteria definitions, such as "MODEL OUTPUT should be free from brackets." LLM-based; uses active learning to improve the criteria definition based on user feedback. | evaluated_model_input, evaluated_model_output, evaluated_model_gold_answer | Binary |
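Because a request takes a list of evaluators, you can score the same output against several evaluator families in one call, as long as the request includes every field each evaluator requires. The following sketch (evaluator names taken from the table above; the input and output values are hypothetical) runs toxicity and metrics together, supplying evaluated_model_gold_answer because metrics requires it:

curl --location 'https://api.patronus.ai/v1/evaluate' \
--header 'X-API-KEY: YOUR_API_KEY' \
--header 'Content-Type: application/json' \
--data '{
  "evaluators": [
    {"evaluator": "toxicity"},
    {"evaluator": "metrics"}
  ],
  "evaluated_model_output": "The capital of France is Paris.",
  "evaluated_model_gold_answer": "Paris is the capital of France.",
  "tags": {"experiment": "quick_start_tutorial"}
}'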

If you'd like to create your own evaluator, read the following sections on custom evaluators.