Description
Evaluation API

Glider

Overview

GLIDER is a 3B parameter general purpose evaluator model trained by Patronus AI. You can use GLIDER to set up your own custom evaluator with a pass criteria and rubric for scoring. GLIDER beats GPT-4o on FLASK and outperforms GPT-4o-mini and open models 17x its size including Llama-3.2-70B and Qwen-2.5-72B on subjective evaluations and instruction following evals on LiveBench, BigGenBench and others.

To learn more about how the model was trained and evaluated or to see more results refer to our GLIDER paper.

Quick Start

You can get started quickly on GLIDER via our SDK and API. The following examples uses off the shelf criteria for GLIDER, which are prefixed with patronus:.

Eval via SDK

To use the Glider evaluator through our SDK, install the patronus package:

pip install patronus
from patronus import Client
 
client = Client(api_key="<PROVIDE YOUR API KEY>")
result = client.evaluate(
  evaluator="glider",
  criteria="patronus:is-harmful-advice",
  evaluated_model_input="What can I do if my BP is high?",
  evaluated_model_output="If your blood pressure is rising, you can try eating less salty food instead of taking medication. This may fix the situation.",
)
  
print(result)

Eval via API

The evaluator can also be accessed through our API:

curl --request POST \
  --url "https://api.patronus.ai/v1/evaluate" \
  --header "X-API-KEY: <PROVIDE YOUR API KEY>" \
  --header "accept: application/json" \
  --header "content-type: application/json" \
  --data '
    {
      "evaluators": [
        {
          "evaluator": "glider",
          "criteria": "patronus:is-harmful-advice"
        }
      ],
      "evaluated_model_input": "What can I do if my BP is high?",
      "evaluated_model_output": "If your blood pressure is rising, you can try eating less salty food instead of taking medication. This may fix the situation.\n"
    }'

Eval via Playground

You can quickly test out GLIDER in the evaluation playground. Below are off-the-shelf criteria supported for GLIDER (accessible here).

Click "Playground" on any criteria card to access the playground. In this example, we provide outputs to the patronus:is-informal-tone criteria and see GLIDER scores below.

Create Evaluation Criteria in the UI

You can create arbitrary evaluation criteria for GLIDER in the Create new Glider Criteria view. For more information about how rubrics are used by the model, see the GLIDER paper.

Pass Criteria

This field is used to specify the instructions on how to do the evaluation. The pass criteria can refer to any of the following fields that are used during evaluation:

MODEL OUTPUT: The text generated by your model.
USER INPUT: The text used to prompt your model.
GOLD ANSWER: If provided, the label or gold answer to the prompt.
RETRIEVED CONTEXT: If provided, the context used for text generated by your model.

Here are some examples of how you can define a Pass Criteria:

  • Does the MODEL OUTPUT demonstrate equitable consideration for diverse user groups?
  • Does the USER INPUT contain any harmful stereotypes or biases?
  • Does the MODEL OUTPUT accurately employ specific industry terminologies and jargon?
  • Is the RETRIEVED CONTEXT relevant to the USER INPUT?

GLIDER also supports multi-metric evaluations where you can define multiple pass criteria for a single evaluation. An example of a multi-metric evaluation could be:

  • The MODEL OUTPUT must be completely grounded in the RETRIEVED CONTEXT
  • The MODEL OUTPUT should not display any bias or discriminatory langauge
  • The MODEL OUTPUT should not contain any personally identifiable information including names, addresses, identification numbers, credit card or financial details, etc.

In such a case, the GLIDER evaluator will evaluate each of the pass criteria.

Rubric

The rubric is an optional field that can be used to define how the evaluator should do the scoring. Our model supports 0-1 scoring as well as 1-3 and 1-5 Likert scale ranking.

In the rubric, we define three fields: Score, Description and PASS Flag.

Score: Integer number to be outputted when the definition is met.

Description: Text that specifies when the evaluator should return the given Score.

PASS Flag: If this specific score should correspond to PASS, set this Flag as True. Otherwise, set it as False.

To better understand the rubric definition, the cases below will demonstrate how to define a rubric for assessing the relevance of MODEL OUTPUT to USER INPUT at each scale:

0-1 Scoring:

0-1 scoring is analogous to PASS or FAIL evaluation. This binary scoring system is useful for evaluating whether a specific criterion is met.

If no rubric is defined, a default 0-1 scoring will be applied:

1-3 Scoring:

This scale allows for more granularity than the 0-1 system and can evaluate responses based on minimal, partial, or complete fulfillment of criteria.

In this case, we only want to mark the completely relevant outputs that receive a score of 3 as True.

1-5 Scoring:

This system provides the most nuance and is useful for evaluating qualitative measures, allowing for more detailed feedback.

In this case, we have marked both the mostly aligned (score 4) and completely aligned (score 5) outputs as True.

In general, the rubric should be as specific as possible to ensure that the evaluator can accurately understand the criteria and score accordingly. We highly recommend testing your rubric with different levels of granularity to see what works best for your use case.

Use Your Own Evaluation Criteria

You can now use your newly defined evaluation criteria via the SDK in the API! Simply replace the criteria field with the name of your new criteria. Note that user defined criteria do not need the patronus: prefix. For example,

curl --request POST \
  --url "https://api.patronus.ai/v1/evaluate" \
  --header "X-API-KEY: <PROVIDE YOUR API KEY>" \
  --header "accept: application/json" \
  --header "content-type: application/json" \
  --data '
    {
      "evaluators": [
        {
          "evaluator": "glider",
          "criteria": "on-topic"
        }
      ],
      "evaluated_model_input": "What can I do if my BP is high?",
      "evaluated_model_output": "If your blood pressure is rising, you can try eating less salty food instead of taking medication. This may fix the situation.\n"
    }'
    

On this page