GLIDER
GLIDER is a 3.8B-parameter custom evaluator model trained by Patronus AI. It can score any text input and its associated context against arbitrary user-defined criteria.
- It shows a higher Pearson correlation with human ratings than GPT-4o on FLASK.
- It outperforms prior judge models, achieving performance comparable to LLMs 17× its size.
- It supports fine-grained scoring, multilingual reasoning, and span highlighting.
We train and align a Phi-3.5-mini-instruct model on synthetic data spanning 183 research and industrial evaluation metrics across 685 domains of application, to show that Grading LLM Interactions and Decisions using Explainable Ranking (GLIDER) can improve evaluation performance. GLIDER can evaluate arbitrary inputs and produce rankings on 0-1, 1-3, and 1-5 Likert scales, along with high-quality reasoning chains and highlighted text spans that make failures easier to analyze.
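To make the three supported scales concrete, here is a minimal sketch of a container that checks a score against its scale. `EvalResult`, `SCALES`, and all field names are hypothetical and chosen only for illustration; they are not GLIDER's actual output schema.

```python
from dataclasses import dataclass, field

# Scales named in the text above; the string keys are illustrative labels.
SCALES = {"0-1": (0, 1), "1-3": (1, 3), "1-5": (1, 5)}


@dataclass
class EvalResult:
    """Hypothetical container for one evaluation (not GLIDER's real schema)."""

    scale: str
    score: int
    reasoning: str
    highlights: list[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Reject scales and scores that fall outside the supported ranges.
        if self.scale not in SCALES:
            raise ValueError(f"unknown scale: {self.scale!r}")
        low, high = SCALES[self.scale]
        if not low <= self.score <= high:
            raise ValueError(f"score {self.score} outside {self.scale} scale")

    def normalized(self) -> float:
        """Map the score onto [0, 1] so different scales can be compared."""
        low, high = SCALES[self.scale]
        return (self.score - low) / (high - low)
```

Normalizing to [0, 1] is a design choice of this sketch; it is only useful when results scored on different scales need to be aggregated.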
Training
We use a mixture of synthetic datasets and openly available datasets to train the model.
We created a detailed taxonomy of potential metrics and their definitions, spanning 685 unique domains, from finance, medicine, and technology to more creative areas such as art, fashion, and film. To ensure that the model does not overfit to a single evaluation field such as user input or model output, we diversify the dataset by forcing arbitrary associations between random tag names representing inputs, outputs, contexts, and gold answers. The resulting pointwise data is used in the RLAIF alignment phase, where training lowers the probabilities of rejected samples and raises the probabilities of chosen samples.
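The tag randomization described above could look roughly like the following. This is a sketch under stated assumptions: `TAG_POOL`, `build_sample`, and the XML-style wrapping are hypothetical and are not taken from the actual data pipeline, which is described in the paper.

```python
import random

# Hypothetical pool of tag names; the real pipeline's names are not given here.
TAG_POOL = [
    "USER_QUERY", "DOCUMENT", "MODEL_RESPONSE", "REFERENCE",
    "PASSAGE", "ANSWER", "NOTES", "TEXT",
]


def build_sample(
    fields: dict[str, str], rng: random.Random
) -> tuple[str, dict[str, str]]:
    """Wrap each field in a randomly chosen tag so roles are not tied to names.

    `fields` maps a semantic role (e.g. "input", "output") to its text.
    Returns the tagged body and the role -> tag assignment that was used.
    """
    tags = rng.sample(TAG_POOL, k=len(fields))  # one distinct tag per role
    assignment = dict(zip(fields, tags))
    body = "\n".join(
        f"<{assignment[role]}>\n{text}\n</{assignment[role]}>"
        for role, text in fields.items()
    )
    return body, assignment


if __name__ == "__main__":
    sample = {
        "input": "What is the capital of France?",
        "output": "Paris.",
        "context": "France is a country in Western Europe.",
        "gold": "Paris",
    }
    body, assignment = build_sample(sample, random.Random(0))
    print(assignment)
    print(body)
```

Because the same role can appear under a different tag in every sample, a model trained on such data has to read the rubric to learn which tag to evaluate instead of relying on a fixed field name.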
We chose Phi-3.5-mini-instruct as our base model and performed supervised fine-tuning (SFT) for one epoch. We then aligned the model with the APO zero loss, since our synthetic data contains noise and APO has been shown to be more robust in such situations. In addition to this preference optimization loss, we added a standard cross-entropy term so that the model continues to capture data nuances during the alignment phase.
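For intuition, the combined objective can be sketched for a single preference pair as follows. This assumes the common formulation of APO zero (up to an additive constant) in terms of sequence log-probabilities under the policy and a frozen reference model. The values of `beta` and `ce_weight`, and the per-token normalization of the cross-entropy term, are illustrative assumptions and not the settings used to train GLIDER.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def apo_zero_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """APO zero for one pair: push chosen likelihood up, rejected down."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    loss_chosen = 1.0 - sigmoid(beta * chosen_logratio)  # falls as chosen rises
    loss_rejected = sigmoid(beta * rejected_logratio)  # falls as rejected drops
    return loss_chosen + loss_rejected


def combined_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    num_chosen_tokens: int,
    beta: float = 0.1,
    ce_weight: float = 1.0,  # hypothetical weighting of the cross-entropy term
) -> float:
    """Preference loss plus per-token cross-entropy on the chosen response."""
    apo = apo_zero_loss(
        policy_chosen_logp, policy_rejected_logp,
        ref_chosen_logp, ref_rejected_logp, beta,
    )
    cross_entropy = -policy_chosen_logp / num_chosen_tokens
    return apo + ce_weight * cross_entropy
```

When the policy equals the reference model, both log-ratios are zero and the preference term equals 1.0; it decreases as the chosen response becomes more likely and the rejected response less likely relative to the reference.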
To read more details about the data generation and training, refer to our paper: https://arxiv.org/abs/2412.14140
Results
GLIDER achieves state-of-the-art performance on the FLASK benchmark, beating GPT-4o while still performing close to models 17× its size on the Feedback Collection dataset.
Pearson correlation for various models on ranking tasks against human ratings
Model | BigGen Bench | FLASK | Feedback Bench | Summeval (Relevance) | Summeval (Consistency) | Summeval (Coherence) | Summeval (Fluency) | Average |
---|---|---|---|---|---|---|---|---|
GPT-4o | 0.614 | 0.610 | 0.810 | 0.312 | 0.550 | 0.419 | 0.522 | 0.548 |
GPT-4o-mini | 0.231 | 0.565 | 0.803 | 0.431 | 0.425 | 0.423 | 0.283 | 0.452 |
Claude-3.5-Sonnet | 0.592 | 0.592 | 0.812 | 0.464 | 0.620 | 0.497 | 0.496 | 0.582 |
Llama-3.1-70B | 0.580 | 0.572 | 0.792 | 0.391 | 0.497 | 0.527 | 0.391 | 0.536 |
Qwen-2.5-72B | 0.560 | 0.581 | 0.791 | 0.457 | 0.443 | 0.431 | 0.534 | 0.542 |
Phi-3.5-mini-instruct | 0.294 | 0.331 | 0.731 | 0.245 | 0.166 | 0.261 | 0.266 | 0.328 |
Prometheus-2-8x7B | 0.524 | 0.555 | 0.898 | 0.287 | 0.320 | 0.328 | 0.293 | 0.458 |
Prometheus-2-7B | 0.392 | 0.545 | 0.882 | 0.216 | 0.188 | 0.236 | 0.134 | 0.370 |
FlowAI Judge 3.8B | 0.460 | 0.400 | 0.787 | 0.286 | 0.358 | 0.351 | 0.309 | 0.422 |
GLIDER 3.8B (w/o highlights) | 0.490 | 0.570 | 0.759 | 0.367 | 0.418 | 0.433 | 0.321 | 0.480 |
GLIDER 3.8B | 0.604 ±0.005 | 0.615 ±0.01 | 0.774 ±0.01 | 0.398 ±0.02 | 0.522 ±0.01 | 0.462 ±0.01 | 0.365 ±0.03 | 0.534 |
Table 1: Bolded text indicates best overall and italicized text indicates best open-source judge model.
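For readers who want to reproduce this kind of comparison, the sketch below computes the Pearson correlation between a judge's scores and human ratings. The function name and the example scores are illustrative only and are not drawn from any of the benchmarks above.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up 1-5 ratings for five responses (illustrative, not benchmark data).
    human_ratings = [5, 3, 4, 1, 2]
    judge_scores = [4, 3, 5, 1, 2]
    print(f"Pearson r = {pearson(judge_scores, human_ratings):.3f}")
```

The Average column in the table is consistent with an unweighted mean of the seven per-benchmark correlations in each row.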
Performance (F1 score) comparison of models on pairwise ranking datasets
Model | Live Bench (IF) | HH Eval (Harm) | HH Eval (Help) | HH Eval (Hon) | MT Bench | Reward Bench (Chat) | Reward Bench (Chat-Hard) | Reward Bench (Safe) | Reward Bench (Reason) | Reward Bench (Average) | Average |
---|---|---|---|---|---|---|---|---|---|---|---|
GPT-4o | 0.661 | 0.983 | 0.898 | 0.831 | 0.813 | 0.950 | 0.697 | 0.861 | 0.893 | 0.850 | 0.843 |
GPT-4o-mini | 0.481 | 0.948 | 0.863 | 0.812 | 0.786 | 0.943 | 0.566 | 0.802 | 0.859 | 0.793 | 0.784 |
Claude-3.5-Sonnet | 0.632 | 0.944 | 0.915 | 0.868 | 0.807 | 0.618 | 0.827 | 0.898 | 0.821 | 0.849 | 0.814 |
Llama-3.1-70B | 0.651 | 0.913 | 0.898 | 0.898 | 0.802 | 0.577 | 0.800 | 0.877 | 0.802 | 0.826 | 0.802 |
Qwen-2.5-72B | 0.485 | 0.965 | 0.915 | 0.847 | 0.798 | 0.949 | 0.612 | 0.839 | 0.888 | 0.822 | 0.810 |
Phi-3.5-mini-instruct | 0.344 | 0.775 | 0.745 | 0.672 | 0.223 | 0.844 | 0.451 | 0.717 | 0.759 | 0.693 | 0.614 |
Prometheus-2-8x7B | - | 0.966 | 0.848 | 0.820 | 0.551 | 0.930 | 0.471 | 0.835 | 0.774 | 0.753 | - |
Prometheus-2-7B | - | 0.793 | 0.728 | 0.771 | 0.504 | 0.855 | 0.491 | 0.771 | 0.765 | 0.720 | - |
FlowAI Judge 3.8B | 0.592 | 0.896 | 0.779 | 0.734 | 0.549 | 0.895 | 0.572 | 0.786 | 0.657 | 0.728 | 0.719 |
GLIDER 3.8B (w/o highlights) | 0.542 | 0.946 | 0.829 | 0.783 | 0.577 | 0.835 | 0.577 | 0.797 | 0.904 | 0.778 | 0.754 |
GLIDER 3.8B | 0.654 ±0.04 | 0.946 ±0.003 | 0.830 ±0.005 | 0.778 ±0.002 | 0.628 ±0.06 | 0.876 ±0.005 | 0.575 ±0.002 | 0.797 ±0.01 | 0.888 ±0.01 | 0.784 ±0.006 | 0.776 |
Table 2: Bolded text indicates best overall and italicized text indicates best open-source judge model.
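As a rough guide to how an F1 score on pairwise data can be computed, the sketch below treats each comparison as a two-class prediction ("A" or "B" preferred) and macro-averages the per-class F1. The averaging scheme is an assumption made for illustration; this section does not state which variant was used for the numbers above.

```python
def macro_f1(gold: list[str], pred: list[str]) -> float:
    """Macro-averaged F1 over all labels appearing in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty sequences of equal length")
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Made-up preferences for four comparisons (illustrative only).
    human_choice = ["A", "B", "A", "B"]
    judge_choice = ["A", "B", "B", "B"]
    print(f"macro F1 = {macro_f1(human_choice, judge_choice):.3f}")
```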