Core Concepts πŸ’‘

Estimated Time: 2 minutes

Evaluators

Evaluators are automated infrastructure that executes evaluations to produce evaluation results, i.e., measurable assessments of an AI system's performance. For example, an evaluator scoring the relevance of retrieved chunks can be used to assess retriever quality in a RAG pipeline, and an evaluator detecting toxic outputs can be used to protect chatbot users from unsafe content.

  • Evaluators can be powered by heuristic systems, classifiers, carefully tuned LLM judges, or user-defined functions. Patronus provides a suite of state-of-the-art off-the-shelf Evaluators covering a broad set of categories, benchmarked for accuracy and human alignment.
  • Evaluators are versioned by the date they are released, so judge-large-2024-10-14 is the large version of the judge evaluator released on 10/14/2024 (the sketch after this list breaks the naming pattern down).
  • Some evaluators have different size and performance tiers, which you can tailor to use cases based on the task complexity and speed requirements. For example, context-relevance-large produces evaluation results with higher latency and supports more complex reasoning, compared to context-relevance-small.
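
To make the naming convention concrete, here is a minimal sketch that splits a versioned evaluator identifier into its family, size tier, and release date. The helper and the specific identifiers are illustrative combinations of names and dates that appear on this page; they are not part of the Patronus API.

```python
from datetime import date

# Size tiers named on this page; treat this as an illustrative, non-exhaustive list.
KNOWN_SIZES = {"small", "large"}

def parse_evaluator_id(evaluator_id: str) -> dict:
    """Split a versioned id such as 'judge-large-2024-10-14' into its family,
    size tier, and release date (helper for illustration only)."""
    parts = evaluator_id.split("-")
    released = date(int(parts[-3]), int(parts[-2]), int(parts[-1]))
    name_parts = parts[:-3]
    size = name_parts[-1] if name_parts[-1] in KNOWN_SIZES else None
    family = "-".join(name_parts[:-1]) if size else "-".join(name_parts)
    return {"family": family, "size": size, "released": released}

print(parse_evaluator_id("judge-large-2024-10-14"))
# {'family': 'judge', 'size': 'large', 'released': datetime.date(2024, 10, 14)}
print(parse_evaluator_id("context-relevance-small-2024-05-16"))
# {'family': 'context-relevance', 'size': 'small', 'released': datetime.date(2024, 5, 16)}
```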

Judge Evaluators

Judge evaluators are LLM-powered evaluators that execute evaluations against user-defined criteria. A Judge evaluator must be instantiated with a criterion. For example, a no-apologies judge evaluator flags outputs that contain apologies, and an is-concise judge evaluator scores the conciseness of the output.

Judge Evaluator Criterion

  • Judge evaluators are tuned and aligned to user preferences with specified criteria. A criterion contains a natural language instruction that defines the evaluation outcome. You can create and manage Judge evaluators in the Patronus platform to tailor evaluations to your use case.
  • Patronus also supports in-house judge criteria applicable to a variety of use cases. These Patronus-managed criteria carry the patronus: prefix and provide broad evaluation coverage, including tone of voice, format, safety, and common constraints (the request sketch after this list pairs a judge evaluator with one of them).
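
As a rough sketch of how a judge evaluator and a criterion might be used together, the example below calls the POST /v1/evaluate endpoint described later on this page. The base URL, the X-API-KEY header, and the payload layout beyond the snapshot fields documented below are assumptions for illustration; consult the API reference for the exact request schema.

```python
import os

import requests

# Assumed base URL and auth header; check the API reference for the exact values.
PATRONUS_API = "https://api.patronus.ai"
HEADERS = {"X-API-KEY": os.environ["PATRONUS_API_KEY"]}

# Pair a judge evaluator with a Patronus-managed criterion
# (the "evaluators"/"criteria" field names are assumptions).
payload = {
    "evaluators": [{"evaluator": "judge-large", "criteria": "patronus:is-concise"}],
    "evaluated_model_input": "Summarize our refund policy in one sentence.",
    "evaluated_model_output": "Refunds are accepted within 30 days of purchase with a receipt.",
}

response = requests.post(f"{PATRONUS_API}/v1/evaluate", json=payload, headers=HEADERS)
response.raise_for_status()
print(response.json())
```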

Evaluator Family

An evaluator family is a grouping of Evaluators that perform the same kind of evaluation, typically differing only in size tier and release version.

  • For example, the judge family groups together all judge evaluators, such as judge-large and judge-small. Evaluators in the same family share the same criteria, such as patronus:is-concise.

Evaluator Alias

  • Aliases let you refer to the latest and most advanced evaluator in an Evaluator Family, rather than specifying an exact version.
  • For example, the alias judge-large always points to the newest large judge evaluator. If you instead reference judge-large-2024-05-16 directly, you'll need to manually update it to judge-large-2024-11-14 when the newer version becomes available. The snippet below contrasts the two styles of reference.
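
To make the difference concrete, here is a small sketch of the two reference styles, reusing the assumed "evaluator"/"criteria" field names from the earlier request sketch.

```python
# Alias: always resolves to the newest large judge evaluator, no manual updates needed.
alias_reference = {"evaluator": "judge-large", "criteria": "patronus:is-concise"}

# Pinned version: stays on the 2024-05-16 release until you update the id yourself.
pinned_reference = {"evaluator": "judge-large-2024-05-16", "criteria": "patronus:is-concise"}
```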

Evaluation

An evaluation is a single execution of a performance assessment of an LLM or GenAI system, and it produces evaluation results. Evaluations assess the accuracy, quality, format, and capabilities of your AI system to identify risks and provide critical insights for improving system performance.

In Patronus, an evaluation refers to an assessment of your LLM system at a single point in time. An evaluation takes in a snapshot of your LLM system, which includes the following information (combined into a request body in the sketch after this list):

  • evaluated_model_input: The user input to the model you are evaluating.
  • evaluated_model_output: The output of the model you are evaluating.
  • evaluated_model_retrieved_context: Any extra context passed to the model you are evaluating, like from a retrieval system.
  • evaluated_model_gold_answer: The "correct" or expected answer to the user input.
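
Putting the four snapshot fields together, a request body for a RAG-style evaluation might look like the sketch below. The snapshot field names come from the list above; the evaluator entry and the overall payload layout follow the assumptions of the earlier request sketch, so treat them as illustrative rather than authoritative.

```python
# One evaluation snapshot for a RAG pipeline (illustrative values).
snapshot = {
    "evaluated_model_input": "What is the capital of France?",
    "evaluated_model_output": "The capital of France is Paris.",
    "evaluated_model_retrieved_context": [
        "Paris is the capital and most populous city of France."
    ],
    "evaluated_model_gold_answer": "Paris",
}

# Combine the snapshot with the evaluators to run, then POST it to /v1/evaluate
# exactly as in the earlier judge example (payload layout is an assumption).
payload = {
    "evaluators": [{"evaluator": "context-relevance-large"}],
    **snapshot,
}
```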

Evaluation Result

Evaluations produce evaluation results, which contain outputs from Evaluators. Evaluation results are returned in the output of the POST /v1/evaluate API and can be viewed in the Logs dashboard. They are also returned by Evaluation Runs and the Patronus Experimentation SDK (both currently Enterprise features).

  • Evaluation Results contain all inputs for the evaluation as well as a PASS/FAIL rating, a raw score, and evaluator-specific metadata; the sketch below shows how these fields might be read from a response.
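
The sketch below shows how the pieces of an evaluation result might be read out of a /v1/evaluate response. The field names used here (results, evaluation_result, pass, score_raw) are assumptions made for illustration; the Logs dashboard and the API reference are the authoritative sources for the real response schema.

```python
# Illustrative response body; the field names are assumptions, not the documented schema.
response_body = {
    "results": [
        {
            "evaluator_id": "judge-large-2024-10-14",
            "criteria": "patronus:is-concise",
            "evaluation_result": {"pass": True, "score_raw": 0.92},
        }
    ]
}

# Print a one-line summary per evaluator: PASS/FAIL rating plus the raw score.
for result in response_body["results"]:
    outcome = result["evaluation_result"]
    label = "PASS" if outcome["pass"] else "FAIL"
    print(f"{result['evaluator_id']} ({result['criteria']}): {label}, score={outcome['score_raw']}")
```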