Core Concepts
Estimated Time: 2 minutes
Evaluators
Evaluators are automated infrastructure that executes evaluations to produce evaluation results, i.e., measurable assessments of an AI system's performance. For example, an evaluator scoring the relevance of retrieved chunks can be used to assess retriever quality in a RAG pipeline, and an evaluator detecting toxic outputs can be used to protect chatbot users from unsafe content.
- Evaluators can be powered by heuristic systems, classifiers, carefully tuned LLM judges or user-defined functions. Patronus provides a suite of state-of-the-art Evaluators off-the-shelf covering a broad set of categories that are benchmarked for accuracy and human alignment.
- Evaluators are versioned by the date they are released. So, `judge-large-2024-10-14` is the large version of the judge evaluator released on 10/14/2024 (the snippet below illustrates the naming convention).
- Some evaluators have different size and performance tiers, which you can tailor to use cases based on the task complexity and speed requirements. For example, `context-relevance-large` produces evaluation results with higher latency and supports more complex reasoning, compared to `context-relevance-small`.
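The versioned names above read as family, size tier, and release date. The snippet below is purely an illustration of that convention and is not part of any Patronus SDK or API; it simply assumes every versioned name ends in a `YYYY-MM-DD` date.

```python
from datetime import date

def parse_evaluator_name(name: str) -> tuple[str, str, date]:
    """Illustrative only: split a versioned evaluator name such as
    'judge-large-2024-10-14' into (family, size tier, release date).
    Families may themselves contain hyphens (e.g. 'context-relevance')."""
    *family_parts, tier, yyyy, mm, dd = name.split("-")
    return "-".join(family_parts), tier, date(int(yyyy), int(mm), int(dd))

print(parse_evaluator_name("judge-large-2024-10-14"))
# -> ('judge', 'large', datetime.date(2024, 10, 14))
```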
Judge Evaluators
Judge evaluators are LLM-powered evaluators that execute evaluations against user-defined criteria. A Judge evaluator must be instantiated with a criterion. For example, a no-apologies judge evaluator flags outputs that contain apologies, and an is-concise judge evaluator scores the conciseness of the output.
Judge Evaluator Criterion
- Judge evaluators are tuned and aligned to user preferences with specified criteria. A criterion contains a natural language instruction that defines the evaluation outcome. You can create and manage Judge evaluators in the Patronus platform to tailor evaluations to your use case.
- Patronus supports in-house judge evaluators applicable to a variety of use cases. Patronus judge evaluators have the `patronus:` prefix and have broad evaluation coverage, including tone of voice, format, safety, and common constraints; the sketch after this list shows how an evaluator and criterion are paired in a request.
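To make the pairing concrete, here is a minimal sketch of calling the `POST /v1/evaluate` API (described under Evaluation Result below) with a judge evaluator and the `patronus:is-concise` criterion. The base URL, authentication header, and request field names here are assumptions for illustration; consult the Patronus API reference for the exact schema.

```python
import os
import requests

# Minimal sketch: pairing a judge evaluator with a criterion.
# Base URL, header name, and body field names are assumptions;
# check the API reference for the authoritative schema.
response = requests.post(
    "https://api.patronus.ai/v1/evaluate",
    headers={"X-API-KEY": os.environ["PATRONUS_API_KEY"]},
    json={
        "evaluators": [
            {"evaluator": "judge-large", "criteria": "patronus:is-concise"}
        ],
        "evaluated_model_input": "Explain our refund policy in one sentence.",
        "evaluated_model_output": "We offer refunds within 30 days of purchase.",
    },
    timeout=30,
)
response.raise_for_status()
print(response.json())
```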
Evaluator Family
An evaluator family is a grouping of Evaluators that execute the same type of evaluation.
- For example, the `judge` family groups together all judge evaluators, like `judge-large` and `judge-small`. They share the same criteria, like `patronus:is-concise`.
Evaluator Alias
- Aliases let you refer to the latest and most advanced evaluator in the Evaluator Family, rather than specifying an exact version.
- For example, the alias `judge-large` always points to the newest large custom evaluator. If you instead directly reference `judge-large-2024-05-16`, you'll need to manually update it to `judge-large-2024-11-14` when a new version is available (the snippet below mimics this resolution step).
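As a rough illustration of why aliases are convenient, the snippet below mimics the resolution that the platform performs for you: the alias stays fixed in your code while the version it resolves to moves forward. The mapping is illustrative only, not an API call.

```python
# Illustrative only: the Patronus platform resolves aliases server-side.
ALIAS_TO_LATEST = {
    "judge-large": "judge-large-2024-11-14",  # previously judge-large-2024-05-16
}

def resolve(evaluator_name: str) -> str:
    """Return the latest dated version for an alias; dated names pass through."""
    return ALIAS_TO_LATEST.get(evaluator_name, evaluator_name)

print(resolve("judge-large"))             # -> judge-large-2024-11-14 (tracks latest)
print(resolve("judge-large-2024-05-16"))  # -> judge-large-2024-05-16 (stays pinned)
```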
Evaluation
An evaluation describes an execution of a performance assessment for an LLM or GenAI system, and it produces evaluation results. Evaluations assess the accuracy, quality, format, and capabilities of your AI system to identify risks and provide critical insights for improving system performance.
In Patronus, an evaluation refers to an assessment of your LLM system at a single point in time. An evaluation takes in a snapshot of your LLM system, which includes the following information (a request sketch using these fields follows the list):
- `evaluated_model_input`: The user input to the model you are evaluating.
- `evaluated_model_output`: The output of the model you are evaluating.
- `evaluated_model_retrieved_context`: Any extra context passed to the model you are evaluating, like from a retrieval system.
- `evaluated_model_gold_answer`: The "correct" or expected answer to the user input.
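As a hedged sketch, the snapshot below wires all four fields into a `POST /v1/evaluate` request for the `context-relevance-large` evaluator mentioned earlier. Only the four `evaluated_model_*` names come from the list above; the base URL, header, `evaluators` field, and the exact type expected for retrieved context are assumptions.

```python
import os
import requests

# Snapshot of a RAG system at a single point in time. Only the four
# evaluated_model_* field names are taken from the list above; the rest of
# the request shape is an assumption -- see the API reference for the schema.
snapshot = {
    "evaluated_model_input": "What is the refund window?",
    "evaluated_model_output": "You can request a refund within 30 days.",
    "evaluated_model_retrieved_context": [
        "Refunds are available within 30 days of purchase with proof of payment."
    ],
    "evaluated_model_gold_answer": "30 days from the purchase date.",
}

response = requests.post(
    "https://api.patronus.ai/v1/evaluate",
    headers={"X-API-KEY": os.environ["PATRONUS_API_KEY"]},
    json={"evaluators": [{"evaluator": "context-relevance-large"}], **snapshot},
    timeout=30,
)
response.raise_for_status()
```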
Evaluation Result
Evaluations produce evaluation results, which contain outputs from Evaluators. Evaluation results are returned in the output of the `POST /v1/evaluate` API and can be viewed in the Logs dashboard. They are also returned by Evaluation Runs and the Patronus Experimentation SDK (both currently Enterprise features).
- Evaluation Results contain all inputs for the evaluation as well as a PASS/FAIL rating, raw score, and evaluator-specific metadata (a small inspection sketch follows).
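Continuing the request sketches above, a quick way to inspect an evaluation result is to pretty-print the response body and locate the PASS/FAIL rating, raw score, and evaluator-specific metadata. The exact response field names are not specified here, so this sketch avoids assuming them; the same information is visible in the Logs dashboard without any code.

```python
import json

def show_result(payload: dict) -> None:
    """Pretty-print an evaluation result body returned by POST /v1/evaluate.
    Look for the PASS/FAIL rating, raw score, and evaluator-specific metadata
    described in this section; exact field names may differ by API version."""
    print(json.dumps(payload, indent=2))

# Usage with the `response` object from the request sketches above:
# show_result(response.json())
```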