Glossary
Logs
Logs contain data associated with AI application executions. A log is a single data sample containing the inputs and outputs of the AI application.
Dataset
A dataset is a collection of data samples (inputs and outputs of the AI application). Often, a dataset will contain the static inputs to your AI system, and the outputs of the AI system will be dynamically generated at run time for evals.
Evals
Evals refer to the process of evaluating AI application outputs. In the simplest terms, you can think of an eval as a log + evaluation result + explanation (optional).
An evaluation describes an execution of a performance assessment for an LLM or GenAI system. Evaluations produce evaluation results. Evaluations assess the accuracy, quality, format and capabilities of your AI system to identify risks and provide critical insights to improve system performance.
In Patronus, an evaluation refers to an assessment of your LLM system at a single point in time. An evaluation takes in a snapshot of your LLM system, which includes the following information:
- evaluated_model_input: The user input to the model you are evaluating.
- evaluated_model_output: The output of the model you are evaluating.
- evaluated_model_retrieved_context: Any extra context passed to the model you are evaluating, such as from a retrieval system.
- evaluated_model_gold_answer: The "correct" or expected answer to the user input.
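To make the shape of a snapshot concrete, here is a minimal sketch of the four fields listed above as a plain Python data structure. The class name EvaluationSnapshot is hypothetical, and treating the retrieved context as a list of strings and the last two fields as optional are assumptions made for illustration; this is not the Patronus SDK.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EvaluationSnapshot:
    """Hypothetical container for the snapshot fields named in the glossary."""

    evaluated_model_input: str
    evaluated_model_output: str
    # Assumption: context is a list of text chunks and may be absent.
    evaluated_model_retrieved_context: Optional[list[str]] = None
    # Assumption: a gold answer is only supplied when one is known.
    evaluated_model_gold_answer: Optional[str] = None


snapshot = EvaluationSnapshot(
    evaluated_model_input="What is the capital of France?",
    evaluated_model_output="Paris is the capital of France.",
    evaluated_model_retrieved_context=[
        "Paris is the capital and largest city of France."
    ],
    evaluated_model_gold_answer="Paris",
)
```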
Evaluation Result
Evaluators produce evaluation results, which can be viewed in the Logs and Experiments dashboards. Evaluation Results contain all inputs for the evaluation as well as a PASS/FAIL rating, raw score, and evaluator-specific metadata.
Evaluators
Evaluators produce evaluation results on data. Evaluators can be powered by heuristic systems, classifiers, carefully tuned LLM judges or user-defined functions. For example, an evaluator scoring the relevance of retrieved chunks can be used to assess retriever quality in a RAG pipeline, and an evaluator detecting toxic outputs can be used to protect chatbot users from unsafe content.
- Patronus provides a suite of state-of-the-art, off-the-shelf Evaluators that cover a broad set of categories and are benchmarked for accuracy and human alignment.
- Some evaluators come in different size and performance tiers, which you can match to your use case based on task complexity and speed requirements. For example, context-relevance-large supports more complex reasoning than context-relevance-small, at the cost of higher latency.
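As a rough illustration of a heuristic, user-defined evaluator, the sketch below scores how much of the user input is covered by a retrieved chunk and returns a PASS/FAIL rating, a raw score, and metadata. The names EvaluationResult and keyword_overlap_evaluator, the word-overlap heuristic, and the 0.5 threshold are all assumptions for illustration, not Patronus evaluators or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class EvaluationResult:
    """Hypothetical result shape: pass/fail flag, raw score, metadata."""

    passed: bool
    raw_score: float
    metadata: dict = field(default_factory=dict)


def keyword_overlap_evaluator(
    model_input: str, retrieved_context: str, threshold: float = 0.5
) -> EvaluationResult:
    """Toy heuristic: fraction of input words that also appear in the context."""
    query_words = set(model_input.lower().split())
    context_words = set(retrieved_context.lower().split())
    if not query_words:
        # Nothing to compare against, so fail with an explanatory note.
        return EvaluationResult(False, 0.0, {"reason": "empty input"})
    overlap = query_words & context_words
    score = len(overlap) / len(query_words)
    return EvaluationResult(
        passed=score >= threshold,
        raw_score=score,
        metadata={"matched_words": sorted(overlap)},
    )


relevant = keyword_overlap_evaluator(
    "capital of france", "paris is the capital of france"
)
irrelevant = keyword_overlap_evaluator(
    "capital of france", "berlin is in germany"
)
```

A production evaluator would typically rely on a classifier or an LLM judge rather than word overlap; the point here is only the input-to-result contract.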
Judge Evaluators
Judge evaluators are LLM-powered evaluators that execute evaluations on user-defined criteria. A Judge evaluator must be instantiated with a criterion. For example, a no-apologies judge evaluator flags outputs that contain apologies, and the is-concise judge evaluator scores the conciseness of the output.
Judge Evaluator Criterion
- Judge evaluators are tuned and aligned to user preferences with specified criteria. A criterion contains a natural language instruction that defines the evaluation outcome. You can create and manage Judge evaluators in the Patronus platform to tailor evaluations to your use case.
- Patronus supports in-house judge evaluators applicable to a variety of use cases. Patronus judge evaluators have the patronus: prefix and have broad evaluation coverage, including tone of voice, format, safety and common constraints.
Evaluator Family
An evaluator family is a grouping of Evaluators that execute the evaluation.
For example, the judge family groups together all judge evaluators, such as judge-large and judge-small. They share the same criteria, such as patronus:is-concise.
Evaluator Alias
Aliases let you refer to the latest and most advanced evaluator in the Evaluator Family, rather than specifying an exact version.
For example, you can refer to the alias judge-large, which always points to the newest large custom evaluator. By contrast, if you directly reference judge-large-2024-05-16, you'll need to manually update this to judge-large-2024-11-14 when a new version is available.
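The difference between an alias and a pinned version can be sketched as a small lookup. This assumes versions are named as the alias followed by a YYYY-MM-DD date suffix, as in the examples above; the registry list, the judge-small entry, and the resolve_alias function are hypothetical and only illustrate the idea, not how Patronus resolves aliases internally.

```python
import re

# Hypothetical registry. The two judge-large names come from the text;
# the judge-small entry is invented for illustration.
AVAILABLE_VERSIONS = [
    "judge-large-2024-05-16",
    "judge-large-2024-11-14",
    "judge-small-2024-08-08",
]

# Assumed naming scheme: <alias>-<YYYY-MM-DD>.
_DATED = re.compile(r"^(?P<alias>.+)-(?P<date>\d{4}-\d{2}-\d{2})$")


def resolve_alias(name: str, versions: list[str]) -> str:
    """Return the newest dated version for an alias; pinned names pass through."""
    if _DATED.match(name):
        return name  # Already an exact version, nothing to resolve.
    candidates = []
    for version in versions:
        match = _DATED.match(version)
        if match and match.group("alias") == name:
            candidates.append((match.group("date"), version))
    if not candidates:
        raise KeyError(f"no versions found for alias {name!r}")
    # ISO dates sort chronologically as strings, so max() picks the newest.
    return max(candidates)[1]


latest = resolve_alias("judge-large", AVAILABLE_VERSIONS)
pinned = resolve_alias("judge-large-2024-05-16", AVAILABLE_VERSIONS)
```

With the alias, adding a newer dated entry to the registry changes the result automatically; the pinned name keeps returning the same version until you edit it by hand.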
Annotation
Annotations are human-assigned scores on logs, with optional comments or explanations. Annotation values can be categorical, binary, discrete, float, or text-based.
Experiments
An experiment is a collection of logs. Experiments allow you to run batched evals to compare performance across different configurations, models, and datasets, so that you can make informed decisions to optimize performance of your AI applications.
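One simple way to compare configurations is to aggregate the PASS/FAIL ratings of each experiment's logs into a pass rate. The sketch below uses made-up experiment names and ratings purely for illustration; the pass_rates helper is hypothetical and does not reflect how the Experiments dashboard computes its metrics.

```python
from collections import defaultdict

# Hypothetical logs as (experiment name, passed) pairs; the data is invented.
logs = [
    ("prompt-v1", True), ("prompt-v1", False),
    ("prompt-v1", True), ("prompt-v1", False),
    ("prompt-v2", True), ("prompt-v2", True),
    ("prompt-v2", True), ("prompt-v2", False),
]


def pass_rates(entries):
    """Compute the fraction of passing logs per experiment."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for experiment, passed in entries:
        totals[experiment] += 1
        passes[experiment] += int(passed)
    return {name: passes[name] / totals[name] for name in totals}


rates = pass_rates(logs)
# Pick the configuration with the highest pass rate.
best = max(rates, key=rates.get)
```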
Project
A project is a collection of experiments or apps. Projects maintain logical separation across different AI use cases within a team.
Account
Your account is your team's workspace. Accounts do not share visibility on resources. Multi-accounts are available to enterprise users. Accounts can host a collection of projects, datasets, evaluators, annotation criteria and more.