Annotations

Concepts

Understanding annotations in Patronus AI

What are annotations?

Annotations let you capture human feedback on your AI system's outputs. Think of them as a way to layer human judgment on top of automated evaluation - they help you understand quality in ways that metrics alone can't capture.

You can annotate different types of resources in Patronus:

  • Traces: Complete workflow executions
  • Spans: Individual operations within traces
  • Logs: Logged LLM interactions
  • Evaluations: Automated evaluation results
  • Trace insights: Percival-generated insights

Annotation criteria

Before you can annotate something, you define annotation criteria - templates that specify what you're measuring and how. Each criterion includes:

  • Name and description: What the annotation measures
  • Annotation type: The format for collecting feedback (explained below)
  • Resource types: Which resources the criterion can be applied to
  • Categories: Predefined options when using categorical or discrete types

Once you've defined criteria, anyone on your team can use them to annotate consistently.
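
To make this concrete, here is a minimal sketch of what a criterion bundles together, written as a plain Python structure. This is not the Patronus SDK or API schema; the class and field names are illustrative assumptions.

```python
# Hypothetical model of an annotation criterion; names are illustrative,
# not the Patronus SDK or API schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotationCriterion:
    name: str                     # what the annotation measures
    description: str              # guidance shown to annotators
    annotation_type: str          # "binary", "continuous", "discrete", "categorical", or "text"
    resource_types: list[str]     # e.g. ["trace", "span", "log", "evaluation"]
    categories: Optional[list[str]] = None  # only for categorical or discrete types

task_success = AnnotationCriterion(
    name="task_success",
    description="Did the agent complete the user's task end to end?",
    annotation_type="binary",
    resource_types=["trace", "span"],
)
```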

Annotation types

Different annotation types suit different kinds of feedback:

Binary annotations

Simple yes/no or pass/fail evaluations for quality gates.

When to use: Quick quality checks, compliance verification, or acceptability judgments.

Examples:

  • Did the agent complete the task successfully?
  • Is this output acceptable for production?
  • Does this response contain sensitive information?
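
As a toy illustration (the record shape is an assumption, not the Patronus API), a binary annotation pairs a criterion with a boolean verdict, which makes it easy to turn into a quality gate:

```python
# Hypothetical binary annotation records; field names are illustrative only.
binary_annotations = [
    {"criterion": "task_success", "resource_id": "span-123", "value": True},
    {"criterion": "task_success", "resource_id": "span-456", "value": False},
]

# Simple quality gate: flag the batch if the pass rate drops below a threshold.
pass_rate = sum(a["value"] for a in binary_annotations) / len(binary_annotations)
assert pass_rate >= 0.5, f"Pass rate {pass_rate:.0%} is below threshold"
```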

Continuous annotations

Numeric scores on an open-ended scale.

When to use: Measuring degrees of quality where you need fine-grained scoring.

Examples:

  • Relevance score (0-10)
  • Confidence level (0-100)
  • Response quality rating
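
A continuous annotation is just a number, so analysis usually reduces to range checks and aggregation. The snippet below is a small sketch with invented scores on the 0-10 relevance scale from the examples above:

```python
# Hypothetical continuous annotation values on a 0-10 relevance scale.
relevance_scores = [7.5, 9.0, 4.0, 8.25]

# Enforce the configured range before aggregating.
assert all(0 <= s <= 10 for s in relevance_scores), "score outside configured range"
mean_relevance = sum(relevance_scores) / len(relevance_scores)
print(f"Mean relevance: {mean_relevance:.2f} / 10")
```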

Discrete annotations

Predefined numeric ratings with labels, like Likert scales.

When to use: Structured ratings where you want consistency across annotators.

Examples:

  • Quality: Poor (1), Fair (3), Good (5), Excellent (7)
  • User satisfaction levels
  • Helpfulness ratings
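
Because discrete ratings carry both a label and a numeric value, a common pattern is to map labels to values for analysis. The sketch below mirrors the Quality scale above; the mapping itself is an illustration, not a Patronus construct:

```python
# Hypothetical Likert-style scale mirroring the Quality example above.
QUALITY_SCALE = {"Poor": 1, "Fair": 3, "Good": 5, "Excellent": 7}

# Annotators choose labels; analysis works with the numeric values behind them.
labels = ["Good", "Excellent", "Fair", "Good"]
values = [QUALITY_SCALE[label] for label in labels]
mean_quality = sum(values) / len(values)
print(f"Mean quality: {mean_quality:.1f} on the 1-7 scale")
```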

Categorical annotations

Classification into predefined text categories.

When to use: Categorizing outputs into specific types or classes.

Examples:

  • Error types: Timeout, Permission Denied, Logic Error
  • Intent categories: Question, Request, Complaint
  • Content classification: Safe, Borderline, Unsafe
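
Categorical annotations lend themselves to counting and filtering by class. Here is a brief sketch using the error-type categories from the examples above; the enum and data are invented for illustration:

```python
# Hypothetical categorical annotation: classify failures into predefined error types.
from collections import Counter
from enum import Enum

class ErrorType(str, Enum):
    TIMEOUT = "Timeout"
    PERMISSION_DENIED = "Permission Denied"
    LOGIC_ERROR = "Logic Error"

observed = [ErrorType.TIMEOUT, ErrorType.LOGIC_ERROR, ErrorType.TIMEOUT]
print(Counter(e.value for e in observed))  # Counter({'Timeout': 2, 'Logic Error': 1})
```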

Text annotations

Free-form written feedback.

When to use: Qualitative insights, detailed explanations, or improvement suggestions.

Examples:

  • Improvement suggestions
  • Qualitative observations
  • Detailed failure analysis
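
A text annotation's value is free-form prose, typically stored alongside the criterion and the resource it describes. The record below is a hypothetical shape for such an annotation, not a Patronus API schema:

```python
# Hypothetical text annotation record; field names are illustrative only.
text_annotation = {
    "criterion": "failure_analysis",
    "resource_type": "trace",
    "resource_id": "trace-789",
    "value": "The agent retrieved the right document but summarized the wrong section.",
    "annotator": "reviewer@example.com",
}
print(text_annotation["value"])
```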

Common use cases

Quality review

Have team members review AI outputs to identify issues automated evaluations might miss. This is especially valuable for:

  • Spotting edge cases
  • Validating correctness in nuanced situations
  • Gathering qualitative feedback

Metric validation

Compare human annotations with automated evaluation results to understand how well your metrics match human judgment:

  • Measure agreement between humans and LLM judges
  • Calibrate automated metrics
  • Identify where automated evaluation fails
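
To make the agreement check concrete, the sketch below compares paired pass/fail labels from human annotators and an LLM judge. It assumes you have already exported the paired labels (that export step is not shown) and uses scikit-learn's cohen_kappa_score for chance-corrected agreement:

```python
# Sketch: agreement between human annotations and an automated evaluator.
# The paired labels are invented; in practice they would come from your
# exported annotations and evaluation results.
from sklearn.metrics import cohen_kappa_score

human = [1, 1, 0, 1, 0, 0, 1, 1]       # binary human annotations
llm_judge = [1, 1, 0, 0, 0, 1, 1, 1]   # automated evaluator verdicts

raw_agreement = sum(h == j for h, j in zip(human, llm_judge)) / len(human)
kappa = cohen_kappa_score(human, llm_judge)  # agreement corrected for chance
print(f"Raw agreement: {raw_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```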

Dataset creation

Create high-quality labeled datasets by having humans annotate:

  • Expected outputs for new test cases
  • Failure modes and error categories
  • Ground truth labels for training data
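
One common output of this workflow is a labeled dataset file. The sketch below writes annotated records to JSONL; the record structure is a hypothetical example, so adapt it to however you export annotations:

```python
# Sketch: turn annotated records into a JSONL labeled dataset.
import json

annotated = [
    {"input": "Reset my password", "output": "Here's how to reset it...", "label": "Request"},
    {"input": "Your app deleted my files!", "output": "I'm sorry to hear that...", "label": "Complaint"},
]

with open("labeled_dataset.jsonl", "w") as f:
    for record in annotated:
        f.write(json.dumps(record) + "\n")
```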

Workflow tracking

For multi-step agent workflows, annotate individual steps to understand:

  • Which steps succeed or fail
  • How well agents coordinate
  • Overall workflow quality
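
As a rough sketch of how span-level annotations roll up into workflow insight, the snippet below groups invented binary annotations by step name and reports a per-step pass rate (the step names and values are assumptions for illustration):

```python
# Sketch: aggregate span-level binary annotations by workflow step.
from collections import defaultdict

span_annotations = [
    {"step": "retrieve", "value": True},
    {"step": "retrieve", "value": True},
    {"step": "plan", "value": False},
    {"step": "plan", "value": True},
    {"step": "respond", "value": True},
]

by_step: dict[str, list[bool]] = defaultdict(list)
for a in span_annotations:
    by_step[a["step"]].append(a["value"])

for step, results in by_step.items():
    print(f"{step}: {sum(results)}/{len(results)} passed")
```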

Best practices

Designing good annotation criteria

  • Use clear, descriptive names that indicate what you're measuring
  • Write comprehensive descriptions so annotators understand expectations
  • Choose the annotation type that matches your analysis needs
  • Consider whether criteria should apply to all resources or specific types

Running annotation workflows

  • Establish consistent guidelines so your team annotates uniformly
  • Encourage annotators to use the explanation field for context
  • Periodically check inter-annotator agreement to ensure consistency
  • Combine multiple annotation types for comprehensive evaluation

Next steps