Annotations

Concepts

Understanding annotations in Patronus AI

What are annotations?

Annotations let you capture human feedback on your AI system's outputs. Think of them as a way to layer human judgment on top of automated evaluation - they help you understand quality in ways that metrics alone can't capture.

You can annotate different types of resources in Patronus:

  • Traces: Complete workflow executions
  • Spans: Individual operations within traces
  • Logs: Logged LLM interactions
  • Evaluations: Automated evaluation results
  • Trace insights: Percival-generated insights

Annotation criteria

Before you can annotate something, you define annotation criteria - templates that specify what you're measuring and how. Each criterion includes:

  • Name and description: What the annotation measures
  • Annotation type: The format for collecting feedback (explained below)
  • Resource types: Which resources the criterion can be applied to
  • Categories: Predefined options when using categorical or discrete types

Once you've defined criteria, anyone on your team can use them to annotate consistently.
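
To make this concrete, here is a minimal sketch of what a criterion bundles together, written as a plain Python structure. This is not the Patronus SDK or API schema; the class and field names are illustrative assumptions.

```python
# Hypothetical model of an annotation criterion; names are illustrative,
# not the Patronus SDK or API schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AnnotationCriterion:
    name: str                     # what the annotation measures
    description: str              # guidance shown to annotators
    annotation_type: str          # "binary", "continuous", "discrete", "categorical", or "text"
    resource_types: list[str]     # e.g. ["trace", "span", "log", "evaluation"]
    categories: Optional[list[str]] = None  # only for categorical or discrete types

task_success = AnnotationCriterion(
    name="task_success",
    description="Did the agent complete the user's task end to end?",
    annotation_type="binary",
    resource_types=["trace", "span"],
)
```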

Annotation types

Different annotation types suit different kinds of feedback:

Binary annotations

Simple yes/no or pass/fail evaluations for quality gates.

When to use: Quick quality checks, compliance verification, or acceptability judgments.

Examples:

  • Did the agent complete the task successfully?
  • Is this output acceptable for production?
  • Does this response contain sensitive information?
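
As a toy illustration (the record shape is an assumption, not the Patronus API), a binary annotation pairs a criterion with a boolean verdict, which makes it easy to turn into a quality gate:

```python
# Hypothetical binary annotation records; field names are illustrative only.
binary_annotations = [
    {"criterion": "task_success", "resource_id": "span-123", "value": True},
    {"criterion": "task_success", "resource_id": "span-456", "value": False},
]

# Simple quality gate: flag the batch if the pass rate drops below a threshold.
pass_rate = sum(a["value"] for a in binary_annotations) / len(binary_annotations)
assert pass_rate >= 0.5, f"Pass rate {pass_rate:.0%} is below threshold"
```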

Continuous annotations

Numeric scores on an open-ended scale.

When to use: Measuring degrees of quality where you need fine-grained scoring.

Examples:

  • Relevance score (0-10)
  • Confidence level (0-100)
  • Response quality rating
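
A continuous annotation is just a number, so analysis usually reduces to range checks and aggregation. The snippet below is a small sketch with invented scores on the 0-10 relevance scale from the examples above:

```python
# Hypothetical continuous annotation values on a 0-10 relevance scale.
relevance_scores = [7.5, 9.0, 4.0, 8.25]

# Enforce the configured range before aggregating.
assert all(0 <= s <= 10 for s in relevance_scores), "score outside configured range"
mean_relevance = sum(relevance_scores) / len(relevance_scores)
print(f"Mean relevance: {mean_relevance:.2f} / 10")
```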

Discrete annotations

Predefined numeric ratings with labels, like Likert scales.

When to use: Structured ratings where you want consistency across annotators.

Examples:

  • Quality: Poor (1), Fair (3), Good (5), Excellent (7)
  • User satisfaction levels
  • Helpfulness ratings
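
Because discrete ratings carry both a label and a numeric value, a common pattern is to map labels to values for analysis. The sketch below mirrors the Quality scale above; the mapping itself is an illustration, not a Patronus construct:

```python
# Hypothetical Likert-style scale mirroring the Quality example above.
QUALITY_SCALE = {"Poor": 1, "Fair": 3, "Good": 5, "Excellent": 7}

# Annotators choose labels; analysis works with the numeric values behind them.
labels = ["Good", "Excellent", "Fair", "Good"]
values = [QUALITY_SCALE[label] for label in labels]
mean_quality = sum(values) / len(values)
print(f"Mean quality: {mean_quality:.1f} on the 1-7 scale")
```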

Categorical annotations

Classification into predefined text categories.

When to use: Categorizing outputs into specific types or classes.

Examples:

  • Error types: Timeout, Permission Denied, Logic Error
  • Intent categories: Question, Request, Complaint
  • Content classification: Safe, Borderline, Unsafe
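
Categorical annotations lend themselves to counting and filtering by class. Here is a brief sketch using the error-type categories from the examples above; the enum and data are invented for illustration:

```python
# Hypothetical categorical annotation: classify failures into predefined error types.
from collections import Counter
from enum import Enum

class ErrorType(str, Enum):
    TIMEOUT = "Timeout"
    PERMISSION_DENIED = "Permission Denied"
    LOGIC_ERROR = "Logic Error"

observed = [ErrorType.TIMEOUT, ErrorType.LOGIC_ERROR, ErrorType.TIMEOUT]
print(Counter(e.value for e in observed))  # Counter({'Timeout': 2, 'Logic Error': 1})
```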

Text annotations

Free-form written feedback.

When to use: Qualitative insights, detailed explanations, or improvement suggestions.

Examples:

  • Improvement suggestions
  • Qualitative observations
  • Detailed failure analysis
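
A text annotation's value is free-form prose, typically stored alongside the criterion and the resource it describes. The record below is a hypothetical shape for such an annotation, not a Patronus API schema:

```python
# Hypothetical text annotation record; field names are illustrative only.
text_annotation = {
    "criterion": "failure_analysis",
    "resource_type": "trace",
    "resource_id": "trace-789",
    "value": "The agent retrieved the right document but summarized the wrong section.",
    "annotator": "reviewer@example.com",
}
print(text_annotation["value"])
```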

Common use cases

Quality review

Have team members review AI outputs to identify issues automated evaluations might miss. This is especially valuable for:

  • Spotting edge cases
  • Validating correctness in nuanced situations
  • Gathering qualitative feedback

Metric validation

Compare human annotations with automated evaluation results to understand how well your metrics match human judgment:

  • Measure agreement between humans and LLM judges
  • Calibrate automated metrics
  • Identify where automated evaluation fails
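
To make the agreement check concrete, the sketch below compares paired pass/fail labels from human annotators and an LLM judge. It assumes you have already exported the paired labels (that export step is not shown) and uses scikit-learn's cohen_kappa_score for chance-corrected agreement:

```python
# Sketch: agreement between human annotations and an automated evaluator.
# The paired labels are invented; in practice they would come from your
# exported annotations and evaluation results.
from sklearn.metrics import cohen_kappa_score

human = [1, 1, 0, 1, 0, 0, 1, 1]       # binary human annotations
llm_judge = [1, 1, 0, 0, 0, 1, 1, 1]   # automated evaluator verdicts

raw_agreement = sum(h == j for h, j in zip(human, llm_judge)) / len(human)
kappa = cohen_kappa_score(human, llm_judge)  # agreement corrected for chance
print(f"Raw agreement: {raw_agreement:.0%}, Cohen's kappa: {kappa:.2f}")
```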

Dataset creation

Create high-quality labeled datasets by having humans annotate:

  • Expected outputs for new test cases
  • Failure modes and error categories
  • Ground truth labels for training data
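
One common output of this workflow is a labeled dataset file. The sketch below writes annotated records to JSONL; the record structure is a hypothetical example, so adapt it to however you export annotations:

```python
# Sketch: turn annotated records into a JSONL labeled dataset.
import json

annotated = [
    {"input": "Reset my password", "output": "Here's how to reset it...", "label": "Request"},
    {"input": "Your app deleted my files!", "output": "I'm sorry to hear that...", "label": "Complaint"},
]

with open("labeled_dataset.jsonl", "w") as f:
    for record in annotated:
        f.write(json.dumps(record) + "\n")
```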

Workflow tracking

For multi-step agent workflows, annotate individual steps to understand:

  • Which steps succeed or fail
  • How well agents coordinate
  • Overall workflow quality
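
As a rough sketch of how span-level annotations roll up into workflow insight, the snippet below groups invented binary annotations by step name and reports a per-step pass rate (the step names and values are assumptions for illustration):

```python
# Sketch: aggregate span-level binary annotations by workflow step.
from collections import defaultdict

span_annotations = [
    {"step": "retrieve", "value": True},
    {"step": "retrieve", "value": True},
    {"step": "plan", "value": False},
    {"step": "plan", "value": True},
    {"step": "respond", "value": True},
]

by_step: dict[str, list[bool]] = defaultdict(list)
for a in span_annotations:
    by_step[a["step"]].append(a["value"])

for step, results in by_step.items():
    print(f"{step}: {sum(results)}/{len(results)} passed")
```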

Best practices

Designing good annotation criteria

  • Use clear, descriptive names that indicate what you're measuring
  • Write comprehensive descriptions so annotators understand expectations
  • Choose the annotation type that matches your analysis needs
  • Consider whether criteria should apply to all resources or specific types

Running annotation workflows

  • Establish consistent guidelines so your team annotates uniformly
  • Encourage annotators to use the explanation field for context
  • Periodically check inter-annotator agreement to ensure consistency
  • Combine multiple annotation types for comprehensive evaluation

Next steps