
Annotations

Comprehensive guide to creating and managing annotations

Human annotations are a critical component of AI evaluation systems, letting engineers, domain experts, and end users provide feedback on AI system outputs. Patronus annotations enable you to:

Quality Review

Human review of AI system outputs in experiments or live monitoring

Metric Validation

Check that automated metrics such as LLM judges or Patronus evaluators agree with human judgment

Comprehensive Feedback

Store comments on logs, spans, traces, and evaluations

Dataset Enhancement

Create more challenging test datasets and track quality metrics across agentic workflows


Core Concepts

Annotation Criteria

Annotation Criteria define the structure and validation rules for annotations. They act as templates that specify:

  • Name and Description: Human-readable labels explaining what the criteria measures
  • Annotation Type: The data collection method (binary, continuous, discrete, categorical, or text)
  • Resource Types: Which entities can be annotated using this criteria
  • Categories: Predefined options for discrete/categorical types
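
For illustration, the following is a minimal sketch of creating a binary criterion over HTTP from Python. The payload fields mirror the structure described above, but the base URL, authentication header, and endpoint path are assumptions made for the example; consult the Patronus API reference for the exact route and authentication scheme.

import os
import requests

API_BASE = "https://api.patronus.ai"  # assumed base URL
HEADERS = {"X-API-KEY": os.environ["PATRONUS_API_KEY"]}  # assumed auth header name

# Criterion payload using the fields described above.
criterion = {
    "annotation_type": "binary",
    "name": "Agent Task Success",
    "description": "Whether the agent successfully completed the assigned task",
    "resource_types": ["trace", "span"],
}

# Hypothetical endpoint path for creating annotation criteria.
response = requests.post(f"{API_BASE}/v1/annotation-criteria", json=criterion, headers=HEADERS)
response.raise_for_status()
print(response.json())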

Resource Types

Annotation Criteria can be configured to work with specific resource types:

Trace

Entire traces representing complete workflow executions

Span

Individual spans within traces representing specific operations

Log

Log entries from system operations

Evaluations

Results from automated evaluators

Trace Insights

Insights generated by Percival from trace analysis

Annotations

Annotations are the actual data points created using annotation criteria. Each annotation contains:

  • Type and Reference: What entity is being annotated (trace, span, log, evaluation, trace_insight) and its identifier
  • Values: The annotation data (pass/fail, score, or text)
  • Explanation: Optional reasoning for the annotation
  • Metadata: Timestamps, project association, and experiment context
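
As a rough sketch, a binary annotation on a span might be recorded like this. The endpoint path, authentication header, field names, and identifiers are hypothetical stand-ins modeled on the structure above; check the Patronus API reference for the real request shape.

import os
import requests

API_BASE = "https://api.patronus.ai"  # assumed base URL
HEADERS = {"X-API-KEY": os.environ["PATRONUS_API_KEY"]}  # assumed auth header name

# Annotation payload following the type/reference, value, explanation structure above.
annotation = {
    "annotation_criteria_id": "criterion-123",  # hypothetical criterion identifier
    "type": "span",                             # entity being annotated
    "reference_id": "span-abc",                 # hypothetical span identifier
    "value_pass": True,                         # binary pass/fail value
    "explanation": "The agent completed the task without manual intervention.",
}

# Hypothetical endpoint path for creating annotations.
response = requests.post(f"{API_BASE}/v1/annotations", json=annotation, headers=HEADERS)
response.raise_for_status()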

Annotation Types

Binary Annotations

Purpose: Use for pass/fail evaluations, boolean decisions, or yes/no assessments. Perfect for quality gates, compliance checks, or simple success/failure tracking in agentic workflows.

{
  "annotation_type": "binary",
  "name": "Agent Task Success",
  "description": "Whether the agent successfully completed the assigned task without errors",
  "resource_types": ["trace", "span"]
}

Continuous Annotations

Purpose: Use for numeric scores on unlimited scales. Ideal for measuring performance metrics, confidence scores, or any quantitative assessment that doesn't fit predefined ranges.

{
  "annotation_type": "continuous",
  "name": "Response Relevance Score",
  "description": "Relevance score from 0.0 to 10.0 measuring how well the response addresses the user query",
  "resource_types": ["span"]
}

Discrete Annotations

Purpose: Use for ratings with predefined numeric values and labels. Perfect for Likert scales, quality ratings, or any evaluation where you want consistent scoring with human-readable labels.

{
  "annotation_type": "discrete",
  "name": "Agent Coordination Quality",
  "description": "How well multiple agents coordinated during task execution",
  "resource_types": ["trace"],
  "categories": [
    {"label": "Poor - Conflicts occurred", "score": 1.0},
    {"label": "Fair - Some coordination", "score": 3.0},
    {"label": "Good - Well coordinated", "score": 5.0},
    {"label": "Excellent - Seamless collaboration", "score": 7.0}
  ]
}

Categorical Annotations

Purpose: Use for classification into predefined text categories. Ideal for error types, content classification, or any scenario where you need to categorize rather than score.

{
  "annotation_type": "categorical",
  "name": "Error Type Classification",
  "description": "Categorize different types of errors that occur during agent execution",
  "resource_types": [],
  "categories": [
    {"label": "API Timeout"},
    {"label": "Invalid Input"},
    {"label": "Permission Denied"},
    {"label": "Logic Error"},
    {"label": "External Service Failure"}
  ]
}

Text Annotations

Purpose: Use for free-form feedback, detailed explanations, or any textual input that doesn't fit predefined categories. Perfect for qualitative feedback, improvement suggestions, or detailed observations.

{
  "annotation_type": "text_annotation",
  "name": "Performance Improvement Notes",
  "description": "Detailed suggestions for improving agent performance and user experience",
  "resource_types": ["trace", "span", "evaluation"]
}

Best Practices

Design Guidelines

  • Use descriptive names that clearly indicate what is being measured
  • Provide comprehensive descriptions explaining the evaluation criteria
  • Choose appropriate annotation types based on your analysis needs
  • Use resource_types strategically: leave empty for universal criteria, specify for targeted evaluation

Workflow Tips

  • Establish consistent annotation guidelines across your team
  • Use explanations to provide context for annotation decisions
  • Regularly review annotation quality and inter-annotator agreement (see the sketch after this list)
  • Combine multiple annotation types for comprehensive evaluation
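
One lightweight way to review inter-annotator agreement on a binary criterion is to export paired labels from two reviewers and compute Cohen's kappa, for example with scikit-learn. The labels below are hypothetical placeholders for data you would pull from your own project.

from sklearn.metrics import cohen_kappa_score

# Pass/fail labels from two reviewers on the same set of traces (hypothetical data).
reviewer_a = [1, 1, 0, 1, 0, 1, 1, 0]
reviewer_b = [1, 0, 0, 1, 0, 1, 1, 1]

# Cohen's kappa: 1.0 is perfect agreement, 0.0 is chance-level agreement.
kappa = cohen_kappa_score(reviewer_a, reviewer_b)
print(f"Cohen's kappa: {kappa:.2f}")

Low agreement is usually a signal that the criterion description or your annotation guidelines need tightening.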
