Annotations
Comprehensive guide to creating and managing annotations
Human annotations are a critical component of AI evaluation systems, allowing engineers, domain experts, and end users to give feedback on AI system outputs. Patronus annotations enable you to:
Quality Review
Have humans review AI system outputs in experiments or live monitoring
Metric Validation
Validate how well automated metrics, such as LLM judges or Patronus evaluators, agree with human judgment
Comprehensive Feedback
Store comments on logs, spans, traces, and evaluations
Dataset Enhancement
Create more challenging test datasets and track quality metrics across agentic workflows
Core Concepts
Annotation Criteria
Annotation Criteria define the structure and validation rules for annotations. They act as templates that specify:
- Name and Description: Human-readable labels explaining what is being measured
- Annotation Type: The data collection method (binary, continuous, discrete, categorical, or text)
- Resource Types: Which entities the criteria can be applied to
- Categories: Predefined options for discrete/categorical types
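For concreteness, here is a minimal sketch of what defining a criterion could look like over a REST API. The endpoint path, auth header, and field names are illustrative assumptions, not the authoritative Patronus schema; refer to the API reference or SDK for the exact shapes.

```python
import os

import requests

# Illustrative sketch only: the endpoint path, auth header, and field names
# below are assumptions, not the authoritative Patronus API schema.
criterion = {
    "name": "answer-helpfulness",
    "description": "How helpful is the final answer to the end user?",
    "annotation_type": "discrete",        # binary | continuous | discrete | categorical | text
    "resource_types": ["trace", "log"],   # empty list = usable on any resource type
    "categories": [
        {"label": "Not helpful", "score": 1},
        {"label": "Somewhat helpful", "score": 2},
        {"label": "Very helpful", "score": 3},
    ],
}

response = requests.post(
    "https://api.patronus.ai/v1/annotation-criteria",  # assumed path
    headers={"X-API-KEY": os.environ["PATRONUS_API_KEY"]},
    json=criterion,
    timeout=30,
)
response.raise_for_status()
print(response.json())
```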
Resource Types
Annotation Criteria can be configured to work with specific resource types:
Trace
Entire traces representing complete workflow executions
Span
Individual spans within traces representing specific operations
Log
Log entries from system operations
Evaluations
Results from automated evaluators
Trace Insights
Insights generated by Percival from trace analysis
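For example, a criterion meant only for span- and trace-level review might restrict its resource types as in the sketch below (field names are again assumed, not confirmed):

```python
# Illustrative: restrict where a criterion can be applied. An empty
# resource_types list would make it available for every resource type.
tool_call_criterion = {
    "name": "tool-call-correctness",
    "description": "Did this step call the right tool with valid arguments?",
    "annotation_type": "binary",
    "resource_types": ["span", "trace"],  # only offered when annotating spans or traces
}
```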
Annotations
Annotations are the actual data points created using annotation criteria. Each annotation contains:
- Type and Reference: What entity is being annotated (trace, span, log, evaluation, trace_insight) and its identifier
- Values: The annotation data (pass/fail, score, or text)
- Explanation: Optional reasoning for the annotation
- Metadata: Timestamps, project association, and experiment context
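A hedged sketch of recording an annotation against an existing trace follows, assuming an endpoint and payload shape analogous to the criterion example above; the IDs and the value shape are hypothetical placeholders.

```python
import os

import requests

# Sketch of recording an annotation against an existing trace. IDs, the
# endpoint path, and the value shape are hypothetical placeholders.
annotation = {
    "annotation_criteria_id": "criterion-123",  # hypothetical criterion ID
    "type": "trace",                            # trace | span | log | evaluation | trace_insight
    "reference_id": "trace-abc-456",            # hypothetical trace ID
    "value": {"score": 3},                      # shape depends on the criterion's annotation type
    "explanation": "Fully answered the question and cited its sources.",
}

resp = requests.post(
    "https://api.patronus.ai/v1/annotations",   # assumed path
    headers={"X-API-KEY": os.environ["PATRONUS_API_KEY"]},
    json=annotation,
    timeout=30,
)
resp.raise_for_status()
```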
Annotation Types
Binary Annotations
Purpose: Use for pass/fail evaluations, boolean decisions, or yes/no assessments. Perfect for quality gates, compliance checks, or simple success/failure tracking in agentic workflows.
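A minimal binary criterion and a sample annotator value might look like this (illustrative field names):

```python
# Binary criterion: annotators record a simple pass/fail judgment.
binary_criterion = {
    "name": "meets-compliance-policy",
    "description": "Does the output comply with the internal content policy?",
    "annotation_type": "binary",
}

# Example value an annotator would submit for one trace.
binary_value = {"pass": True}
```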
Continuous Annotations
Purpose: Use for numeric scores on unlimited scales. Ideal for measuring performance metrics, confidence scores, or any quantitative assessment that doesn't fit predefined ranges.
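A continuous criterion needs no predefined categories; a sketch (same assumed field names as above):

```python
# Continuous criterion: any numeric score, not limited to predefined steps.
continuous_criterion = {
    "name": "perceived-responsiveness",
    "description": "How responsive did the agent feel? Higher is better.",
    "annotation_type": "continuous",
}

continuous_value = {"score": 87.5}
```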
Discrete Annotations
Purpose: Use for ratings with predefined numeric values and labels. Perfect for Likert scales, quality ratings, or any evaluation where you want consistent scoring with human-readable labels.
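A discrete criterion pairs each allowed score with a label, as in this Likert-style sketch (illustrative):

```python
# Discrete criterion: a fixed set of numeric scores, each with a label.
discrete_criterion = {
    "name": "answer-quality",
    "description": "Overall quality of the final answer",
    "annotation_type": "discrete",
    "categories": [
        {"label": "Poor", "score": 1},
        {"label": "Acceptable", "score": 3},
        {"label": "Excellent", "score": 5},
    ],
}

discrete_value = {"score": 5}
```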
Categorical Annotations
Purpose: Use for classification into predefined text categories. Ideal for error types, content classification, or any scenario where you need to categorize rather than score.
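In a categorical criterion the categories carry labels but no scores; a sketch under the same assumptions:

```python
# Categorical criterion: classification into labeled buckets without scores.
categorical_criterion = {
    "name": "failure-mode",
    "description": "Which failure mode, if any, does this trace exhibit?",
    "annotation_type": "categorical",
    "categories": [
        {"label": "hallucination"},
        {"label": "tool-misuse"},
        {"label": "refusal"},
        {"label": "none"},
    ],
}

categorical_value = {"label": "tool-misuse"}
```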
Text Annotations
Purpose: Use for free-form feedback, detailed explanations, or any textual input that doesn't fit predefined categories. Perfect for qualitative feedback, improvement suggestions, or detailed observations.
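For a text criterion the annotation value is free-form prose; a final illustrative sketch:

```python
# Text criterion: free-form written feedback from the reviewer.
text_criterion = {
    "name": "reviewer-notes",
    "description": "Free-form notes on how the agent's response could improve",
    "annotation_type": "text",
}

text_value = {"text": "Correct answer, but it buries the key step; lead with the fix."}
```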
Best Practices
Design Guidelines
- Use descriptive names that clearly indicate what is being measured
- Provide comprehensive descriptions explaining the evaluation criteria
- Choose appropriate annotation types based on your analysis needs
- Use resource_types strategically: leave it empty for criteria that apply to every resource, or list specific types for targeted evaluation
Workflow Tips
- Establish consistent annotation guidelines across your team
- Use explanations to provide context for annotation decisions
- Regularly review annotation quality and inter-annotator agreement
- Combine multiple annotation types for comprehensive evaluation