Tracing and Debugging Agents with Percival

For developers and product teams working with agentic systems, one of the biggest challenges is observability. Teams need to know:

  • Are my prompts effective?
  • Is the agent using the tools properly?
  • Where might the agent be failing in unexpected ways?

For developers, Patronus provides full visibility into agent behavior — including outputs, tool calls, and failure points.

For product teams, Patronus makes it easy to manage prompts across multiple agents, improve accuracy, and enforce observability as you scale into production.

Patronus provides:

  • Tracing for end-to-end visibility into agent runs
  • Prompt versioning & testing to iterate and deploy changes safely
  • Percival, an agentic debugger that flags failures (hallucinations, retrieval errors, tool misuse) and suggests fixes

In this walkthrough, we’ll apply these capabilities to a simple insurance claims agent. By the end, you’ll know how to:

  1. Trace an agent
  2. Version prompts
  3. Use Percival insights to improve prompts

Video walkthrough

Watch this complete demonstration of the debugging workflow:

Follow along with the detailed steps below to implement this workflow in your own projects.

0. Initialize Environment

For this example, we'll be using the OpenAI Python SDK (Chat Completions with function calling) along with Patronus. Start by importing the required packages, initializing a Patronus project, and instrumenting the OpenAI client so all requests are traced.
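
Before running the code below, make sure the SDKs are installed and both API keys are available. A minimal setup sketch; the package names and the PATRONUS_API_KEY variable name are assumptions on our part, so verify them against the Patronus and OpenInference docs:

# Assumed prerequisites (verify exact package names):
#   pip install patronus openai openinference-instrumentation-openai
import os

# Fail fast if either key is missing; PATRONUS_API_KEY is assumed to be the
# environment variable the Patronus SDK reads at patronus.init() time.
for var in ("PATRONUS_API_KEY", "OPENAI_API_KEY"):
    if not os.getenv(var):
        raise RuntimeError(f"Set {var} before running this walkthrough.")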

import os, re, json, random, time
from datetime import datetime
from typing import Optional, Literal, Dict, Any, List
 
# --------------------------
# Patronus tracing
# --------------------------
from openinference.instrumentation.openai import OpenAIInstrumentor
import patronus
from patronus import traced, start_span
from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
from patronus.prompts import Prompt, push_prompt, load_prompt
 
# --------------------------
# OpenAI client
# --------------------------
from pydantic import BaseModel, Field
from openai import OpenAI
 
OPENAI_API_KEY = ""
MODEL = os.getenv("OPENAI_MODEL", "gpt-5")  # use your preferred gpt-5 variant
 
if not OPENAI_API_KEY:
    raise RuntimeError("Please set OPENAI_API_KEY in your environment.")
 
client = OpenAI(api_key=OPENAI_API_KEY)
 
PROJECT_NAME = "demo-video-claims-agent"
patronus.init(integrations=[OpenAIInstrumentor()], project_name=PROJECT_NAME)
log = patronus.get_logger()

We'll also define a small golden dataset in Patronus format to evaluate the agent against.

golden_dataset = [
    {
        "task_input": "New FNOL:\nPolicy: POL-2043\nDate: Aug 12, 2025\nIncident: Vehicle hail damage in Dallas, TX\nDescription: Heavy hail cracked windshield and dented hood.\nLocation: Dallas, TX",
        "gold_answer": "Final report should include policy POL-2043, 2025-08-12, Dallas. If weather=hail → Risk: medium; Coverage (non-binding): covered. Otherwise Risk: low; Coverage: uncertain. Must include a Weather line, and end with the disclaimer."
    },
    {
        "task_input": "New FNOL:\nPolicy: POL-3099\nDate: 2025-09-01\nIncident: Windstorm blew fence panels down\nDescription: Strong winds caused property damage to backyard fence.\nLocation: Tulsa, OK",
        "gold_answer": "Report lists policy POL-3099, 2025-09-01, Tulsa, incident description. Weather may be available or unavailable; if not available, a Noted failures line discloses weather_unavailable:<reason>. Risk defaults low; Coverage (non-binding) uncertain; disclaimer required."
    },
    {
        "task_input": "New FNOL:\nPolicy: AUTO-5511\nDate: Sep 3, 2025\nIncident: Hail damage during commute\nDescription: Golf-ball hail dents roof and hood.\nLocation: Aurora, CO",
        "gold_answer": "If weather returns hail: Weather: available: hail → Risk: medium; Coverage (non-binding): covered. Include incident text, date 2025-09-03, location Aurora. If weather unavailable, include Noted failures with weather_unavailable and keep Coverage uncertain. Always include disclaimer."
    },
]

1. Define a Simple Agent

Next, we'll define a few sample tools and decorate them with @traced. This ensures that every tool call's inputs and outputs are automatically logged to Patronus.

## Tools and Agent Definition
@traced(span_name="parse_claim")
def parse_claim(text: str) -> dict:
    """Grab fields from FNOL free text (super loose)."""
    def grab(label, default=""):
        m = re.search(rf"{label}:\s*(.+)", text, re.I)
        return (m.group(1).strip() if m else default)
 
    policy = grab("Policy", "POL-UNKNOWN")
    date_raw = grab("Date", "2025-08-12")
    loc = grab("Location", None)
    desc = grab("Description", grab("Incident", ""))
 
    # normalize date
    date_iso = None
    for fmt in ("%Y-%m-%d","%b %d, %Y","%b %d %Y"):
        try:
            date_iso = datetime.strptime(date_raw, fmt).date().isoformat()
            break
        except ValueError:
            pass
    if not date_iso: date_iso = "2025-08-12"
 
    return {"policy_id": policy, "date_iso": date_iso, "location": loc, "description": desc}
 
@traced(span_name="weather_lookup")
def weather_lookup(date_iso: str, location: str | None) -> dict:
    """Simulated weather (sometimes fails)."""
    r = random.random()
    if r < 0.20:
        time.sleep(0.2)
        return {"available": False, "conditions": None, "error": "timeout"}
    if r < 0.30:
        return {"available": False, "conditions": None, "error": "bad_response_format"}
    return {"available": True, "conditions": random.choice(["hail","clear skies","light rain"]), "error": None}
 
@traced(span_name="finalize_report")
def finalize_report(claim: dict, weather: dict) -> str:
    """Return a short, plain-text summary (non-binding)."""
    c, w = claim, weather
    risk, cov, notes = "low", "uncertain", []
    if w.get("available") and w.get("conditions") == "hail":
        risk, cov = "medium", "covered"
    if not w.get("available"):
        notes.append(f"weather_unavailable:{w.get('error') or 'no_data'}")
    return (
        f"Claim {c.get('policy_id')} on {c.get('date_iso')} ({c.get('location') or 'Unknown location'})\n"
        f"- Incident: {c.get('description')}\n"
        f"- Weather: {('available: '+w.get('conditions')) if w.get('available') else 'unavailable'}\n"
        f"- Risk: {risk}\n"
        f"- Coverage (non-binding): {cov}\n"
        + (f"- Noted failures: {', '.join(notes)}\n" if notes else "")
        + "Disclaimer: This is not a binding coverage determination."
    )
 
# --------------------- tool registry & schemas ---------------------
TOOL_IMPLS = {
    "parse_claim": lambda args: parse_claim(**args),
    "weather_lookup": lambda args: weather_lookup(**args),
    "finalize_report": lambda args: finalize_report(**args),
}
 
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "parse_claim",
            "description": "Parse FNOL free text into simple fields.",
            "parameters": {
                "type":"object",
                "properties":{"text":{"type":"string"}},
                "required":["text"]
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "weather_lookup",
            "description": "Simulated weather lookup for date/location.",
            "parameters": {
                "type":"object",
                "properties":{"date_iso":{"type":"string"},"location":{"type":["string","null"]}},
                "required":["date_iso"]
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "finalize_report",
            "description": "Return a plain-text claims summary for the adjuster.",
            "parameters": {
                "type":"object",
                "properties":{"claim":{"type":"object"},"weather":{"type":"object"}},
                "required":["claim","weather"]
            },
        },
    },
]
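
Before wiring these tools into the model loop, you can call them directly. Because they're decorated with @traced, these direct calls should also show up as spans in Patronus. A quick sanity check using the first golden example:

# Direct tool calls: parse the first golden FNOL, then look up its weather.
sample_claim = parse_claim(golden_dataset[0]["task_input"])
print(sample_claim)
print(weather_lookup(sample_claim["date_iso"], sample_claim["location"]))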

2. Version a Prompt

We'll define a system prompt for our agent, then push it to Patronus for versioning and pull it back down for use in the agent.

SYSTEM = """You're a tiny claims demo agent.
Call tools exactly once in this order, unless arguments are missing:
1) parse_claim(text)
2) weather_lookup(date_iso, location)
3) finalize_report(claim, weather)
Then reply with ONLY the final text report (no code fences).
"""
PROMPT_VERSION = 1
 
## Patronus Prompt Versioning
prompt = Prompt(
    name="demo/claims-agent-video/system",
    body=SYSTEM,
    description="Template for Percival claims agent walkthrough",
)
 
loaded_prompt = push_prompt(prompt)
 
# Now we can retrieve the prompt from the platform
RENDERED = load_prompt(name="demo/claims-agent-video/system", revision=PROMPT_VERSION).render()
print(RENDERED)

3. Define Agent Task and Run an Experiment

Next, we’ll define our agent’s task following the Patronus task standard.
A task is simply a function that takes in a row of golden data and returns a result.

Most of this code is boilerplate, but notice how we:

  • Add the @traced decorator so the task execution is logged,
  • Use log.info() to record the system prompt, prompt version, and input alongside the trace.

# --------------------- minimal loop (Chat Completions) ---------------------
@traced(span_name="Simple Claims Agent Run")
def run_demo(row, debug=True, **kwargs) -> str:
    messages = [
        {"role":"system","content": RENDERED},
        {"role":"user","content": row.task_input},
    ]
    log.info({
        "system_prompt" : RENDERED,
        "prompt_version" : PROMPT_VERSION,
        "input" : row.task_input,
    })
 
    for step in range(8):
        with start_span(name = f"Chat Completion Loop {step}"):
            resp = client.chat.completions.create(
                model=MODEL,
                messages=messages,
                tools=TOOLS,  # classic function-calling schema
            )
            msg = resp.choices[0].message
 
            # If the model requested tool calls, execute them and add tool results
            tool_calls = getattr(msg, "tool_calls", None) or []
            if tool_calls:
                if debug:
                    print("🔧 tool calls:", [tc.function.name for tc in tool_calls])
                messages.append({"role":"assistant","content": msg.content or "", "tool_calls": [tc.model_dump() for tc in tool_calls]})
                for tc in tool_calls:
                    name = tc.function.name
                    args = {}
                    try:
                        args = json.loads(tc.function.arguments or "{}")
                    except Exception:
                        pass
                    try:
                        result = TOOL_IMPLS[name](args)
                    except Exception as e:
                        result = {"error": f"{type(e).__name__}: {e}"}
                    # tool result must be a string
                    messages.append({
                        "role":"tool",
                        "tool_call_id": tc.id,
                        "content": json.dumps(result),
                    })
                # loop again so the model can read tool outputs
                continue
 
            # No tool calls → we should have the final text
            text = msg.content or ""
            return text.strip() or "(no text produced)"
 
    return "(stopped after 8 steps)"

Now, we can run an experiment to assess how the agent performs on the golden dataset. We'll use a simple LLM fuzzy match judge, and all traces and results will automatically populate in the Patronus platform.

# run experiment
run_experiment(
    experiment_name= "demo-percival-claims-agent",
    project_name= PROJECT_NAME,
    dataset=golden_dataset,
    task=run_demo,
    evaluators=[
        # Use a Patronus-managed evaluator
        RemoteEvaluator("judge", "patronus:fuzzy-match").load(),
    ],
    tags={"model": "gpt5", "version": "v1"}
)

4. Investigate a Trace with Percival

Looking at a trace in Patronus, we see the agent produced the correct final answer but failed on its first tool call to finalize_report.

Trace screenshot

In production, this kind of silent failure can be costly. Percival provides insights into what went wrong:

  • The model wasn't passing the correct arguments to finalize_report
  • Percival suggests clarifying the expected argument schema in the system prompt

Percival insight 1 Percival insight 2

Let’s use these insights to draft a new prompt and see whether the agent now behaves as intended.

5. Update Prompt and Rerun

Based on Percival's feedback, we'll refine the system prompt to explicitly guide the model's tool usage.

# Update system prompt with Percival's suggestion
NEW_SYSTEM = """You're a tiny claims demo agent.
Call tools in this exact sequence:
1) parse_claim(text) - this returns a claim object
2) weather_lookup(date_iso, location) - this returns a weather object  
3) finalize_report(claim, weather) - pass the EXACT claim and weather objects 
from steps 1 and 2
Then reply with ONLY the final text report (no code fences).
 
Important: When calling finalize_report, you MUST pass both the 'claim' object 
from parse_claim AND the 'weather' object from weather_lookup as arguments.
"""
 
PROMPT_VERSION = 2
 
# Push prompt to Patronus Platform
new_prompt = Prompt(
    name="demo/claims-agent-video/system",
    body=NEW_SYSTEM,
    description="Added fixes from Percival suggestions",
)
 
loaded_prompt = push_prompt(new_prompt)
 
# Pull new prompt from platform to confirm
RENDERED_V2 = load_prompt(name="demo/claims-agent-video/system", revision=PROMPT_VERSION).render()
print(RENDERED_V2)
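
To double-check what changed between the two revisions, you can pull both back down and diff them locally. A small sketch, assuming the two pushes above created revisions 1 and 2:

import difflib

# Load both revisions of the system prompt and print a unified diff.
v1 = load_prompt(name="demo/claims-agent-video/system", revision=1).render()
v2 = load_prompt(name="demo/claims-agent-video/system", revision=2).render()
print("\n".join(difflib.unified_diff(
    v1.splitlines(), v2.splitlines(), fromfile="revision 1", tofile="revision 2", lineterm=""
)))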

Now we can redefine our task with the new prompt and prompt version:

# --------------------- minimal loop (Chat Completions) ---------------------
@traced(span_name="Simple Claims Agent Run w. Prompt Fix")
def run_demo_prompt_fix(row, debug=True, **kwargs) -> str:
    messages = [
        {"role":"system","content": RENDERED_V2},
        {"role":"user","content": row.task_input},
    ]
    log.info({
        "system_prompt" : RENDERED_V2,
        "prompt_version" : PROMPT_VERSION,
        "input" : row.task_input,
    })
 
    for step in range(8):
        with start_span(name = f"Chat Completion Loop {step}"):
            resp = client.chat.completions.create(
                model=MODEL,
                messages=messages,
                tools=TOOLS,  # classic function-calling schema
            )
            msg = resp.choices[0].message
 
            # If the model requested tool calls, execute them and add tool results
            tool_calls = getattr(msg, "tool_calls", None) or []
            if tool_calls:
                if debug:
                    print("🔧 tool calls:", [tc.function.name for tc in tool_calls])
                messages.append({"role":"assistant","content": msg.content or "", "tool_calls": [tc.model_dump() for tc in tool_calls]})
                for tc in tool_calls:
                    name = tc.function.name
                    args = {}
                    try:
                        args = json.loads(tc.function.arguments or "{}")
                    except Exception:
                        pass
                    try:
                        result = TOOL_IMPLS[name](args)
                    except Exception as e:
                        result = {"error": f"{type(e).__name__}: {e}"}
                    # tool result must be a string
                    messages.append({
                        "role":"tool",
                        "tool_call_id": tc.id,
                        "content": json.dumps(result),
                    })
                # loop again so the model can read tool outputs
                continue
 
            # No tool calls → we should have the final text
            text = msg.content or ""
            return text.strip() or "(no text produced)"
 
    return "(stopped after 8 steps)"
 
# re-run experiment
run_experiment(
    experiment_name= "demo-percival-claims-agent-with-percival-fixes",
    project_name= PROJECT_NAME,
    dataset=golden_dataset,
    task=run_demo_prompt_fix,
    evaluators=[
        # Use a Patronus-managed evaluator
        RemoteEvaluator("judge", "patronus:fuzzy-match").load(),
    ],
    tags={"model": "gpt5", "version": "v2"}
)

Re-running the experiment, we now see a clean trace. No tool call errors — the agent executes the workflow correctly on the first try. Thanks, Percival!

Clean trace

Wrap Up

This flow — version → trace → analyze → update prompt → re-run — is the standard developer loop for building reliable agents with Patronus.

  • Tracing shows you exactly where things break
  • Percival analysis gives actionable fixes
  • Prompt versioning ensures iteration is measurable, repeatable, and production-ready
