Tracing and Debugging Agents with Percival
For developers and product teams working with agentic systems, one of the biggest challenges is observability. Teams need to know:
Are my prompts effective?
Is the agent using the tools properly?
Where might the agent be failing in unexpected ways?
For developers, Patronus provides full visibility into agent behavior — including outputs, tool calls, and failure points.
For product teams, Patronus makes it easy to manage prompts across multiple agents, improve accuracy, and enforce observability as you scale into production.
Patronus provides:
Tracing for end-to-end visibility into agent runs
Prompt versioning & testing to iterate and deploy changes safely
Percival, an agentic debugger that flags failures (hallucinations, retrieval errors, tool misuse) and suggests fixes
In this walkthrough, we'll apply these capabilities to a simple insurance claims agent. By the end, you'll know how to:
Trace an agent run end-to-end with Patronus
Version a system prompt and pull it back into your agent
Run an experiment against a small golden dataset
Use Percival's suggestions to fix tool-use failures and re-run
For this example, we'll use the OpenAI Python SDK (a minimal Chat Completions function-calling loop) along with Patronus. Start by importing the required packages, initializing a Patronus project, and instrumenting the OpenAI client so all requests are traced.
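Here's a minimal setup sketch. The Patronus import paths, the logger helper, and the project and model names below are assumptions that may vary across SDK versions, so treat it as a starting point rather than canonical setup code.

```python
# Minimal setup sketch. Import paths, the logger helper, and the names below
# (PROJECT_NAME, MODEL) are assumptions; check your installed Patronus SDK version.
import json
import random
import re
import time
from datetime import datetime

from openai import OpenAI

import patronus
from patronus import traced, start_span                        # tracing decorator / manual spans
from patronus.prompts import Prompt, push_prompt, load_prompt  # assumed module path
from patronus.experiments import run_experiment                # assumed module path
from patronus.evals import RemoteEvaluator                     # assumed module path

PROJECT_NAME = "percival-claims-agent-demo"   # hypothetical project name
MODEL = "gpt-5"                               # model referenced by the experiment tags

# Initialize Patronus so decorated functions and spans are traced to this project.
# If you instrument the OpenAI client (e.g. with an OpenTelemetry/openinference
# instrumentor), wire it up here so every LLM request is captured as well.
patronus.init(project_name=PROJECT_NAME)

log = patronus.get_logger()   # assumed helper; used below as log.info(...)
client = OpenAI()             # reads OPENAI_API_KEY from the environment
```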
We'll also define a small golden dataset in Patronus format to evaluate the agent against.
```python
golden_dataset = [
    {
        "task_input": "New FNOL:\nPolicy: POL-2043\nDate: Aug 12, 2025\nIncident: Vehicle hail damage in Dallas, TX\nDescription: Heavy hail cracked windshield and dented hood.\nLocation: Dallas, TX",
        "gold_answer": "Final report should include policy POL-2043, 2025-08-12, Dallas. If weather=hail → Risk: medium; Coverage (non-binding): covered. Otherwise Risk: low; Coverage: uncertain. Must include a Weather line, and end with the disclaimer.",
    },
    {
        "task_input": "New FNOL:\nPolicy: POL-3099\nDate: 2025-09-01\nIncident: Windstorm blew fence panels down\nDescription: Strong winds caused property damage to backyard fence.\nLocation: Tulsa, OK",
        "gold_answer": "Report lists policy POL-3099, 2025-09-01, Tulsa, incident description. Weather may be available or unavailable; if not available, a Noted failures line discloses weather_unavailable:<reason>. Risk defaults low; Coverage (non-binding) uncertain; disclaimer required.",
    },
    {
        "task_input": "New FNOL:\nPolicy: AUTO-5511\nDate: Sep 3, 2025\nIncident: Hail damage during commute\nDescription: Golf-ball hail dents roof and hood.\nLocation: Aurora, CO",
        "gold_answer": "If weather returns hail: Weather: available: hail → Risk: medium; Coverage (non-binding): covered. Include incident text, date 2025-09-03, location Aurora. If weather unavailable, include Noted failures with weather_unavailable and keep Coverage uncertain. Always include disclaimer.",
    },
]
```
Next, we'll define a few sample tools and decorate them with @traced.
This ensures that every tool call's inputs and outputs are automatically logged to Patronus.
```python
## Tools and Agent Definition

@traced(span_name="parse_claim")
def parse_claim(text: str) -> dict:
    """Grab fields from FNOL free text (super loose)."""
    def grab(label, default=""):
        m = re.search(rf"{label}:\s*(.+)", text, re.I)
        return (m.group(1).strip() if m else default)

    policy = grab("Policy", "POL-UNKNOWN")
    date_raw = grab("Date", "2025-08-12")
    loc = grab("Location", None)
    desc = grab("Description", grab("Incident", ""))

    # normalize date
    date_iso = None
    for fmt in ("%Y-%m-%d", "%b %d, %Y", "%b %d %Y"):
        try:
            date_iso = datetime.strptime(date_raw, fmt).date().isoformat()
            break
        except ValueError:
            pass
    if not date_iso:
        date_iso = "2025-08-12"
    return {"policy_id": policy, "date_iso": date_iso, "location": loc, "description": desc}


@traced(span_name="weather_lookup")
def weather_lookup(date_iso: str, location: str | None) -> dict:
    """Simulated weather (sometimes fails)."""
    r = random.random()
    if r < 0.20:
        time.sleep(0.2)
        return {"available": False, "conditions": None, "error": "timeout"}
    if r < 0.30:
        return {"available": False, "conditions": None, "error": "bad_response_format"}
    return {"available": True, "conditions": random.choice(["hail", "clear skies", "light rain"]), "error": None}


@traced(span_name="finalize_report")
def finalize_report(claim: dict, weather: dict) -> str:
    """Return a short, plain-text summary (non-binding)."""
    c, w = claim, weather
    risk, cov, notes = "low", "uncertain", []
    if w.get("available") and w.get("conditions") == "hail":
        risk, cov = "medium", "covered"
    if not w.get("available"):
        notes.append(f"weather_unavailable:{w.get('error') or 'no_data'}")
    return (
        f"Claim {c.get('policy_id')} on {c.get('date_iso')} ({c.get('location') or 'Unknown location'})\n"
        f"- Incident: {c.get('description')}\n"
        f"- Weather: {('available: ' + w.get('conditions')) if w.get('available') else 'unavailable'}\n"
        f"- Risk: {risk}\n"
        f"- Coverage (non-binding): {cov}\n"
        + (f"- Noted failures: {', '.join(notes)}\n" if notes else "")
        + "Disclaimer: This is not a binding coverage determination."
    )


# --------------------- tool registry & schemas ---------------------
TOOL_IMPLS = {
    "parse_claim": lambda args: parse_claim(**args),
    "weather_lookup": lambda args: weather_lookup(**args),
    "finalize_report": lambda args: finalize_report(**args),
}

TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "parse_claim",
            "description": "Parse FNOL free text into simple fields.",
            "parameters": {
                "type": "object",
                "properties": {"text": {"type": "string"}},
                "required": ["text"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "weather_lookup",
            "description": "Simulated weather lookup for date/location.",
            "parameters": {
                "type": "object",
                "properties": {"date_iso": {"type": "string"}, "location": {"type": ["string", "null"]}},
                "required": ["date_iso"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "finalize_report",
            "description": "Return a plain-text claims summary for the adjuster.",
            "parameters": {
                "type": "object",
                "properties": {"claim": {"type": "object"}, "weather": {"type": "object"}},
                "required": ["claim", "weather"],
            },
        },
    },
]
```
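With the tools defined, a quick local smoke test (not part of the original walkthrough) shows what each traced function returns; the same calls will also show up as spans in Patronus.

```python
# Quick local smoke test: exercise the tools directly on the first golden row.
# Because each tool is decorated with @traced, these calls appear in the Patronus UI.
sample_text = golden_dataset[0]["task_input"]
claim = parse_claim(sample_text)
weather = weather_lookup(claim["date_iso"], claim["location"])
print(finalize_report(claim, weather))
```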
We'll define a system prompt for our agent, then push it to Patronus for versioning and pull it back down for use in the agent.
```python
SYSTEM = """You're a tiny claims demo agent.
Call tools exactly once in this order, unless arguments are missing:
1) parse_claim(text)
2) weather_lookup(date_iso, location)
3) finalize_report(claim, weather)
Then reply with ONLY the final text report (no code fences)."""

PROMPT_VERSION = 1

## Patronus Prompt Versioning
prompt = Prompt(
    name="demo/claims-agent-video/system",
    body=SYSTEM,
    description="Template for Percival claims agent walkthrough",
)
loaded_prompt = push_prompt(prompt)

# Now we can retrieve the prompt from the platform
RENDERED = load_prompt(name="demo/claims-agent-video/system", revision=PROMPT_VERSION).render()
print(RENDERED)
```
Next, we’ll define our agent’s task following the Patronus task standard.
A task is simply a function that takes in a row of golden data and returns a result.
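As a toy illustration of that contract (hypothetical, just to show the shape):

```python
# Toy illustration of the task contract: take a dataset row, return a string result.
def echo_task(row, **kwargs) -> str:
    return f"echo: {row.task_input}"
```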
Most of the full task implementation below is boilerplate, but notice how we:
Add the @traced decorator so the task execution is logged,
Use log.info() to track inputs, outputs, and prompt versions.
```python
# --------------------- minimal loop (Chat Completions) ---------------------
@traced(span_name="Simple Claims Agent Run")
def run_demo(row, debug=True, **kwargs) -> str:
    messages = [
        {"role": "system", "content": RENDERED},
        {"role": "user", "content": row.task_input},
    ]
    log.info({
        "system_prompt": RENDERED,
        "prompt_version": PROMPT_VERSION,
        "input": row.task_input,
    })

    for step in range(8):
        with start_span(name=f"Chat Completion Loop {step}"):
            resp = client.chat.completions.create(
                model=MODEL,
                messages=messages,
                tools=TOOLS,  # classic function-calling schema
            )
            msg = resp.choices[0].message

            # If the model requested tool calls, execute them and add tool results
            tool_calls = getattr(msg, "tool_calls", None) or []
            if tool_calls:
                if debug:
                    print("🔧 tool calls:", [tc.function.name for tc in tool_calls])
                messages.append({
                    "role": "assistant",
                    "content": msg.content or "",
                    "tool_calls": [tc.model_dump() for tc in tool_calls],
                })
                for tc in tool_calls:
                    name = tc.function.name
                    args = {}
                    try:
                        args = json.loads(tc.function.arguments or "{}")
                    except Exception:
                        pass
                    try:
                        result = TOOL_IMPLS[name](args)
                    except Exception as e:
                        result = {"error": f"{type(e).__name__}: {e}"}
                    # tool result must be a string
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tc.id,
                        "content": json.dumps(result),
                    })
                # loop again so the model can read tool outputs
                continue

            # No tool calls → we should have the final text
            text = msg.content or ""
            return text.strip() or "(no text produced)"

    return "(stopped after 8 steps)"
```
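Before launching the full experiment, it can help to sanity-check the task on a single golden row. The SimpleNamespace wrapper below is only a stand-in for the row objects the experiment runner builds from the dataset.

```python
# Optional sanity check (not in the original walkthrough): run the task on one row.
from types import SimpleNamespace

row = SimpleNamespace(**golden_dataset[0])
print(run_demo(row))
```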
Now, we can run an experiment to assess how the agent performs on the golden dataset.
We'll use a simple LLM fuzzy match judge, and all traces and results will automatically populate in the Patronus platform.
```python
# run experiment
run_experiment(
    experiment_name="demo-percival-claims-agent",
    project_name=PROJECT_NAME,
    dataset=golden_dataset,
    task=run_demo,
    evaluators=[
        # Use a Patronus-managed evaluator
        RemoteEvaluator("judge", "patronus:fuzzy-match").load(),
    ],
    tags={"model": "gpt5", "version": "v1"},
)
```
Based on Percival's feedback, we'll refine the system prompt to explicitly guide the model's tool usage.
```python
# Update system prompt w/ Percival suggestion!
NEW_SYSTEM = """You're a tiny claims demo agent.
Call tools in this exact sequence:
1) parse_claim(text) - this returns a claim object
2) weather_lookup(date_iso, location) - this returns a weather object
3) finalize_report(claim, weather) - pass the EXACT claim and weather objects from steps 1 and 2
Then reply with ONLY the final text report (no code fences).
Important: When calling finalize_report, you MUST pass both the 'claim' object from parse_claim AND the 'weather' object from weather_lookup as arguments."""

PROMPT_VERSION = 2

# Push prompt to Patronus Platform
new_prompt = Prompt(
    name="demo/claims-agent-video/system",
    body=NEW_SYSTEM,
    description="Added fixes from Percival suggestions",
)
loaded_prompt = push_prompt(new_prompt)

# Pull new prompt from platform to confirm
RENDERED_V2 = load_prompt(name="demo/claims-agent-video/system", revision=PROMPT_VERSION).render()
print(RENDERED_V2)
```
Now we can redefine our task with the new prompt and prompt version:
```python
# --------------------- minimal loop (Chat Completions) ---------------------
@traced(span_name="Simple Claims Agent Run w. Prompt Fix")
def run_demo_prompt_fix(row, debug=True, **kwargs) -> str:
    messages = [
        {"role": "system", "content": RENDERED_V2},
        {"role": "user", "content": row.task_input},
    ]
    log.info({
        "system_prompt": RENDERED_V2,
        "prompt_version": PROMPT_VERSION,
        "input": row.task_input,
    })

    for step in range(8):
        with start_span(name=f"Chat Completion Loop {step}"):
            resp = client.chat.completions.create(
                model=MODEL,
                messages=messages,
                tools=TOOLS,  # classic function-calling schema
            )
            msg = resp.choices[0].message

            # If the model requested tool calls, execute them and add tool results
            tool_calls = getattr(msg, "tool_calls", None) or []
            if tool_calls:
                if debug:
                    print("🔧 tool calls:", [tc.function.name for tc in tool_calls])
                messages.append({
                    "role": "assistant",
                    "content": msg.content or "",
                    "tool_calls": [tc.model_dump() for tc in tool_calls],
                })
                for tc in tool_calls:
                    name = tc.function.name
                    args = {}
                    try:
                        args = json.loads(tc.function.arguments or "{}")
                    except Exception:
                        pass
                    try:
                        result = TOOL_IMPLS[name](args)
                    except Exception as e:
                        result = {"error": f"{type(e).__name__}: {e}"}
                    # tool result must be a string
                    messages.append({
                        "role": "tool",
                        "tool_call_id": tc.id,
                        "content": json.dumps(result),
                    })
                # loop again so the model can read tool outputs
                continue

            # No tool calls → we should have the final text
            text = msg.content or ""
            return text.strip() or "(no text produced)"

    return "(stopped after 8 steps)"


# re-run experiment
run_experiment(
    experiment_name="demo-percival-claims-agent-with-percival-fixes",
    project_name=PROJECT_NAME,
    dataset=golden_dataset,
    task=run_demo_prompt_fix,
    evaluators=[
        # Use a Patronus-managed evaluator
        RemoteEvaluator("judge", "patronus:fuzzy-match").load(),
    ],
    tags={"model": "gpt5", "version": "v2"},
)
```
Re-running the experiment, we now see a clean trace.
No tool call errors — the agent executes the workflow correctly on the first try. Thanks, Percival!