Agent Observability
Agent observability in Patronus is the real-time monitoring and evaluation of end-to-end agent executions. Embedding observability and evaluation in agent executions is important because it can identify failures such as:
- Incorrect tool use
- Failure to delegate a task
- Unsatisfactory answers
- Incorrect tool outputs
Observing and evaluating agents in Patronus is the process of embedding evaluators in agent executions so that agent behaviors can be continuously monitored and analyzed in the platform.
Embedding Evaluators in Agents
The first step in evaluating an agent is to define a set of evaluators. See the Evaluators section to understand the difference between class-based and Patronus API evaluators.
Let's create an example coding agent using CrewAI. The agent calls an LLM API and retrieves a response (the example uses OpenAI, but any LLM API works just as well).
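A minimal sketch of such an agent is shown below. The tool, agent, and task definitions are illustrative, and the tool decorator's import path may differ across CrewAI versions.

```python
from crewai import Agent, Task, Crew
from crewai.tools import tool
from openai import OpenAI

openai_client = OpenAI()

@tool("Generate code")
def generate_code(task_description: str) -> str:
    """Generate Python code for the given task description."""
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a senior Python engineer."},
            {"role": "user", "content": task_description},
        ],
    )
    return response.choices[0].message.content

coding_agent = Agent(
    role="Coding Assistant",
    goal="Write correct, well-documented Python code",
    backstory="An experienced engineer who writes clean, tested code.",
    tools=[generate_code],
)

coding_task = Task(
    description="Write a function that parses a CSV file into a list of dicts.",
    expected_output="A complete Python function with a short docstring.",
    agent=coding_agent,
)

crew = Crew(agents=[coding_agent], tasks=[coding_task])
```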
Suppose we want to evaluate the helpfulness of the agent's response. To embed a Patronus evaluator, we will need to do some preparation (see the sketch after this list):
- Initialize the Python SDK
- Create an object that will safely perform background evaluations
- Instantiate a remote Patronus evaluator
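A minimal sketch of this setup is shown below. The import paths, the Patronus client class, and the evaluator family and criterion names are assumptions based on the current Python SDK, so check the SDK documentation for your installed version.

```python
import patronus
from patronus.evals import RemoteEvaluator
from patronus.pat_client import Patronus

# 1. Initialize the Python SDK (reads PATRONUS_API_KEY from the environment).
patronus.init()

# 2. Create a client object that can safely run evaluations in the background.
patronus_client = Patronus()

# 3. Instantiate a remote Patronus evaluator.
#    The evaluator family and criterion names here are illustrative;
#    use ones configured in your Patronus account.
helpfulness = RemoteEvaluator("judge", "patronus:is-helpful")
```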
We can embed this evaluator in the tool call, right after the generation:
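Building on the earlier sketch, the tool might look like the following; the keyword arguments passed to evaluate_bg() are assumptions based on the current Python SDK and may differ in your installed version.

```python
# Revised tool: generate code, then fire off a background evaluation.
@tool("Generate code")
def generate_code(task_description: str) -> str:
    """Generate Python code and evaluate the response in the background."""
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are a senior Python engineer."},
            {"role": "user", "content": task_description},
        ],
    )
    output = response.choices[0].message.content

    # Kick off a background evaluation; the tool returns without waiting.
    # Argument names are illustrative and may vary by SDK version.
    patronus_client.evaluate_bg(
        evaluators=[helpfulness],
        task_input=task_description,
        task_output=output,
    )

    return output
```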
In this implementation, we're using the patronus_client.evaluate_bg() method to evaluate the LLM's output with our helpfulness evaluator. The evaluation happens in the background without blocking the tool's response, allowing the agent to continue its execution while the evaluation is processed.
You may choose to use patronus_client.evaluate() instead if you want guardrail-style behavior, where the tool waits for the evaluation result and can act on it before responding.
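A blocking variant might look like the sketch below; the way the result is inspected (all_succeeded()) is an assumption based on the SDK documentation and may need adjusting for your version.

```python
# Guardrail-style variant: wait for the evaluation and act on the result
# before returning the output to the agent.
result = patronus_client.evaluate(
    evaluators=[helpfulness],
    task_input=task_description,
    task_output=output,
)
if not result.all_succeeded():
    # The response failed the helpfulness check; handle it before returning,
    # e.g. regenerate the answer or fall back to a safe message.
    output = "I could not produce a sufficiently helpful answer. Please rephrase the task."
```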
Visit the Python SDK documentation to learn more about various ways to define and use evaluators in your applications.
Now you can run the agent with crewai run, and you will see evaluation results populated in real time in Logs.
That's it! Each agent execution that is triggered will now also log outputs and evaluations to the Patronus logs dashboard.
Embedding evaluators in agent executions enables agent behaviors to be continuously monitored and analyzed on the platform. You can send alerts on failed agent outputs, filter for interesting examples and add them to your test data, and retry the agent response when failures occur. The possibilities are endless!