
Agent Observability

Agent observability in Patronus is the real-time monitoring and evaluation of end-to-end agent executions. Embedding observability and evaluation in agent executions is important because it can surface failures such as:

  • Incorrect tool use
  • Failure to delegate a task
  • Unsatisfactory answers
  • Incorrect tool outputs

Observing and evaluating agents in Patronus is the process of embedding evaluators in agent executions so that agent behaviors can be continuously monitored and analyzed in the platform.

Embedding Evaluators in Agents

The first step in evaluating an agent is to define a set of evaluators. See the Evaluators section to understand the difference between class-based and Patronus API evaluators.
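
To make the distinction concrete, here is an illustrative sketch of the two flavors. The remote evaluator mirrors the one used later in this guide; the plain Python function is only a stand-in for a custom check, since the exact way class-based evaluators are defined and registered with the SDK is covered in the Evaluators section.

import os

import patronus

patronus.init(
    # Optional, can also be set via environment variable or config file
    api_key=os.environ.get("PATRONUS_API_KEY")
)

# Patronus API (remote) evaluator, managed and run on the Patronus platform.
ev_is_concise = patronus.RemoteEvaluator(
    "judge",
    "patronus:is-concise"
)


# Illustrative local check - a simple stand-in for a class-based evaluator.
def is_short_enough(task_output: str) -> bool:
    """Return True if the agent response stays under a rough word budget."""
    return len(task_output.split()) <= 200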

Let's create an example coding agent using CrewAI. The agent calls an LLM API and retrieves a response (the example uses OpenAI, but any LLM API works the same way).

tool.py
from typing import ClassVar

from crewai_tools import BaseTool
from openai import OpenAI


class APICallTool(BaseTool):
    name: str = "OpenAI Tool"
    description: str = (
        "This tool calls the LLM API with a prompt and an optional system prompt. "
        "This function returns the response from the API."
    )

    client: ClassVar = OpenAI()

    def _run(self, system_prompt: str = None, prompt: str = None) -> str:
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": [
                        {
                            "type": "text",
                            "text": system_prompt
                        }
                    ]
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt,
                        }
                    ]
                }
            ],
            max_tokens=4095,
            temperature=1,
            top_p=1,
            response_format={
                "type": "text"
            },
        )
        return response.choices[0].message.content
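
To run this tool end to end, it needs to be attached to a CrewAI agent and crew. The following is a minimal sketch of that wiring; the role, goal, backstory, and task text are placeholder values for illustration, not part of the Patronus integration.

from crewai import Agent, Crew, Task

from tool import APICallTool  # the tool defined above in tool.py

# Placeholder agent and task definitions for a simple coding assistant.
coding_agent = Agent(
    role="Coding assistant",
    goal="Answer coding questions by calling the LLM API tool",
    backstory="A helpful assistant that writes and explains code.",
    tools=[APICallTool()],
)

coding_task = Task(
    description="Write a Python function that reverses a string.",
    expected_output="A short, working Python snippet with a brief explanation.",
    agent=coding_agent,
)

crew = Crew(agents=[coding_agent], tasks=[coding_task])
result = crew.kickoff()
print(result)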

Suppose we want to evaluate the conciseness of the agent's responses. To embed a Patronus evaluator, we need to do some preparation:

  1. Initialize the Python SDK.
  2. Create a client object that will safely perform background evaluations.
  3. Instantiate a remote Patronus evaluator.
tool.py
import os
import patronus
 
patronus.init(
    # Optional, can also be set via environment variable or config file
    api_key=os.environ.get("PATRONUS_API_KEY")
)
 
patronus_client = patronus.Patronus()
 
ev_is_concise = patronus.RemoteEvaluator(
    "judge",
    "patronus:is-concise"
)
patronus_client.evaluate_bg(
    evaluators=[ev_is_concise],
    task_output="YOUR AGENT OUTPUT",
)

We can embed this evaluator in the tool call, right after the generation:

tool.py
import os
 
from crewai_tools import BaseTool
from openai import OpenAI
import patronus
 
patronus.init(
    # Optional, can also be set via environment variable or config file
    api_key=os.environ.get("PATRONUS_API_KEY")
)
 
oai = OpenAI()
patronus_client = patronus.Patronus()
 
ev_is_concise = patronus.RemoteEvaluator(
    "judge",
    "patronus:is-concise"
)
 
 
class APICallTool(BaseTool):
    name: str = "OpenAI Tool"
    description: str = (
        "This tool calls the LLM API with a prompt and an optional system prompt. "
        "This function returns the response from the API."
    )
 
    def _run(self, system_prompt: str = None, prompt: str = None) -> str:
        response = oai.chat.completions.create(
            model="gpt-4o",
            messages=[
                {
                    "role": "system",
                    "content": [
                        {
                            "type": "text",
                            "text": system_prompt
                        }
                    ]
                },
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "text",
                            "text": prompt,
                        }
                    ]
                }
            ],
            max_tokens=4095,
            temperature=1,
            top_p=1,
            response_format={
                "type": "text"
            },
        )
        llm_output = response.choices[0].message.content
        patronus_client.evaluate_bg(
            evaluators=[ev_is_concise],
            task_output=llm_output,
        )
        return llm_output

In this implementation, we're using the patronus_client.evaluate_bg() method to evaluate the LLM's output with our conciseness evaluator. The evaluation happens in the background without blocking the tool's response, allowing the agent to continue its execution while the evaluation is processed. If you want the evaluation to act as a guardrail, blocking until results are available so the tool can react to failures before returning, use patronus_client.evaluate() instead.
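
As a sketch, a guardrail-style variant would swap evaluate_bg() for evaluate() at the end of _run so the tool waits for results before returning. How the returned results are inspected is an assumption here; see the Python SDK documentation for the exact result shape.

        llm_output = response.choices[0].message.content
        # Blocks until evaluation results are available.
        results = patronus_client.evaluate(
            evaluators=[ev_is_concise],
            task_output=llm_output,
        )
        # Placeholder: inspect `results` and react to failures, for example by
        # raising an error or re-prompting the LLM, using the result fields
        # documented in the Python SDK.
        return llm_output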

Visit the Python SDK documentation to learn more about various ways to define and use evaluators in your applications.

Now you can run the agent with crewai run, and you will see evaluation results populated in real time in Logs.

That's it! Now each agent execution that is triggered will also log outputs and evals to the Patronus logs dashboard.

Embedding evaluators in agent executions enables agent behaviors to be continuously monitored and analyzed in the platform. You can send alerts on failed agent outputs, filter for interesting examples and add them to your testing data, and retry the agent response when there are failures. The possibilities are endless!
