Patronus Evaluators

Patronus Evaluators are powerful pre-built evaluators that run on Patronus infrastructure. Each Patronus Evaluator produces an automated, independent assessment of an AI system's performance against a pre-defined requirement. Patronus Evaluators are industry-leading in accuracy and outperform alternatives on internal and external benchmarks.

How to use Patronus Evaluators

Patronus evaluators can be called via the Python SDK or the Patronus REST API. To use the Python SDK, first install the library with pip install patronus.
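
A minimal setup sketch follows; it assumes init() picks up your Patronus API key from your environment or SDK configuration rather than taking it inline (see the Python SDK docs for the exact options):

# Install the SDK first:
#   pip install patronus
from patronus import init

# Assumption: init() with no arguments reads the API key from your
# environment/configuration; the examples on this page call it this way.
init()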

Using RemoteEvaluator (SDK)

The easiest way to use Patronus evaluators in the SDK is through the RemoteEvaluator class:

from patronus import init
from patronus.evals import RemoteEvaluator
 
init()
 
# Create a hallucination detector
hallucination_checker = RemoteEvaluator(
    "lynx",
    "patronus:hallucination",
    explain_strategy="always"  # Control when explanations are generated
)
 
result = hallucination_checker.evaluate(
    task_input="What is the largest animal in the world?",
    task_output="The giant sandworm is the largest animal.",
    task_context="The blue whale is the largest known animal."
)
result.pretty_print()

Using the Evaluation API

You can call Patronus evaluators directly via the REST API using Python, TypeScript, or cURL. For example, with Python's requests library:

import requests
 
url = "https://api.patronus.ai/v1/evaluate"
payload = {
  "evaluators": [
    {
      "evaluator": "lynx",
      "criteria": "patronus:hallucination",
      "explain_strategy": "always"
    }
  ],
  "evaluated_model_retrieved_context": [
    "The blue whale is the largest known animal."
  ],
  "evaluated_model_input": "What is the largest animal in the world?",
  "evaluated_model_output": "The giant sandworm.",
  "tags": {"scenario": "onboarding"}
}
headers = {
  "X-API-KEY": "YOUR_API_KEY"
}
response = requests.post(url, json=payload, headers=headers)
print(response.text)

This will produce an evaluation result containing the PASS/FAIL output, raw score, explanation (optional), and associated metadata.
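
If you prefer structured data over raw text, you can parse the JSON body of the response with the standard requests helpers; a short sketch continuing the example above (see Evaluation Results below for what the result contains):

# Fail fast on HTTP errors, then decode the JSON body instead of printing raw text
response.raise_for_status()
result = response.json()
print(result)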

Controlling Explanations

Explanations are justifications attached to evaluation results, typically generated by an LLM. Patronus evaluators support explanations by default, with options to control when they're generated.

The explain_strategy parameter controls when explanations are generated:

  • "never": No explanations are generated for any evaluation results
  • "on-fail": Only generates explanations for failed evaluations
  • "on-success": Only generates explanations for passed evaluations
  • "always" (default): Generates explanations for all evaluations

# Only generate explanations for failed evaluations
factual_checker = RemoteEvaluator(
    "lynx",
    "patronus:factual-accuracy",
    explain_strategy="on-fail"  # Only explain failures
)
 
# Never generate explanations (fastest option)
conciseness_checker = RemoteEvaluator(
    "judge",
    "patronus:conciseness",
    explain_strategy="never"  # No explanations
)

Performance Note: To reduce latency in production environments, use either explain_strategy="never" or explain_strategy="on-fail" to limit the number of explanations generated.

See Explanations for more details.

Using in Experiments

Remote evaluators integrate seamlessly with Patronus experiments:

from patronus import init
from patronus.evals import RemoteEvaluator
from patronus.experiments import run_experiment
 
init()
 
fuzzy_match = RemoteEvaluator("judge-small", "patronus:fuzzy-match")
exact_match = RemoteEvaluator("exact-match", "patronus:exact-match")
 
# Run an experiment with remote evaluators
experiment = run_experiment(
    dataset=dataset,
    task=my_task,
    evaluators=[fuzzy_match, exact_match],
)
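
The dataset and my_task names above are placeholders. A hypothetical sketch of what they could look like is shown below; the exact dataset format and task signature accepted by run_experiment are assumptions here, so check the Python SDK docs for the authoritative interface:

# Hypothetical dataset: a list of records carrying the fields used by the evaluators
dataset = [
    {"task_input": "What is the largest animal in the world?"},
    {"task_input": "How big is a blue whale?"},
]

# Hypothetical task: receives a dataset row and returns the model's output
def my_task(row, **kwargs):
    # A real task would call your LLM; this stub just echoes the question.
    # Attribute-style access to row fields is also an assumption.
    return f"Answering: {row.task_input}"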

Using with Tracing

Remote evaluators integrate seamlessly with Patronus tracing:

from patronus import init, traced
from patronus.evals import RemoteEvaluator
 
init()
 
@traced()
def generate_response(query: str) -> str:
    """Generate a response to a user query."""
    # In a real application, this would call an LLM
    return "The blue whale can grow up to 100 feet long and weigh 200 tons."
 
@traced()
def process_query(query: str):
    """Process a user query with evaluation."""
    response = generate_response(query)
 
    # Use a remote evaluator
    fact_checker = RemoteEvaluator(
        "lynx",
        "patronus:factual-accuracy",
        explain_strategy="on-fail"  # Only explain failures
    )
 
    # Evaluate the response
    result = fact_checker.evaluate(
        task_input=query,
        task_output=response
    )
 
    return {
        "query": query,
        "response": response,
        "factually_accurate": result.pass_,
        "accuracy_score": result.score,
        "explanation": result.explanation if not result.pass_ else "Passed"
    }
 
# Process a query
response = process_query("How big is a blue whale?")

Evaluation Results

Evaluators execute evaluations to produce evaluation results. An Evaluation Result consists of the following fields:

  • Pass result: All evaluators return a PASS/FAIL result. Filtering on the pass result lets you focus only on failures, for example.
  • Raw score: The raw score indicates the confidence of the evaluation, normalized to the range 0-1.
  • Explanation: A natural language justification of why the result is PASS or FAIL.
  • Additional Info (optional): Additional information attached to the evaluation result, such as highlighted spans.

Additionally, evaluation results contain metadata to help you track and diagnose issues.

  • Evaluator: The evaluator that was used to produce the evaluation.
  • Tags: You can provide a dictionary of key-value pairs in the API call to tag evaluations with metadata, such as the model version. You can filter results by these key-value pairs in the UI.
  • Experiment ID: The ID of the experiment associated with the evaluation, if available.
  • Dataset ID: The ID of the dataset, if provided.
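
With the Python SDK, the pass result, raw score, and explanation are available as attributes on the object returned by evaluate(), using the same names as in the tracing example above:

# Reusing the hallucination_checker defined earlier on this page
result = hallucination_checker.evaluate(
    task_input="What is the largest animal in the world?",
    task_output="The giant sandworm is the largest animal.",
    task_context="The blue whale is the largest known animal.",
)
print(result.pass_)        # PASS/FAIL result
print(result.score)        # raw score, normalized 0-1
print(result.explanation)  # explanation, if one was generated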
