Optimizing compute time for evaluations
Reduce latency and improve performance for real-time evaluation workflows
Users of our real-time monitoring product who run evaluations at scale typically have more stringent latency and volume requirements. We recommend the following strategies to optimize for scalability and reduced latency.
1. Use small-tier evaluators
Evaluator families are offered in tiers containing evaluators of different sizes. Small-tier evaluators are the most efficient and are optimized for latency-sensitive use cases.
Examples: lynx-small, custom-small, retrieval-context-relevance-small
Using small-tier evaluators can significantly reduce response times while maintaining high accuracy for most use cases.
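For illustration, here is a minimal sketch of requesting a small-tier evaluator. The endpoint URL, API-key header, and payload field names are hypothetical stand-ins for the evaluation API; only the evaluator name comes from the tier list above.

```python
import requests

API_URL = "https://api.example.com/v1/evaluate"  # placeholder endpoint

def evaluate_with_small_tier(model_input: str, model_output: str) -> dict:
    """Run a single evaluation against a small-tier evaluator."""
    response = requests.post(
        API_URL,
        headers={"X-API-KEY": "YOUR_API_KEY"},  # placeholder credential
        json={
            "evaluator": "lynx-small",  # small tier: lowest latency
            "evaluated_model_input": model_input,
            "evaluated_model_output": model_output,
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```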
2. Reduce explanation generation
By default, evaluation results include natural language explanations. To reduce latency in online workflows, we recommend limiting when explanations are generated via the explain_strategy parameter.
Available options:
- "never" - No explanations are generated (fastest)
- "on-fail" - Only generate explanations for failed evaluations (recommended balance)
- "on-success" - Only generate explanations for passed evaluations
- "always" - Generate explanations for all evaluations (slowest)
Example
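A minimal sketch of setting the parameter on a request, reusing the hypothetical endpoint and payload fields from the sketch above; the explain_strategy values themselves are the documented options.

```python
import requests

API_URL = "https://api.example.com/v1/evaluate"  # placeholder endpoint

def evaluate_with_on_fail_explanations(model_input: str, model_output: str) -> dict:
    """Only generate explanations when an evaluation fails."""
    response = requests.post(
        API_URL,
        headers={"X-API-KEY": "YOUR_API_KEY"},  # placeholder credential
        json={
            "evaluator": "lynx-small",
            "evaluated_model_input": model_input,
            "evaluated_model_output": model_output,
            # "on-fail" skips explanation generation for passing results,
            # cutting latency while keeping diagnostics where they matter.
            "explain_strategy": "on-fail",
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```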
3. Use async evaluation calls
Wrap your evaluation API calls in an async function to handle multiple evaluations concurrently and improve throughput.
Example
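A minimal sketch using asyncio with aiohttp to fan evaluation requests out concurrently; the endpoint and payload shape remain the hypothetical stand-ins used in the earlier sketches.

```python
import asyncio
import aiohttp

API_URL = "https://api.example.com/v1/evaluate"  # placeholder endpoint

async def evaluate_async(
    session: aiohttp.ClientSession, model_input: str, model_output: str
) -> dict:
    """Submit one evaluation without blocking the event loop."""
    payload = {
        "evaluator": "lynx-small",
        "evaluated_model_input": model_input,
        "evaluated_model_output": model_output,
        "explain_strategy": "on-fail",
    }
    async with session.post(
        API_URL, json=payload, headers={"X-API-KEY": "YOUR_API_KEY"}
    ) as resp:
        resp.raise_for_status()
        return await resp.json()

async def evaluate_batch(pairs: list[tuple[str, str]]) -> list[dict]:
    """Fan out a batch of evaluations concurrently and gather the results."""
    async with aiohttp.ClientSession() as session:
        tasks = [evaluate_async(session, inp, out) for inp, out in pairs]
        return await asyncio.gather(*tasks)

# Usage: results = asyncio.run(evaluate_batch([("question", "answer"), ...]))
```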
This pattern allows you to run multiple evaluations concurrently without blocking, which is especially useful when processing batches of evaluations or handling high-volume production traffic.
