Chaining Evaluations

Evaluation chaining lets you build sequential pipelines in which each step can use the results of the steps before it. This is particularly useful when you need to:

  • Process model outputs through multiple stages
  • Make evaluation decisions based on previous results
  • Create complex evaluation workflows that depend on earlier outcomes

Basic Chain Configuration

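The example below sets up a two-link chain. It assumes that dataset and the task and evaluator functions (agent_sql_generator, eval_sql_syntax, and so on) are already defined: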

from patronus import Client

client = Client()

client.experiment(
    "Tutorial Project",
    dataset=dataset,
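    chain=[
        # Link 1: generate SQL, then check its syntax and scan it for injection
        {"task": agent_sql_generator, "evaluators": [eval_sql_syntax, detect_sql_injection]},
        # Link 2: execute the SQL, then check the output
        {"task": agent_sql_executor, "evaluators": [eval_output_correctness]},
    ],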
)

Chain Execution Flow

  1. Links in the chain are executed sequentially
  2. Within each link:
    1. First, the task is executed
    2. If the task returns None, the chain execution stops for this dataset row
    3. If the task returns a result, all evaluators for this link are executed concurrently
  3. After all evaluators in a link complete, execution moves to the next link
  4. If any task raises an exception, the chain execution stops for this dataset row
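The flow above can be sketched in plain Python. This is only an illustration of the ordering and stop conditions, not the SDK's implementation: run_chain_for_row, the dict-based parent, and the toy tasks and evaluators are all hypothetical names invented for this sketch.

```python
import asyncio
from typing import Optional


async def run_chain_for_row(row: dict, chain: list[dict]) -> list[dict]:
    """Run the links in order for one row; return the links that completed."""
    completed: list[dict] = []
    parent: Optional[dict] = None
    for link in chain:
        try:
            output = link["task"](row, parent)
        except Exception:
            break  # a task raised: stop the chain for this row
        if output is None:
            break  # a task returned None: stop the chain for this row
        # All evaluators of this link run concurrently on the task output.
        verdicts = await asyncio.gather(*(ev(row, output) for ev in link["evaluators"]))
        parent = {"output": output, "evals": list(verdicts)}
        completed.append(parent)
    return completed


# Toy tasks and evaluators (hypothetical, for illustration only).
def generate_sql(row, parent):
    return f"SELECT * FROM {row['table']}" if row.get("table") else None


def execute_sql(row, parent):
    return f"ran: {parent['output']}"


async def looks_like_select(row, output):
    return output.upper().startswith("SELECT")


async def non_empty(row, output):
    return bool(output)


chain = [
    {"task": generate_sql, "evaluators": [looks_like_select, non_empty]},
    {"task": execute_sql, "evaluators": [non_empty]},
]

full = asyncio.run(run_chain_for_row({"table": "users"}, chain))
stopped = asyncio.run(run_chain_for_row({}, chain))  # first task returns None
```

In this sketch the row without a table never reaches the second link, which mirrors step 2.2 above.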

Accessing Previous Results

Tasks and evaluators in the chain can access results from previous links using the parent parameter. Here's an example:

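# Assumed import path; where these names live may differ between SDK versions
from patronus import EvalParent, Row, evaluator, task

@evaluator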
def parent_evaluator(row: Row):
    ...

@task
def second_task(row: Row, parent: EvalParent) -> str | None:
    if not parent:
        return None  # No parent means we're not in a chain

    # Access previous task's output
    previous_output = parent.task.evaluated_model_output if parent.task else None

    # Access specific evaluator's result from previous link
    parent_eval = parent.find_eval_result(parent_evaluator)
    if parent_eval and not parent_eval.pass_:
        return None  # Stop chain if parent evaluation failed

    # Process data using results from previous link
    # (process_data is a placeholder for your own function)
    return process_data(previous_output)
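One way to check this kind of gating logic without running an experiment is to call it with small stand-in objects. The sketch below assumes nothing about the SDK beyond the attributes used in the example above (task, evaluated_model_output, find_eval_result, pass_). StubParent, StubTaskResult, StubEvalResult, and second_task_logic are hypothetical names, the name-based lookup is an invention of the stub, and upper() stands in for process_data.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class StubEvalResult:
    pass_: bool


@dataclass
class StubTaskResult:
    evaluated_model_output: str


@dataclass
class StubParent:
    task: Optional[StubTaskResult] = None
    evals: dict = field(default_factory=dict)  # evaluator name -> StubEvalResult

    def find_eval_result(self, evaluator: Callable) -> Optional[StubEvalResult]:
        # Stub behaviour only: look the result up by the evaluator's function name.
        return self.evals.get(evaluator.__name__)


def parent_evaluator(row):
    ...  # placeholder; only its name matters for the stub lookup


def second_task_logic(row: dict, parent: Optional[StubParent]) -> Optional[str]:
    if not parent:
        return None  # not running inside a chain
    previous_output = parent.task.evaluated_model_output if parent.task else None
    parent_eval = parent.find_eval_result(parent_evaluator)
    if parent_eval and not parent_eval.pass_:
        return None  # parent evaluation failed: stop the chain
    return previous_output.upper() if previous_output else None


failed = StubParent(
    task=StubTaskResult("select 1"),
    evals={"parent_evaluator": StubEvalResult(pass_=False)},
)
passed = StubParent(
    task=StubTaskResult("select 1"),
    evals={"parent_evaluator": StubEvalResult(pass_=True)},
)

no_parent_result = second_task_logic({}, None)
failed_result = second_task_logic({}, failed)
passed_result = second_task_logic({}, passed)
```

The three calls cover the three branches: no parent, a failed parent evaluation, and a passing one.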

Best Practices

  1. Early Termination: Return None from a task to stop the chain for the current dataset row when further processing would be unnecessary or invalid.
  2. Result Propagation: Put the information later links need into task results and evaluation results, so that subsequent links can read it through the parent parameter.
  3. Error Handling: Implement retry mechanisms for unreliable operations using the retry helper.
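To illustrate the retry idea from the last item, here is a minimal generic decorator. It is a stand-in written for this sketch and not the SDK's retry helper, whose signature is not shown here. retry_sketch, max_attempts, delay_seconds, and flaky_lookup are hypothetical names.

```python
import functools
import time


def retry_sketch(max_attempts: int = 3, delay_seconds: float = 0.0):
    """Hypothetical stand-in: re-run a function until it succeeds or attempts run out."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_attempts:
                        raise  # out of attempts: let the error propagate
                    time.sleep(delay_seconds)
        return wrapper
    return decorator


calls = {"count": 0}


@retry_sketch(max_attempts=3)
def flaky_lookup(query: str) -> str:
    # Fails twice with a transient error, then succeeds on the third call.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return f"result for {query}"


result = flaky_lookup("SELECT 1")
```

Because an exception raised by a task stops the chain for that row, wrapping only the unreliable call keeps a transient failure from ending the row's evaluation early.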