Working with Tasks

In the Patronus Experimentation Framework, tasks are a fundamental component used to define how your model processes inputs during evaluations. A task function specifies the logic that transforms an evaluated_model_input into an evaluated_model_output. This function is wrapped with the @task decorator, similar to how evaluators are defined with @evaluator.

Creating a task

A task function in Patronus is a Python function that takes a set of predefined arguments and returns an output. While the output is typically generated by an LLM, it can come from any part of your AI system, such as retrieved contexts or processed user queries. The arguments must follow a specific naming convention so that the framework can inject the appropriate values during execution.
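
The output does not have to come from a model call at all. Here is a minimal sketch with no LLM involved, included only to illustrate the argument naming convention (uppercase_query is a made-up example, not part of the framework):

from patronus import task


@task
def uppercase_query(evaluated_model_input: str) -> str:
    # The framework injects the dataset's evaluated_model_input by name;
    # the returned string becomes the evaluated_model_output.
    return evaluated_model_input.upper()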

Basic Task Example

Here’s a basic example of a task that calls the GPT-4o model using the OpenAI API:

from openai import OpenAI
from patronus import task

oai = OpenAI()


@task
def call_gpt(evaluated_model_system_prompt: str, evaluated_model_input: str) -> str:
    model = "gpt-4o"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": evaluated_model_system_prompt},
                {"role": "user", "content": evaluated_model_input},
            ],
            temperature=0,
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output

In this example:

  • The @task decorator transforms the function definition into a task that can be passed to the Experimentation Framework.
  • The function call_gpt accepts two arguments:
    • evaluated_model_system_prompt, which is the system prompt passed to the GPT model.
    • evaluated_model_input, which is the user input to the GPT model.
  • Both arguments are taken from the dataset (the data argument) provided to the experiment, matched by name (see the sketch after this list).
  • The function calls OpenAI's GPT-4o model, using both the system prompt and the user input to generate a response.
  • The output, evaluated_model_output, is returned as a string.
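
For instance, a dataset row shaped like the one below supplies both arguments; the framework matches the dictionary keys to the task's parameter names. This is a sketch of the data used in the full example later on this page:

data = [
    {
        "evaluated_model_system_prompt": "You are a helpful assistant.",
        "evaluated_model_input": "How do I write a Python function?",
    },
]
# Each key is injected into call_gpt as the argument with the same name.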

Enhanced Task with Metadata

To capture more detailed information about the model's behavior and the parameters used during task execution, you can expand the task to return a TaskResult object. This allows you to include additional metadata alongside the model's output, providing richer context for your evaluations.

Here’s an expanded version of the previous task with metadata and tags added:

from openai import OpenAI
from patronus import task, TaskResult

oai = OpenAI()


@task
def call_gpt(evaluated_model_system_prompt: str, evaluated_model_input: str) -> TaskResult:
    model = "gpt-4o"
    params = {
        "temperature": 1,
        "max_tokens": 200,
    }
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": evaluated_model_system_prompt},
                {"role": "user", "content": evaluated_model_input},
            ],
            **params,
        )
        .choices[0]
        .message.content
    )
    return TaskResult(
        evaluated_model_output=evaluated_model_output,
        metadata={
            "evaluated_model_name": model,
            "evaluated_model_provider": "openai",
            "evaluated_model_params": params,
            "evaluated_model_selected_model": model,
        },
        tags={"task_type": "chat_completion", "language": "English"},
    )

In this expanded example:

  • The function processes the evaluated_model_system_prompt and evaluated_model_input, both provided from the dataset, and calls the GPT-4o model to generate the output.
  • Instead of returning the model's output as a simple string, the function returns a TaskResult object. This object encapsulates:
    • evaluated_model_output: The output generated by the model.
    • metadata: Arbitrary JSON data attached to the result; evaluators can read it back at evaluation time (see the sketch after this list).
    • tags: Arbitrary string key-value pairs used to label the result.
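
Because the TaskResult is passed to evaluators through the task_result parameter, anything recorded in metadata or tags is available again at evaluation time. The sketch below is illustrative only; it assumes evaluator and Row can be imported from patronus (as the other snippets on this page suggest), and the token-budget check is a made-up example:

from patronus import Row, TaskResult, evaluator


@evaluator
def respects_token_budget(row: Row, task_result: TaskResult) -> bool:
    # Read back the generation parameters recorded by the task.
    max_tokens = task_result.metadata["evaluated_model_params"]["max_tokens"]
    # Crude check: use word count as a rough proxy for token count.
    return len(task_result.evaluated_model_output.split()) <= max_tokens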

Full Code Example

Below is a complete example demonstrating how to define and run a task using GPT-4o with a remote evaluator that assesses the accuracy and relevance of the model's responses. The dataset includes a challenging scenario designed to test the model's behavior under an unpredictable system prompt:

from openai import OpenAI
import textwrap
from patronus import Client, task, TaskResult

oai = OpenAI()
cli = Client()


@task
def call_gpt(evaluated_model_system_prompt: str, evaluated_model_input: str) -> TaskResult:
    model = "gpt-4o"
    params = {
        "temperature": 1,
        "max_tokens": 200,
    }
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": evaluated_model_system_prompt},
                {"role": "user", "content": evaluated_model_input},
            ],
            **params,
        )
        .choices[0]
        .message.content
    )
    return TaskResult(
        evaluated_model_output=evaluated_model_output,
        metadata={
            "evaluated_model_system_prompt": evaluated_model_system_prompt,
            "evaluated_model_name": model,
            "evaluated_model_provider": "openai",
            "evaluated_model_params": params,
            "evaluated_model_selected_model": model,
        },
        tags={"task_type": "chat_completion", "language": "English"},
    )


evaluate_on_point = cli.remote_evaluator(
    "custom-large",
    "is-on-point",
    profile_config={
        "pass_criteria": textwrap.dedent(
            """
            The MODEL OUTPUT should accurately and concisely answer the USER INPUT.
            """
        ),
    },
    allow_update=True,
)

data = [
    {
        "evaluated_model_system_prompt": "You are a helpful assistant.",
        "evaluated_model_input": "How do I write a Python function?",
    },
    {
        "evaluated_model_system_prompt": "You are a knowledgeable assistant.",
        "evaluated_model_input": "Explain the concept of polymorphism in OOP.",
    },
    {
        "evaluated_model_system_prompt": "You are a creative poet who loves abstract ideas.",
        "evaluated_model_input": "What is 2 + 2?",
    },
]

cli.experiment(
    "Tutorial",
    data=data,
    task=call_gpt,
    evaluators=[evaluate_on_point],
    tags={"unit": "R&D", "version": "0.0.1"},
    experiment_name="OpenAI Task",
)

Evaluations without a Task

In some cases, you may already have a dataset that includes model outputs, and you simply want to evaluate those outputs without running a task. In such scenarios, you can omit the task parameter entirely from your experiment.

When you omit the task, the framework will skip the task execution phase completely and proceed directly to evaluating the provided outputs using the specified evaluators. It's important to note that in this case, the task_result parameter will be None in your evaluators, since no task was executed.

Here's an example of running an experiment without a task:

cli.experiment(
    "Project Name",
    data=[
        {
            "evaluated_model_input": "How do I write a Python function?",
            "evaluated_model_output": "def my_function():\n    pass",
        }
    ],
    evaluators=[...],
)

This approach is particularly useful when:

  • You want to assess the quality of outputs generated by an external system
  • You have outputs from previous experiments and want to apply new evaluators
  • You're working with historical data that already includes model outputs
  • You want to evaluate outputs without re-running potentially expensive model calls

When writing evaluators that might be used in taskless experiments, make sure to handle the case where task_result is None:

from typing import Optional

from patronus import Row, TaskResult, evaluator


@evaluator
def my_evaluator(row: Row, task_result: Optional[TaskResult]) -> bool:
    # Handle the case where task_result is None (no task was executed)
    if task_result is None:
        # Access the output directly from the dataset row
        output = row.evaluated_model_output
    else:
        # Access the output from the task result
        output = task_result.evaluated_model_output

    # Proceed with the evaluation; evaluate_output is a placeholder for your own logic
    return evaluate_output(output)

Conditional Tasks

You can skip a task's execution by returning None. This will:

  • Skip all evaluators in the current chain link
  • Stop the chain execution for this dataset row

For example:

from patronus import Row, task


@task
def perform_sql_query(row: Row):
    # Returning None skips this row's evaluators and stops the chain here.
    if row.language != "sql":
        return None
    # execute_sql is a placeholder for your own query-execution logic.
    return execute_sql(row.query)

Task Definition

When creating a task, whether function-based or class-based, you can access various parameters that provide context about the task. These parameters must be named exactly as specified below:

  • row (patronus.Row): The complete row from the dataset, extending pandas.Series. It provides helper properties for well-defined fields like evaluated_model_system_prompt, evaluated_model_input, etc., making data access more convenient.

  • evaluated_model_system_prompt
  • evaluated_model_retrieved_context
  • evaluated_model_input
  • evaluated_model_output
  • evaluated_model_gold_answer

  • parent: Reference to results from previous chain links. This parameter is only available when using evaluation chaining in your experiment.

While raw data is accessible through the row parameter, we recommend using the evaluated_model_* field names, as they are logged to the Patronus AI platform. This ensures better traceability of your evaluation process.
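
For cases where you do need access to the full row, here is a minimal sketch (assuming Row can be imported directly from patronus, matching the patronus.Row reference above; the task body is a made-up example):

from patronus import Row, task


@task
def answer_from_row(row: Row) -> str:
    # row behaves like a pandas.Series and exposes helper properties
    # for the well-defined evaluated_model_* fields.
    prompt = row.evaluated_model_system_prompt or "You are a helpful assistant."
    return f"{prompt}\n\n{row.evaluated_model_input}"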

Tasks can return either a str or a TaskResult object. The framework will automatically convert a str return value into a TaskResult.

The TaskResult class is defined as follows:

class TaskResult:
    evaluated_model_output: str
    metadata: Optional[dict[str, typing.Any]] = None
    tags: Optional[dict[str, str]] = None