Working with Tasks

In the Patronus Experimentation Framework, tasks are a fundamental component used to define how your model processes inputs during evaluations. A task function specifies the logic that transforms an evaluated_model_input into an evaluated_model_output. This function is wrapped with the @task decorator, similar to how evaluators are defined with @evaluator.

Defining a Task

A task function in Patronus is a Python function that takes in a set of predefined arguments and returns the output. While the output is typically generated by an LLM, it can come from any part of your AI system, such as a retrieval step or a query-processing stage. These arguments must follow a specific naming convention so that the framework can inject the appropriate values during execution.
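
For illustration, here is a minimal sketch of a task that performs no model call at all and only reformats the input (the transformation itself is arbitrary):

from patronus import task


@task
def passthrough_upper(evaluated_model_input: str) -> str:
    # The framework injects evaluated_model_input from the dataset;
    # the returned string becomes evaluated_model_output.
    return evaluated_model_input.strip().upper()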

Basic Task Example

Here’s a basic example of a task that calls the GPT-4o model using the OpenAI API:

from openai import OpenAI
from patronus import task

oai = OpenAI()


@task
def call_gpt(evaluated_model_system_prompt: str, evaluated_model_input: str) -> str:
    model = "gpt-4o"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": evaluated_model_system_prompt},
                {"role": "user", "content": evaluated_model_input},
            ],
            temperature=0,
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output

In this example:

  • The @task decorator transforms the function definition into a task that can be passed to the Experimentation Framework.
  • The function call_gpt accepts two arguments:
    • evaluated_model_system_prompt, which is the system prompt passed to the GPT model.
    • evaluated_model_input, which is the user input to the GPT model.
  • Both arguments are taken from the dataset provided to the experiment.
  • The function calls OpenAI's GPT-4o, using both the system prompt and the user input to generate a response.
  • The output, evaluated_model_output, is returned as a string.

Expanding the Task

To capture more detailed information about the model's behavior and the parameters used during task execution, you can expand the task to return a TaskResult object. This allows you to include additional metadata alongside the model's output, providing richer context for your evaluations.

Here’s an expanded version of the previous task that returns a TaskResult with model metadata and tags:

from openai import OpenAI
from patronus import task, TaskResult

oai = OpenAI()


@task
def call_gpt(evaluated_model_system_prompt: str, evaluated_model_input: str) -> TaskResult:
    model = "gpt-4o"
    temp = 1
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": evaluated_model_system_prompt},
                {"role": "user", "content": evaluated_model_input},
            ],
            temperature=temp,
        )
        .choices[0]
        .message.content
    )
    return TaskResult(
        evaluated_model_output=evaluated_model_output,
        evaluated_model_system_prompt=evaluated_model_system_prompt,
        evaluated_model_name=model,
        evaluated_model_provider="openai",
        evaluated_model_params={"temperature": temp},
        evaluated_model_selected_model=model,
        tags={"task_type": "chat_completion", "language": "English"},
    )

In this expanded example:

  • The function processes the evaluated_model_system_prompt and evaluated_model_input, both provided from the dataset, and calls the GPT-4o model to generate the output.
  • Instead of returning the model's output as a simple string, the function returns a TaskResult object. This object encapsulates:
    • evaluated_model_output: The output generated by the model.
    • evaluated_model_system_prompt: The system prompt used during the model's generation. This could be omitted since it's already provided in the dataset (see the leaner sketch after this list).
    • evaluated_model_name: The name of the model used, in this case, "gpt-4o".
    • evaluated_model_provider: The provider of the model, specified as "openai".
    • evaluated_model_params: The parameters used when calling the model; here, temperature is set to 1.
    • evaluated_model_selected_model: The specific model variant used in the task, "gpt-4o".
    • tags: Additional metadata. Tags are arbitrary key-value pairs.
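
If you don't need the full set of metadata, the TaskResult can stay lean. The sketch below keeps only the output and a tag; it assumes the remaining fields are optional, and the string formatting stands in for a real model call:

from patronus import task, TaskResult


@task
def lean_task(evaluated_model_input: str) -> TaskResult:
    # Stand-in for a real model call.
    evaluated_model_output = f"Echo: {evaluated_model_input}"
    return TaskResult(
        evaluated_model_output=evaluated_model_output,
        tags={"task_type": "echo"},
    )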

Full Code Example

Below is a complete example demonstrating how to define and run a task using GPT-4o with a remote evaluator to assess the accuracy and relevance of the model's responses, including a challenging scenario designed to test the model's behavior under an unpredictable system prompt:

from openai import OpenAI
import textwrap
from patronus import Client, task, TaskResult

oai = OpenAI()
cli = Client()


@task
def call_gpt(evaluated_model_system_prompt: str, evaluated_model_input: str) -> TaskResult:
    model = "gpt-4o"
    params = {
        "temperature": 1,
        "max_tokens": 200,
    }
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": evaluated_model_system_prompt},
                {"role": "user", "content": evaluated_model_input},
            ],
            **params,
        )
        .choices[0]
        .message.content
    )
    return TaskResult(
        evaluated_model_output=evaluated_model_output,
        evaluated_model_system_prompt=evaluated_model_system_prompt,
        evaluated_model_name=model,
        evaluated_model_provider="openai",
        evaluated_model_params=params,
        evaluated_model_selected_model=model,
        tags={"task_type": "chat_completion", "language": "English"},
    )


evaluate_on_point = cli.remote_evaluator(
    "custom-large",
    "is-on-point",
    profile_config={
        "pass_criteria": textwrap.dedent(
            """
            The MODEL OUTPUT should accurately and concisely answer the USER INPUT.
            """
        ),
    },
    allow_update=True,
)

data = [
    {
        "evaluated_model_system_prompt": "You are a helpful assistant.",
        "evaluated_model_input": "How do I write a Python function?",
    },
    {
        "evaluated_model_system_prompt": "You are a knowledgeable assistant.",
        "evaluated_model_input": "Explain the concept of polymorphism in OOP.",
    },
    {
        "evaluated_model_system_prompt": "You are a creative poet who loves abstract ideas.",
        "evaluated_model_input": "What is 2 + 2?",
    },
]

cli.experiment(
    "Tutorial",
    data=data,
    task=call_gpt,
    evaluators=[evaluate_on_point],
    tags={"unit": "R&D", "version": "0.0.1"},
    experiment_name="OpenAI Task",
)

Task parameter names

Just like with evaluators, the parameters in a task definition need to be named according to the specific conventions used by the Patronus framework. These include:

  • evaluated_model_system_prompt
  • evaluated_model_retrieved_context
  • evaluated_model_input
  • evaluated_model_output
  • evaluated_model_gold_answer
  • tags

All fields except tags are passed from the dataset provided to the task. The tags parameter, on the other hand, is populated with the tags defined in the experiment(..., tags={...}) call. This allows for additional metadata to be included and customized during each experiment.
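
As an illustrative sketch (assuming the retrieved context arrives as a list of strings), a task can combine several of these parameters; the string formatting below stands in for a real model call:

from patronus import task


@task
def answer_with_context(
    evaluated_model_input: str,
    evaluated_model_retrieved_context: list[str],
    tags: dict,
) -> str:
    # The input and retrieved context are injected from the dataset;
    # tags come from the experiment(..., tags={...}) call.
    context = "\n".join(evaluated_model_retrieved_context)
    # Stand-in for a real model call that would use the context.
    return f"[{tags.get('version', 'n/a')}] Based on:\n{context}\n\nAnswer to: {evaluated_model_input}"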

Evaluations without a Task

In some cases, you may already have a dataset that includes model outputs, and you simply want to evaluate those outputs without running a task. In such scenarios, the task is not required, and you can omit it entirely from your experiment.

When you omit the task, the framework evaluates the provided outputs directly with the evaluators. This is equivalent to defining a "no operation" (nop) task that simply passes the model output through unchanged:

@task
def nop_task(evaluated_model_output: str) -> str:
    return evaluated_model_output

By omitting the task, you can focus directly on evaluating your pre-existing model outputs, as demonstrated in the evaluator examples.

This is useful when you want to assess the quality of outputs generated by an external system or when you already have outputs from previous experiments and want to apply new evaluators without rerunning the task.

cli.experiment(
    "Project Name",
    data=[
        {
            "evaluated_model_input": "How do I write a Python function?",
            "evaluated_model_output": "def my_function():\n    pass",
        }
    ],
    evaluators=[...],
)