Working with Tasks
In the Patronus Experimentation Framework, tasks are a fundamental component used to define how your model processes inputs during evaluations. A task function specifies the logic that transforms an evaluated_model_input into an evaluated_model_output. This function is wrapped with the @task decorator, similar to how evaluators are defined with @evaluator.
Defining a Task
A task function in Patronus is a Python function that takes a set of predefined arguments and returns the model's output, which is typically generated by an LLM. These arguments must follow a specific naming convention so that the framework can inject the appropriate values during execution.
Basic Task Example
Here’s a basic example of a task that calls the GPT-4o model using the OpenAI API:
from openai import OpenAI
from patronus import task

oai = OpenAI()

@task
def call_gpt(evaluated_model_system_prompt: str, evaluated_model_input: str) -> str:
    model = "gpt-4o"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": evaluated_model_system_prompt},
                {"role": "user", "content": evaluated_model_input},
            ],
            temperature=0,
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output
In this example:
- The @task decorator transforms the function into a task that can be used by the Patronus framework.
- The function call_gpt accepts two arguments:
  - evaluated_model_system_prompt, which is the system prompt passed to the GPT model. This prompt comes from the provided dataset.
  - evaluated_model_input, which is the user input, also provided by the dataset.
- The function calls the GPT-4o model, using both the system prompt and the user input to generate a response.
- The output, evaluated_model_output, is returned as a string.
Expanding the Task
To capture more detailed information about the model's behavior and the parameters used during task execution, you can expand the task to return a TaskResult object. This allows you to include additional metadata alongside the model's output, providing richer context for your evaluations.
Here’s an expanded version of the previous task with tags added:
from openai import OpenAI
from patronus import task, TaskResult  # TaskResult is needed for the return value

oai = OpenAI()

@task
def call_gpt(evaluated_model_system_prompt: str, evaluated_model_input: str) -> TaskResult:
    model = "gpt-4o"
    temp = 1
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": evaluated_model_system_prompt},
                {"role": "user", "content": evaluated_model_input},
            ],
            temperature=temp,
        )
        .choices[0]
        .message.content
    )
    return TaskResult(
        evaluated_model_output=evaluated_model_output,
        evaluated_model_system_prompt=evaluated_model_system_prompt,
        evaluated_model_name=model,
        evaluated_model_provider="openai",
        evaluated_model_params={"temperature": temp},
        evaluated_model_selected_model=model,
        tags={"task_type": "chat_completion", "language": "English"},
    )
In this expanded example:
- The function processes the evaluated_model_system_prompt and evaluated_model_input, both provided by the dataset, and calls the GPT-4o model to generate the output.
- Instead of returning the model's output as a simple string, the function returns a TaskResult object. This object encapsulates:
  - evaluated_model_output: The output generated by the model.
  - evaluated_model_system_prompt: The system prompt used during the model's generation. This can be omitted since it is already provided in the dataset.
  - evaluated_model_name: The name of the model used, in this case "gpt-4o".
  - evaluated_model_provider: The provider of the model, specified as "openai".
  - evaluated_model_params: The parameters used for the model, such as temperature, set to 1.
  - evaluated_model_selected_model: The specific model variant used in the task, "gpt-4o".
  - tags: Additional metadata. Tags are arbitrary key-value pairs.
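Because a task may return either a plain string or a structured result, code that consumes task output has to handle both shapes. The following is a minimal sketch of that idea using only the standard library. SketchTaskResult and normalize are hypothetical names introduced here for illustration; they are not the Patronus TaskResult class or any Patronus API, and the real framework may treat return values differently.

```python
from dataclasses import dataclass, field
from typing import Optional, Union


@dataclass
class SketchTaskResult:
    """Hypothetical stand-in holding a subset of the fields listed above."""

    evaluated_model_output: str
    evaluated_model_name: Optional[str] = None
    evaluated_model_provider: Optional[str] = None
    evaluated_model_params: dict = field(default_factory=dict)
    tags: dict = field(default_factory=dict)


def normalize(result: Union[str, SketchTaskResult]) -> SketchTaskResult:
    """Wrap a bare string so downstream code always sees one shape."""
    if isinstance(result, str):
        return SketchTaskResult(evaluated_model_output=result)
    return result


# A task that returned only a string carries no metadata.
plain = normalize("4")

# A task that returned a structured result keeps its metadata.
rich = normalize(
    SketchTaskResult(
        evaluated_model_output="4",
        evaluated_model_name="gpt-4o",
        evaluated_model_provider="openai",
        evaluated_model_params={"temperature": 1},
        tags={"task_type": "chat_completion"},
    )
)
```

The string-only result still carries the output but none of the model metadata, which is the practical difference between the basic and expanded tasks.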
Full Code Example
Below is a complete example that defines a task using GPT-4o and runs it with a remote evaluator that assesses the accuracy and relevance of the model's responses. The dataset includes a challenging scenario designed to test the model's behavior under an unpredictable system prompt:
from openai import OpenAI
import textwrap
from patronus import Client, task, TaskResult

oai = OpenAI()
cli = Client()

@task
def call_gpt(evaluated_model_system_prompt: str, evaluated_model_input: str) -> TaskResult:
    model = "gpt-4o"
    params = {
        "temperature": 1,
        "max_tokens": 200,
    }
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": evaluated_model_system_prompt},
                {"role": "user", "content": evaluated_model_input},
            ],
            **params,
        )
        .choices[0]
        .message.content
    )
    return TaskResult(
        evaluated_model_output=evaluated_model_output,
        evaluated_model_system_prompt=evaluated_model_system_prompt,
        evaluated_model_name=model,
        evaluated_model_provider="openai",
        evaluated_model_params=params,
        evaluated_model_selected_model=model,
        tags={"task_type": "chat_completion", "language": "English"},
    )

evaluate_on_point = cli.remote_evaluator(
    "custom-large",
    "is-on-point",
    profile_config={
        "pass_criteria": textwrap.dedent(
            """
            The MODEL OUTPUT should accurately and concisely answer the USER INPUT.
            """
        ),
    },
    allow_update=True,
)

data = [
    {
        "evaluated_model_system_prompt": "You are a helpful assistant.",
        "evaluated_model_input": "How do I write a Python function?",
    },
    {
        "evaluated_model_system_prompt": "You are a knowledgeable assistant.",
        "evaluated_model_input": "Explain the concept of polymorphism in OOP.",
    },
    {
        "evaluated_model_system_prompt": "You are a creative poet who loves abstract ideas.",
        "evaluated_model_input": "What is 2 + 2?",
    },
]

cli.experiment(
    "Tutorial",
    data=data,
    task=call_gpt,
    evaluators=[evaluate_on_point],
    tags={"unit": "R&D", "version": "0.0.1"},
    experiment_name="OpenAI Task",
)
Task Parameter Names
Just like with evaluators, the parameters in a task definition need to be named according to the specific conventions used by the Patronus framework. These include:
evaluated_model_system_prompt
evaluated_model_retrieved_context
evaluated_model_input
evaluated_model_output
evaluated_model_gold_answer
tags
All fields except tags are passed from the dataset provided to the task. The tags parameter, on the other hand, is populated with the tags defined in the experiment(..., tags={...}) call. This allows for additional metadata to be included and customized during each experiment.
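The name-based injection described above can be pictured with a small standard-library sketch. It is not how Patronus is implemented; call_with_named_fields and echo_task are hypothetical names, and passing None for fields missing from the row is an assumption of this sketch. The point is only that the parameter names in the function signature decide which dataset fields, and which experiment tags, the function receives.

```python
import inspect
from typing import Any, Callable


def call_with_named_fields(
    func: Callable[..., Any], row: dict[str, Any], tags: dict[str, str]
) -> Any:
    """Call func with only the values its parameter names ask for."""
    # Dataset fields come from the row; "tags" comes from the experiment.
    available = {**row, "tags": tags}
    wanted = inspect.signature(func).parameters
    # Assumption of this sketch: a field absent from the row is passed as None.
    kwargs = {name: available.get(name) for name in wanted}
    return func(**kwargs)


def echo_task(evaluated_model_input: str, tags: dict[str, str]) -> str:
    """Hypothetical task that needs the input and the experiment tags only."""
    return f"[{tags['unit']}] {evaluated_model_input}"


row = {
    "evaluated_model_system_prompt": "You are a helpful assistant.",
    "evaluated_model_input": "What is 2 + 2?",
}

# The system prompt is in the row but is not passed, because echo_task
# does not declare a parameter with that name.
result = call_with_named_fields(echo_task, row, tags={"unit": "R&D"})
```

A function that misspells a parameter name would receive nothing useful under this scheme, which is why the naming convention matters.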
Evaluations without a Task
In some cases, you may already have a dataset that includes model outputs, and you simply want to evaluate those outputs without running a task. In such scenarios, the task is not required, and you can omit it entirely from your experiment.
When you omit the task, the framework simply evaluates the provided outputs using the evaluators. This is equivalent to using a "no operation" (nop) task that passes the model output through unchanged, such as the following nop_task:
@task
def nop_task(evaluated_model_output: str) -> str:
    return evaluated_model_output
By omitting the task, you can directly focus on the evaluation of your pre-existing model outputs, as demonstrated in the evaluator examples.
This is useful when you want to assess the quality of outputs generated by an external system or when you already have outputs from previous experiments and want to apply new evaluators without rerunning the task.
cli.experiment(
    "Project Name",
    data=[
        {
            "evaluated_model_input": "How do I write a Python function?",
            "evaluated_model_output": "def my_function():\n pass",
        }
    ],
    evaluators=[...],
)
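The equivalence between omitting the task and using a nop task can be checked with a plain-Python sketch that runs without the Patronus client. The is_non_empty evaluator and the sample rows are hypothetical and exist only for illustration; nop_task is written here without the @task decorator so the snippet has no dependencies.

```python
def nop_task(evaluated_model_output: str) -> str:
    """Pass the pre-existing model output through unchanged."""
    return evaluated_model_output


def is_non_empty(evaluated_model_output: str) -> bool:
    """Hypothetical evaluator: passes when the output has non-whitespace text."""
    return bool(evaluated_model_output.strip())


# Rows that already contain model outputs, so no model call is needed.
data = [
    {
        "evaluated_model_input": "How do I write a Python function?",
        "evaluated_model_output": "def my_function():\n pass",
    },
    {
        "evaluated_model_input": "What is 2 + 2?",
        "evaluated_model_output": "   ",
    },
]

# Evaluate the stored outputs directly, with no task at all.
without_task = [is_non_empty(row["evaluated_model_output"]) for row in data]

# Evaluate the same outputs after routing them through the nop task.
with_nop = [is_non_empty(nop_task(row["evaluated_model_output"])) for row in data]
```

Both lists are identical, since the nop task changes nothing about the values the evaluator sees.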