Working with Tasks
In the Patronus Experimentation Framework, tasks are a fundamental component used to define how your model processes inputs during evaluations. A task function specifies the logic that transforms an `evaluated_model_input` into an `evaluated_model_output`. This function is wrapped with the `@task` decorator, similar to how evaluators are defined with `@evaluator`.
Creating a Task
A task function in Patronus is a Python function that takes in a set of predefined arguments and returns the output. While the output is typically generated by an LLM, it can be any part of your AI system, such as retrieved contexts or processed user queries. These arguments must follow a specific naming convention so that the framework can inject the appropriate values during execution.
Basic Task Example
Here’s a basic example of a task that calls the GPT-4o model using the OpenAI API:
from openai import OpenAI

from patronus import task

oai = OpenAI()


@task
def call_gpt(evaluated_model_system_prompt: str, evaluated_model_input: str) -> str:
    model = "gpt-4o"
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": evaluated_model_system_prompt},
                {"role": "user", "content": evaluated_model_input},
            ],
            temperature=0,
        )
        .choices[0]
        .message.content
    )
    return evaluated_model_output
In this example:
- The `@task` decorator transforms the function definition into a task that can be passed to the Experimentation Framework.
- The function `call_gpt` accepts two arguments:
  - `evaluated_model_system_prompt`, which is the system prompt passed to the GPT model.
  - `evaluated_model_input`, which is the user input to the GPT model.
- Both arguments are taken from the dataset parameter provided to the Experiment (see the sketch below).
- The function calls OpenAI's GPT-4o, using both the system prompt and the user input to generate a response.
- The output, `evaluated_model_output`, is returned as a string.
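To make the injection concrete: the dataset keys simply mirror the task's argument names. Below is a minimal sketch of wiring this task into an experiment, assuming a Patronus `Client` instance; the Full Code Example later on this page shows the complete, runnable flow.

from patronus import Client

cli = Client()

cli.experiment(
    "Tutorial",
    data=[
        {
            "evaluated_model_system_prompt": "You are a helpful assistant.",
            "evaluated_model_input": "How do I write a Python function?",
        }
    ],
    task=call_gpt,
    evaluators=[...],  # add the evaluators you want to run
)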
Enhanced Task with Metadata
To capture more detailed information about the model's behavior and the parameters used during task execution, you can expand the task to return a `TaskResult` object. This allows you to include additional metadata alongside the model's output, providing richer context for your evaluations.
Here’s an expanded version of the previous task with metadata and tags added:
from openai import OpenAI

from patronus import task, TaskResult

oai = OpenAI()


@task
def call_gpt(evaluated_model_system_prompt: str, evaluated_model_input: str) -> TaskResult:
    model = "gpt-4o"
    params = {
        "temperature": 1,
        "max_tokens": 200,
    }
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": evaluated_model_system_prompt},
                {"role": "user", "content": evaluated_model_input},
            ],
            **params,
        )
        .choices[0]
        .message.content
    )
    return TaskResult(
        evaluated_model_output=evaluated_model_output,
        metadata={
            "evaluated_model_name": model,
            "evaluated_model_provider": "openai",
            "evaluated_model_params": params,
            "evaluated_model_selected_model": model,
        },
        tags={"task_type": "chat_completion", "language": "English"},
    )
In this expanded example:
- The function processes the `evaluated_model_system_prompt` and `evaluated_model_input`, both provided from the dataset, and interacts with the GPT-4o model to generate the output.
- Instead of returning the model's output as a simple string, the function returns a `TaskResult` object. This object encapsulates:
  - `evaluated_model_output`: The output generated by the model.
  - `metadata`: Arbitrary JSON data (see the sketch below for how evaluators can read it).
  - `tags`: Additional metadata; tags are arbitrary key-value pairs.
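Because the metadata travels with the task result, evaluators can inspect it later. Here is a minimal sketch; the evaluator name and pass condition are only illustrative, while the `(row, task_result)` signature matches the evaluator example shown later on this page.

from patronus import evaluator, Row, TaskResult


@evaluator
def used_expected_model(row: Row, task_result: TaskResult) -> bool:
    # Read the metadata attached by the task's TaskResult.
    metadata = task_result.metadata or {}
    return metadata.get("evaluated_model_name") == "gpt-4o"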
Full Code Example
Below is a complete example demonstrating how to define and run a task using GPT-4o with a remote evaluator that assesses the accuracy and relevance of the model's responses. The dataset includes a challenging scenario designed to test the model's behavior under an unpredictable system prompt:
from openai import OpenAI
import textwrap

from patronus import Client, task, TaskResult

oai = OpenAI()
cli = Client()


@task
def call_gpt(evaluated_model_system_prompt: str, evaluated_model_input: str) -> TaskResult:
    model = "gpt-4o"
    params = {
        "temperature": 1,
        "max_tokens": 200,
    }
    evaluated_model_output = (
        oai.chat.completions.create(
            model=model,
            messages=[
                {"role": "system", "content": evaluated_model_system_prompt},
                {"role": "user", "content": evaluated_model_input},
            ],
            **params,
        )
        .choices[0]
        .message.content
    )
    return TaskResult(
        evaluated_model_output=evaluated_model_output,
        metadata={
            "evaluated_model_system_prompt": evaluated_model_system_prompt,
            "evaluated_model_name": model,
            "evaluated_model_provider": "openai",
            "evaluated_model_params": params,
            "evaluated_model_selected_model": model,
        },
        tags={"task_type": "chat_completion", "language": "English"},
    )
evaluate_on_point = cli.remote_evaluator(
    "custom-large",
    "is-on-point",
    profile_config={
        "pass_criteria": textwrap.dedent(
            """
            The MODEL OUTPUT should accurately and concisely answer the USER INPUT.
            """
        ),
    },
    allow_update=True,
)

data = [
    {
        "evaluated_model_system_prompt": "You are a helpful assistant.",
        "evaluated_model_input": "How do I write a Python function?",
    },
    {
        "evaluated_model_system_prompt": "You are a knowledgeable assistant.",
        "evaluated_model_input": "Explain the concept of polymorphism in OOP.",
    },
    {
        "evaluated_model_system_prompt": "You are a creative poet who loves abstract ideas.",
        "evaluated_model_input": "What is 2 + 2?",
    },
]

cli.experiment(
    "Tutorial",
    data=data,
    task=call_gpt,
    evaluators=[evaluate_on_point],
    tags={"unit": "R&D", "version": "0.0.1"},
    experiment_name="OpenAI Task",
)
Evaluations without a Task
In some cases, you may already have a dataset that includes model outputs, and you simply want to evaluate those outputs without running a task. In such scenarios, you can omit the task parameter entirely from your experiment.
When you omit the task, the framework will skip the task execution phase completely and proceed directly to evaluating the provided outputs using the specified evaluators. It's important to note that in this case, the `task_result` parameter will be `None` in your evaluators, since no task was executed.
Here's an example of running an experiment without a task:
cli.experiment(
    "Project Name",
    data=[
        {
            "evaluated_model_input": "How do I write a Python function?",
            "evaluated_model_output": "def my_function():\n pass",
        }
    ],
    evaluators=[...],
)
This approach is particularly useful when:
- You want to assess the quality of outputs generated by an external system
- You have outputs from previous experiments and want to apply new evaluators
- You're working with historical data that already includes model outputs
- You want to evaluate outputs without re-running potentially expensive model calls
When writing evaluators that might be used in taskless experiments, make sure to handle the case where `task_result` is `None`:
from patronus import evaluator, Row, TaskResult


@evaluator
def my_evaluator(row: Row, task_result: TaskResult) -> bool:
    # Handle the case where task_result is None (taskless experiment)
    if task_result is None:
        # Access the output directly from the dataset row
        output = row.evaluated_model_output
    else:
        # Access the output from the task result
        output = task_result.evaluated_model_output
    # Proceed with evaluation (evaluate_output is a placeholder for your own logic)
    return evaluate_output(output)
Conditional Tasks
You can skip a task's execution by returning `None`. This will:
- Skip all evaluators in the current chain link
- Stop the chain execution for this dataset row
from patronus import task, Row


@task
def perform_sql_query(row: Row):
    # Skip non-SQL rows; returning None stops the chain for this dataset row.
    if row.language != "sql":
        return None
    return execute_sql(row.query)  # execute_sql is a placeholder for your own logic
Task Definition
When creating a task, whether function-based or class-based, you can access various parameters that provide context about the task. These parameters must be named exactly as specified below:
- `row` (`patronus.Row`): The complete row from the dataset, extending `pandas.Series`. It provides helper properties for well-defined fields like `evaluated_model_system_prompt`, `evaluated_model_input`, etc., making data access more convenient.
- `evaluated_model_system_prompt`
- `evaluated_model_retrieved_context`
- `evaluated_model_input`
- `evaluated_model_output`
- `evaluated_model_gold_answer`
- `parent`: Reference to results from previous chain links. This parameter is only available when using evaluation chaining in your experiment.
While raw data is accessible through the `row` parameter, we recommend that you use the `evaluated_model_*` field names, as they are logged to the Patronus AI platform. This ensures better traceability of your evaluation process.
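For illustration, here is a minimal sketch of a task that accepts the whole `row` instead of individual fields and returns a processed user query (as noted earlier, a task's output does not have to be an LLM response). It assumes the dataset provides the referenced `evaluated_model_*` fields; the prompt format itself is arbitrary.

from patronus import task, Row


@task
def build_prompt(row: Row) -> str:
    # Helper properties on Row expose the well-defined evaluated_model_* fields.
    context = row.evaluated_model_retrieved_context or ""
    return (
        f"{row.evaluated_model_system_prompt}\n\n"
        f"Context: {context}\n\n"
        f"Question: {row.evaluated_model_input}"
    )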
Tasks can return either a `str` or a `TaskResult` object. The framework will automatically convert a `str` to a `TaskResult`.
The `TaskResult` class is defined as follows:
class TaskResult:
    evaluated_model_output: str
    metadata: Optional[dict[str, typing.Any]] = None
    tags: Optional[dict[str, str]] = None