Working with Large Datasets
Guidelines for efficiently processing datasets with more than 30k rows
For larger datasets (> 30k rows), we recommend importing and using locally stored datasets with the Patronus Python SDK. Local datasets can be stored in .csv or .jsonl format, or downloaded from HuggingFace or S3 storage. To use your locally stored datasets, simply map the fields to the following standard fields:
system_prompt (optional): The system prompt provided to the model, setting the context or behavior for the model's response.
task_context: Additional information or context provided to the model. This can be a string or a list of strings, typically used in Retrieval-Augmented Generation (RAG) setups, where the model's response depends on external information fetched from a knowledge base or similar source.
task_input: The primary input to the model or task, typically a user query or instruction.
task_output: The output generated by the model or task being evaluated.
gold_answer: The expected or correct answer that the model output is compared against during evaluation. This field is used to assess the accuracy and quality of the model's response.
tags (optional): Key-value pairs for categorizing and filtering samples.
task_metadata (optional): Additional structured information about the task.
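For reference, a single dataset record that follows this schema might look like the sketch below. All values and the field comments are illustrative, not taken from a real dataset:

```python
# An illustrative record using the standard fields above; values are made up.
example_row = {
    "system_prompt": "You are a helpful assistant.",       # optional
    "task_context": ["Paris is the capital of France."],   # string or list of strings
    "task_input": "What is the capital of France?",
    "task_output": "The capital of France is Paris.",
    "gold_answer": "Paris",
    "tags": {"split": "test"},                              # optional key-value pairs
    "task_metadata": {"source": "geography_qa"},            # optional structured info
}
```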
Here's a simple example of mapping fields when loading a large CSV file:
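The sketch below uses pandas to stream a large CSV in chunks and rename its columns to the standard fields. The source column names (question, context, response, label), file paths, and chunk size are assumptions for illustration only; the mapped JSONL file can then be loaded with the SDK's local-dataset utilities described in the documentation linked below.

```python
import pandas as pd

# Illustrative mapping from this hypothetical CSV's column names to the standard fields.
FIELD_MAPPING = {
    "question": "task_input",
    "context": "task_context",
    "response": "task_output",
    "label": "gold_answer",
}

# Stream the CSV in chunks to keep memory usage bounded for very large files,
# rename the columns to the standard fields, and write the mapped rows out as JSONL.
with open("my_large_dataset.jsonl", "w") as out:
    for chunk in pd.read_csv("my_large_dataset.csv", chunksize=10_000):
        chunk.rename(columns=FIELD_MAPPING).to_json(out, orient="records", lines=True)
```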
For the full set of accepted parameters and examples of how to use local datasets in the Python SDK, see the Using Datasets section in the Experimentation Framework documentation.