Managing Datasets
Datasets are the entry point to LLM evaluation. A dataset can consist of inputs (prompts), gold labels (correct answers), and additional metadata.
There are multiple ways to use datasets in Patronus:
- Patronus datasets: These are off-the-shelf datasets you can use to get started, covering general use cases such as toxic inputs.
- Upload datasets: You can upload your own datasets to the platform and start using them in evaluations.
- Local datasets: Our experimentation framework supports ingesting data locally.
Upload Your Own Datasets
You can use Patronus to manage your suite of test datasets. We accept .jsonl files that include the following string fields:
- evaluated_model_input: The prompt provided to your LLM agent, or the input to a task defined in the experimentation framework
- evaluated_model_gold_answer (optional): The gold answer or expected output
Below is an example JSONL file you might want to upload for a medical help chatbot:
{"evaluated_model_input": "How do I get a better night's sleep?"}
{"evaluated_model_input": "What are common over the counter sleep aids?"}
{"evaluated_model_input": "What are common soccer injuries?"}
{"evaluated_model_input": "Does exercising before sleep help you sleep better?"}
{"evaluated_model_input": "Should I drink boiling water for a sore throat?"}
{"evaluated_model_input": "What does sleep insonmia mean?"}
{"evaluated_model_input": "Where can I find a good doctor?"}
{"evaluated_model_input": "What is the different cycles of sleep?"}
Once you upload the dataset, you'll see it in the Datasets view along with our off-the-shelf datasets.
You can now use this dataset in the experimentation framework or in evaluation runs on the platform. To run an experiment with this dataset, reference it by its id field, e.g.:
# `cli` is an initialized Patronus client and `task` is your task function.
medical_chatbot_dataset = cli.remote_dataset("d-jxrisvlp1hgf786h")

cli.experiment(
    "Project Name",
    data=medical_chatbot_dataset,
    task=task,
    evaluators=[evaluator],  # Replace with your evaluators
)
See working with datasets for more information.
Patronus Managed Datasets
We provide a number of off-the-shelf sample datasets that have been vetted for quality. These test datasets consist of 10-100 samples and assess agents on general use cases, including PII leakage and performance in real-world domains.
We currently support the following off-the-shelf datasets:
- pii-questions-1.0.0: PII-eliciting prompts
- toxic-prompts-1.0.0: Toxic prompts that an LLM might respond offensively to
- legal-confidentiality-1.0.0: Legal prompts that check whether an LLM understands the concept of confidentiality in legal document clauses
- model-origin-1.0.0-small: OWASP security assessment checking whether LLMs leak information about model origins
- prompt-injections-1.0.0-small: Prompt injection tests
You can download any of these datasets with Actions -> Download Dataset.
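These managed datasets can also be referenced from the experimentation framework in the same way as an uploaded dataset. A minimal sketch, assuming remote_dataset accepts the dataset's identifier (copy the actual id or name from the Datasets view) and reusing the cli, task, and evaluator objects from the example above:

pii_dataset = cli.remote_dataset("pii-questions-1.0.0")  # Identifier is illustrative

cli.experiment(
    "Project Name",
    data=pii_dataset,
    task=task,
    evaluators=[evaluator],  # Replace with your evaluators
)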
We are actively working on providing more datasets for additional use cases. If there are off-the-shelf datasets you'd like to see added to this list, please reach out to us!
Local Datasets
You can import and use locally stored datasets with the Patronus Python SDK. Local datasets can be stored in .csv or .jsonl format, or downloaded from HuggingFace or S3. To use a locally stored dataset, map its columns to the following fields:
- evaluated_model_system_prompt (optional): The system prompt provided to the model, setting the context or behavior for the model's response.
- evaluated_model_retrieved_context: A list of context strings (list[str]) retrieved and provided to the model as additional information. This field is typically used in a Retrieval-Augmented Generation (RAG) setup, where the model's response depends on external context or supporting information fetched from a knowledge base or similar source.
- evaluated_model_input: Typically a user input provided to the model that it must respond to.
- evaluated_model_output: The output generated by the model.
- evaluated_model_gold_answer: The expected or correct answer that the model output is compared against during evaluation. This field is used to assess the accuracy and quality of the model's response.
For the full set of accepted parameters and examples of how to use local datasets in the Python SDK, see the Datasets section in the Experimentation Framework.
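As a rough illustration of the mapping, the sketch below reads a local CSV with pandas, renames its columns to the field names above, and passes the resulting records to an experiment. It assumes the experiment's data parameter accepts a list of dictionaries keyed by these fields; the file path and column names are illustrative, and cli, task, and evaluator are the same objects used in the earlier example.

import pandas as pd

# Load a locally stored dataset; file and column names are illustrative.
df = pd.read_csv("my_local_dataset.csv")

# Map your own column names onto the fields Patronus expects.
records = df.rename(columns={
    "question": "evaluated_model_input",
    "reference_answer": "evaluated_model_gold_answer",
}).to_dict(orient="records")

cli.experiment(
    "Project Name",
    data=records,
    task=task,
    evaluators=[evaluator],  # Replace with your evaluators
)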