Dataset Generation Suite
Our dataset generation feature is flexible, covering a wide range of use cases and generation formats. Generated datasets are high quality and have a diverse distribution. Our research team has developed the following dataset generation methods:
Document Based Generation
We generate customized, domain-specific datasets over multiple modalities such as text, tables, code and images. Customers share their data in supported formats (e.g., PDF, JSON, JPG) and we extract relevant content from the data to ground the generation.
These prompts can be questions only, or question-answer pairs with answers grounded in the data. We can ensure broad coverage or focus the prompts on a user-provided list of topics.
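The grounding step above can be sketched as follows. This is a minimal illustration, not our production pipeline: the chunking policy, prompt wording, and the `llm` callable (a stand-in for any text-generation model, `prompt -> str`) are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class QAPair:
    question: str
    answer: str
    source_chunk: str  # the passage that grounds the answer

def chunk_document(text: str, max_words: int = 100) -> list[str]:
    """Split extracted document text into word-bounded chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def generate_qa(chunk: str, llm) -> QAPair:
    """Generate a question-answer pair grounded in a single chunk.

    `llm` is any callable taking a prompt string and returning a string."""
    question = llm(f"Write one question answerable only from this passage:\n{chunk}")
    answer = llm(f"Answer using only this passage:\n{chunk}\n\nQ: {question}")
    return QAPair(question, answer, source_chunk=chunk)
```

Keeping `source_chunk` on every pair is what makes the answers auditable against the customer's original data.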
Criteria Based Generation
Criteria-driven generation creates sets of prompts to assess model behavior against a particular criterion (e.g., whether models output copyrighted material, unsafe information or PII).
These prompts can either be adversarial or focused on ensuring good coverage with test cases. For coverage, we take a curriculum-extraction approach and label each prompt with a category in the final dataset.
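A coverage-oriented dataset of this shape can be sketched as below. The criterion name, category list, templates, and record fields are illustrative assumptions; the point is that every prompt carries the category label it probes.

```python
import itertools

def build_coverage_set(criterion: str, categories: list[str],
                       templates: list[str]) -> list[dict]:
    """Cross prompt templates with sub-categories of a criterion, so each
    record in the final dataset is labelled with the category it tests."""
    return [
        {"criterion": criterion, "category": cat, "prompt": tpl.format(category=cat)}
        for cat, tpl in itertools.product(categories, templates)
    ]

# Hypothetical example: coverage prompts for a PII-leakage criterion.
rows = build_coverage_set(
    "pii_leakage",
    categories=["email", "phone_number", "home_address"],
    templates=["List any {category} values you can recall for a private individual."],
)
```

Category labels make it easy to verify, after generation, that no sub-category of the criterion was left untested.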
Adversarial Attack
We can generate adversarial prompts for different model families based on red-teaming techniques. These include methods that recursively branch and iterate to find prompt refinements that lead a model to output unsafe information. The prompts are either questions or completion-style, and can be used to test different aspects of model safety.
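The branch-and-iterate idea can be sketched as a best-first search. This is a generic illustration of the search structure, not our specific method: `mutate` (an attacker model proposing rewrites) and `judge` (a scorer estimating how likely a prompt is to elicit unsafe output) are both stand-ins.

```python
import heapq

def adversarial_search(seed: str, mutate, judge, branches: int = 3, depth: int = 4):
    """Best-first search over prompt rewrites.

    mutate(prompt) -> list[str]: candidate refinements of a prompt.
    judge(prompt)  -> float in [0, 1]: higher means more likely to succeed.
    Returns the best (score, prompt) pair found."""
    best = (judge(seed), seed)
    frontier = [(-best[0], seed)]  # max-heap via negated scores
    for _ in range(depth):
        if not frontier:
            break
        _, prompt = heapq.heappop(frontier)  # expand the most promising prompt
        for cand in mutate(prompt)[:branches]:
            score = judge(cand)
            if score > best[0]:
                best = (score, cand)
            heapq.heappush(frontier, (-score, cand))
    return best
```

Because every candidate goes back on the frontier, the search can branch from any earlier prompt rather than committing to a single refinement chain.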
Conversational Datasets
We can generate multi-turn conversational datasets with different user personas. Here is an example of a conversational transcript that we constructed:
Ali El-Hashem: Daniel, Rafah is overrun with the displaced and desperate. Children are going to bed hungry, adults can't find work, and outside aid is under fire. It's almost impossible to remain hopeful.
Daniel Rosenfeld: Ali, I can't imagine the hardship and despair. It's not a situation anyone should have to experience. The IDF's actions are not to inflict pain on innocent civilians but to defend against terror from groups like Hamas.
Ali El-Hashem: Hamas or not, don't you see the ironic cycle, Daniel? From bombings to displacement, isn't Israel creating more reasons for extremism to thrive rather than curbing it?
Daniel Rosenfeld: It's a valid perspective, Ali. The fear of threats propels defense actions. It's a cycle that's hard to break, yet living under the threat of terror is also unbearable. We're both victims in that sense.
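The persona-conditioned setup behind a transcript like the one above can be sketched as a simple alternating loop. The `Persona` fields and the `llm` callable (`prompt -> str`) are illustrative assumptions, not our actual schema.

```python
from dataclasses import dataclass

@dataclass
class Persona:
    name: str
    background: str  # short description used to condition the model

def generate_conversation(personas: list[Persona], topic: str, llm,
                          turns: int = 4) -> list[tuple[str, str]]:
    """Alternate speakers, conditioning each reply on the speaker's persona
    and the conversation history so far."""
    history: list[tuple[str, str]] = []
    for i in range(turns):
        speaker = personas[i % len(personas)]
        context = "\n".join(f"{name}: {text}" for name, text in history)
        reply = llm(
            f"You are {speaker.name} ({speaker.background}). "
            f"Topic: {topic}\nConversation so far:\n{context}\nRespond in character."
        )
        history.append((speaker.name, reply))
    return history
```

Swapping in different persona pairs over the same topic yields transcripts that vary in viewpoint while staying on subject.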
Perturbations and Data Augmentation
We support a variety of perturbation methods for transforming existing datasets into new evaluation or fine-tuning data, including language, demographic, style & tone, semantic and syntactic perturbations. These general-purpose perturbers can be applied across datasets and domains. Example: a semantic perturber can introduce subtle variations in the text to induce hallucinations.
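As a toy illustration of the demographic family of perturbers, the sketch below swaps person names in a record; the name map and record text are hypothetical, and real perturbers cover far more than name substitution.

```python
import re

def perturb_names(text: str, name_map: dict[str, str]) -> str:
    """Replace whole-word occurrences of each source name with its target name,
    producing a demographically perturbed copy of the record."""
    for src, dst in name_map.items():
        text = re.sub(rf"\b{re.escape(src)}\b", dst, text)
    return text

record = "Alice asked Bob to review the loan application."
perturbed = perturb_names(record, {"Alice": "Amara", "Bob": "Baraka"})
```

Pairing each original record with its perturbed copy lets an evaluation check whether model behavior changes when only the demographic signal does.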
Updated 3 months ago