DATAGEN: unified synthetic dataset generation via large language models

Huang, Yue
Wu, Siyuan
Gao, Chujie
Chen, Dongping
Zhang, Qihui
Wan, Yao
Zhou, Tianyi
Xiao, Chaowei
Gao, Jianfeng
Sun, Lichao
(1 more author not listed)
Department
Natural Language Processing
Type
Conference proceeding
Date
2025
Language
English
Abstract
Large Language Models (LLMs) such as GPT-4 and Llama3 have significantly impacted various fields by enabling high-quality synthetic data generation and reducing dependence on expensive human-generated datasets. Despite this, challenges remain in the areas of generalization, controllability, diversity, and truthfulness within the existing generative frameworks. To address these challenges, this paper presents DATAGEN, a comprehensive LLM-powered framework designed to produce diverse, accurate, and highly controllable datasets. DATAGEN is adaptable, supporting all types of text datasets and enhancing the generative process through innovative mechanisms. To augment data diversity, DATAGEN incorporates an attribute-guided generation module and a group checking feature. For accuracy, it employs a code-based mathematical assessment for label verification alongside a retrieval-augmented generation technique for factual validation. The framework also allows for user-specified constraints, enabling customization of the data generation process to suit particular requirements. Extensive experiments demonstrate the superior quality of data generated by DATAGEN, and each module within DATAGEN plays a critical role in this enhancement. Additionally, DATAGEN is applied in two practical scenarios: benchmarking LLMs and data augmentation. The results indicate that DATAGEN effectively supports dynamic and evolving benchmarking and that data augmentation improves LLM capabilities in various domains, including agent-oriented abilities and reasoning skills.
© 2025 13th International Conference on Learning Representations, ICLR 2025. All rights reserved.
Citation
Y. Huang et al., “DataGen: Unified Synthetic Dataset Generation via Large Language Models,” International Conference on Representation Learning, vol. 2025, pp. 63739–63773, May 2025
Source
13th International Conference on Learning Representations, ICLR 2025
Conference
13th International Conference on Learning Representations, ICLR 2025
Publisher
International Conference on Learning Representations, ICLR