Scalable Tabular Dataset Labeling with Language Models
Hu, Yaojie ; Fountalis, Ilias ; Tian, Jin ; Vasiloglou, Nikolaos
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
Acquiring tabular datasets for machine learning research can be surprisingly challenging. For example, datasets of executable SQL code paired with tabular data still depend on manual annotation by human experts, which limits their scale. Furthermore, tabular machine learning (ML) problems often impose domain-specific constraints, making customized datasets even harder to obtain. To tackle the tabular dataset labeling problem, we propose a scalable and customizable methodology that synthesizes tabular ML datasets using language model annotations. Our approach leverages the abundance of existing tabular data and the instruction-following capabilities of language models to generate datasets tailored to specific research needs. We apply this method to create a three-part dataset for three tabular ML studies, progressively exploring how language models can augment tabular data. Notably, our method has produced the largest executable SQL dataset to date. In these studies, we demonstrate the scientific utility of our augmented datasets and assess the correctness of LLM-synthesized labels, highlighting the models’ ability to understand tabular data and support ML research.
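To illustrate what "executable SQL paired with tabular data" can mean in practice, the following is a minimal sketch of an executability check. It is not the authors' pipeline: the helper `is_executable`, the table name `t`, the toy rows, and the two candidate queries (which stand in for language-model output) are all hypothetical, and SQLite is assumed only because it ships with Python.

```python
import sqlite3


def is_executable(rows, columns, query, table="t"):
    """Load rows into an in-memory SQLite table and try to run the query.

    Returns (True, result_rows) if the query runs, else (False, error_message).
    """
    conn = sqlite3.connect(":memory:")
    try:
        col_defs = ", ".join(f'"{c}"' for c in columns)
        conn.execute(f'CREATE TABLE "{table}" ({col_defs})')
        placeholders = ", ".join("?" for _ in columns)
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
        return True, conn.execute(query).fetchall()
    except sqlite3.Error as exc:
        return False, str(exc)
    finally:
        conn.close()


# Toy table (made-up values) and hypothetical candidate annotations.
columns = ["city", "population"]
rows = [("A", 500), ("B", 100)]
candidates = [
    "SELECT city FROM t WHERE population > 200",
    "SELECT town FROM t",  # references a column that does not exist
]

# Keep only the candidates that actually execute against the table.
kept = [q for q in candidates if is_executable(rows, columns, q)[0]]
```

A check like this only confirms that a query runs; whether its result matches the intended label is a separate question, which the abstract refers to as assessing the correctness of LLM-synthesized labels.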
Citation
Y. Hu, I. Fountalis, J. Tian, and N. Vasiloglou, “Scalable Tabular Dataset Labeling with Language Models,” Proceedings of the 37th International Conference on Scalable Scientific Data Management, pp. 1–6, Jun. 2025, doi: 10.1145/3733723.3733739
Source
Proceedings of the 37th International Conference on Scalable Scientific Data Management
Conference
SSDBM 2025: 37th International Conference on Scalable Scientific Data Management
Publisher
Association for Computing Machinery
