
Understanding Data Preprocessing for Effective End-to-End Training of DNN

Gong, Ping
Ma, Yuxin
Li, Cheng
Ma, Xiaosong
Noh, Sam H.
Department
Computer Science
Type
Conference proceeding
Date
2026
Language
English
Abstract
In this paper, we focus on understanding the data preprocessing pipeline for DNN training in the public cloud. First, we run experiments to test the performance implications of the two major data preprocessing methods, which consume either raw data files or packed record files. The preliminary results show that data preprocessing is a clear bottleneck, even with the most efficient software and hardware configuration enabled by NVIDIA DALI, a highly optimized data preprocessing library. Second, we identify the potential causes, exercise a variety of optimization methods, and present their pros and cons. We hope this work will shed light on a new co-design of the “data storage and loading pipeline” and the “training framework”, with flexible resource configurations between them, so that resources can be fully exploited and performance can be maximized.
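The two storage layouts contrasted in the abstract, one small file per raw sample versus many samples packed into a single record file, can be sketched in plain Python. This is a minimal illustration of the general idea only; the helper names (`write_raw`, `pack_records`, `read_records`) and the length-prefixed on-disk format are assumptions for this sketch, not the paper's or DALI's actual formats.

```python
import os
import struct
import tempfile

def write_raw(dirname, samples):
    """Layout 1 (raw data): one small file per sample."""
    for i, s in enumerate(samples):
        with open(os.path.join(dirname, f"{i}.bin"), "wb") as f:
            f.write(s)

def pack_records(path, samples):
    """Layout 2 (record file): length-prefixed samples packed sequentially."""
    with open(path, "wb") as f:
        for s in samples:
            f.write(struct.pack("<I", len(s)))  # 4-byte little-endian length
            f.write(s)

def read_records(path):
    """Sequentially scan the record file, recovering every sample."""
    out = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if not header:
                break  # end of file
            (n,) = struct.unpack("<I", header)
            out.append(f.read(n))
    return out

samples = [bytes([i]) * (i + 1) for i in range(5)]
with tempfile.TemporaryDirectory() as d:
    write_raw(d, samples)                    # many tiny files, many opens/seeks
    rec_path = os.path.join(d, "train.rec")
    pack_records(rec_path, samples)          # one file, one sequential scan
    assert read_records(rec_path) == samples
```

The trade-off the experiments probe follows directly from the layouts: the raw-data method pays per-sample file-open and metadata costs but allows trivial random access, while the record-file method amortizes I/O into large sequential reads at the cost of an extra packing step.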
Citation
P. Gong, Y. Ma, C. Li, X. Ma, and S. H. Noh, “Understanding Data Preprocessing for Effective End-to-End Training of DNN,” Lecture Notes in Computer Science, vol. 16062 LNCS, pp. 308–317, 2026, doi: 10.1007/978-981-95-1021-4_23
Source
Lecture Notes in Computer Science
Conference
16th International Symposium on Advanced Parallel Processing Technologies, APPT 2025
Publisher
Springer Nature