Scaling long context training data by long-distance referrals

Zhuang, Yonghao
Hu, Lanxiang
Yun, Longfei
Kundu, Souvik
Liu, Zhengzhong
Xing, Eric P.
Zhang, Hao
Department
Machine Learning
Type
Conference proceeding
Date
2025
Language
English
Abstract
Training large language models for long-context understanding faces a shortage of data. Previous data-engineering approaches mechanically concatenate short documents, which can create many pseudo-long documents but raises concerns about data quality. In this paper, we study the core attribute of high-quality data for long-context training and provide a data pipeline, LongPack, to scale such data. We find that long-distance referrals, which occur in natural long documents, are crucial for long-context training. However, simply concatenating short documents does not reliably generate these relations. We further show that the density of long-distance referrals, which is higher in longer documents, plays a key role in training efficiency, making previous upsampling methods suboptimal. To enrich long documents, we propose LongPack, a data pipeline that constructs long documents by packing shorter ones based on referral relationships. Specifically, for web pages, the primary source for language-model training, we find hyperlinks to be a native signal of such relations. By packing web pages through their hyperlink connections, we can create longer, high-quality documents. Our experiments demonstrate that LongPack is highly scalable, generating a corpus of long documents equivalent in size to an entire pretraining dataset using just 0.5% of root documents. Furthermore, the constructed documents have a 'near-natural' quality comparable to innate long documents for long-context training, reaching a 32.7% higher score than previous state-of-the-art methods. © 2025 13th International Conference on Learning Representations, ICLR 2025. All rights reserved.
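The abstract describes packing a root web page together with the pages reachable through its hyperlinks into one pseudo-long document. A minimal sketch of that idea, using a toy in-memory corpus, a breadth-first traversal, and hypothetical names (`CORPUS`, `pack_by_hyperlinks`) that are illustrative assumptions rather than the authors' implementation, might look like:

```python
from collections import deque

# Toy corpus: page URL -> (text, outgoing hyperlinks).
# Illustrative data only; the real pipeline operates on web-scale crawls.
CORPUS = {
    "root": ("Root page text.", ["a", "b"]),
    "a": ("Page A text, refers back to root.", ["c"]),
    "b": ("Page B text.", []),
    "c": ("Page C text.", []),
}

def pack_by_hyperlinks(root, corpus, max_pages=4):
    """Breadth-first walk of hyperlinks from a root page, concatenating
    page texts into one longer document.  Visited URLs are tracked so
    link cycles between pages do not cause repetition."""
    visited, order = {root}, []
    queue = deque([root])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        text, links = corpus[url]
        order.append(text)
        for link in links:
            if link in corpus and link not in visited:
                visited.add(link)
                queue.append(link)
    return "\n\n".join(order)

packed = pack_by_hyperlinks("root", CORPUS)
```

Because linked pages refer to one another (e.g. page A mentions the root), concatenating them in traversal order yields long-distance referrals inside the packed document, which is the property the paper identifies as essential.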
Citation
Y. Zhuang et al., “Scaling Long Context Training Data by Long-Distance Referrals,” International Conference on Learning Representations, vol. 2025, pp. 101607–101624, May 2025
Source
13th International Conference on Learning Representations, ICLR 2025
Conference
13th International Conference on Learning Representations, ICLR 2025
Publisher
International Conference on Learning Representations, ICLR