PSscheduler: A parameter synchronization scheduling algorithm for distributed machine learning in reconfigurable optical networks
Liu, Ling ; Xu, Xiaoqiong ; Zhou, Pan ; Chen, Xi ; Ergu, Daji ; Yu, Hongfang ; Sun, Gang ; Guizani, Mohsen
Department
Machine Learning
Type
Journal article
Date
2025
Language
English
Abstract
With the increasing size of training datasets and models, the parameter synchronization stage puts a heavy burden on the network, and communication has become one of the main performance bottlenecks of distributed machine learning (DML). Concurrently, optical circuit switches (OCS), which offer high bandwidth and reconfigurability, have increasingly been introduced into network topologies, yielding reconfigurable optical networks. OCS can accelerate the parameter synchronization stage and thus improve training performance. However, a poorly designed circuit scheduling algorithm can greatly increase parameter synchronization time because of the non-negligible OCS switching delay. Moreover, most existing circuit scheduling algorithms do not effectively exploit the training characteristics of DML, so their performance gains are limited. Therefore, in this paper, we study parameter synchronization scheduling in reconfigurable optical networks and propose PSscheduler, which jointly optimizes circuit scheduling and the deployment of parameter servers in the parameter server (PS) architecture. Specifically, we first establish a mathematical optimization model that takes into account the deployment of parameter servers, the allocation of parameter blocks, and circuit scheduling. The model is then solved using variable relaxation and a deterministic rounding approach. Simulation results based on real DML workloads demonstrate that, compared to Sunflow and HLF, PSscheduler is more stable and can reduce parameter synchronization time (PST) by up to 46.61% and 25%, respectively.
Citation
L. Liu et al., “PSscheduler: A parameter synchronization scheduling algorithm for distributed machine learning in reconfigurable optical networks,” Neurocomputing, vol. 616, p. 128876, Feb. 2025, doi: 10.1016/J.NEUCOM.2024.128876.
Source
Neurocomputing
Keywords
Distributed machine learning (DML), Parameter server (PS) architecture, Reconfigurable optical network, Parameter synchronization scheduling, Training performance
Publisher
Elsevier
