
RLPTO: A Reinforcement Learning-Based Performance-Time Optimized Task and Resource Scheduling Mechanism for Distributed Machine Learning / Lu, X.; Liu, C.; Zhu, S.; Mao, Y.; Lio, P.; Hui, P.. - In: IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS. - ISSN 1045-9219. - 34:12(2023), pp. 3266-3279. [10.1109/TPDS.2023.3317388]

RLPTO: A Reinforcement Learning-Based Performance-Time Optimized Task and Resource Scheduling Mechanism for Distributed Machine Learning

Lio P.;
2023

Abstract

With the wide application of deep learning, the amount of data required to train deep learning models keeps growing, which increases training time and the demand for computing resources. To improve the throughput of a distributed learning system, both task scheduling and resource scheduling are required. This article proposes combining ARIMA and GRU models to predict the future task volume. For task scheduling, multi-priority task queues divide tasks into different queues according to their priorities, ensuring that high-priority tasks are completed first. For resource scheduling, reinforcement learning is adopted to manage the limited computing resources. The reward function of the reinforcement learning agent is constructed from the resources occupied by a task, its training time, and the accuracy of the model. When a distributed learning model approaches convergence, its computing resources are gradually reduced so that they can be allocated to other learning tasks. Experimental results demonstrate that RLPTO tends to use more computing nodes when facing tasks with large data scale and has good scalability. The distributed learning system reward experiment shows that RLPTO enables the computing cluster to obtain the largest reward.
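The abstract names two mechanisms without giving their exact formulations: multi-priority task queues and a reward built from resources, time, and accuracy, with resources released as a model converges. A minimal illustrative sketch follows; the weights `w_res`, `w_time`, `w_acc`, the convergence threshold `eps`, and the class and function names are all assumptions for illustration, not the paper's actual design.

```python
import heapq

class MultiPriorityQueues:
    """Sketch of multi-priority task queues: each task enters the queue
    matching its priority, and dispatch always drains higher-priority
    queues first, so high-priority tasks are completed in advance."""
    def __init__(self, levels=3):
        self.queues = [[] for _ in range(levels)]  # index 0 = highest priority

    def submit(self, priority, arrival_time, task):
        # Within a queue, tasks are ordered by arrival time.
        heapq.heappush(self.queues[priority], (arrival_time, task))

    def next_task(self):
        for q in self.queues:               # scan from highest priority down
            if q:
                return heapq.heappop(q)[1]
        return None                         # no pending tasks

def reward(resources_used, training_time, accuracy,
           w_res=0.3, w_time=0.3, w_acc=0.4):
    """Hypothetical weighted combination of the three factors the abstract
    names: accuracy raises the reward; resource usage and time lower it."""
    return w_acc * accuracy - w_res * resources_used - w_time * training_time

def reallocate(nodes, acc_history, eps=1e-3, min_nodes=1):
    """Release a node once accuracy improvements fall below eps — a
    stand-in for 'the model tends to converge' — freeing it for other tasks."""
    if len(acc_history) >= 2 and abs(acc_history[-1] - acc_history[-2]) < eps:
        return max(min_nodes, nodes - 1)
    return nodes
```

For example, a task submitted to priority level 0 is dispatched before any task in level 1 or 2 regardless of arrival order, and a job whose accuracy has plateaued gives one of its nodes back to the cluster.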
Cloud computing; distributed computing; dynamic scheduling; reinforcement learning; resource management; scheduling algorithms
01 Journal publication::01a Journal article
Files attached to this product
There are no files associated with this product.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1723968

Warning! The displayed data have not been validated by the university.

Citazioni
  • PMC: ND
  • Scopus: 1
  • Web of Science (ISI): 0