Giuseppi, A.; Pietrabissa, A. (2022). Bellman's principle of optimality and deep reinforcement learning for time-varying tasks. International Journal of Control, 95(9), pp. 2448-2459. ISSN 0020-7179. DOI: 10.1080/00207179.2021.1913516
Bellman's principle of optimality and deep reinforcement learning for time-varying tasks
Giuseppi A. (co-first author); Pietrabissa A. (co-first author)
2022
Abstract
This paper presents, to the best of the authors' knowledge, the first framework to address time-varying objectives in finite-horizon Deep Reinforcement Learning (DeepRL), based on a switching control solution grounded in Bellman's principle of optimality. By augmenting the state space of the system with information on its visit time, the DeepRL agent is able to solve problems in which its task changes dynamically within the same episode. To address the scalability problems caused by the state space augmentation, we propose a procedure that partitions the episode length into separate sub-problems, which are then solved by specialised DeepRL agents. Contrary to standard solutions, with the proposed approach the DeepRL agents correctly estimate the value function at each time step and are hence able to solve time-varying tasks. Numerical simulations validate the approach in a classic RL environment.
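The abstract describes two mechanisms: augmenting the state with the visit time, and partitioning the episode into sub-problems handled by specialised agents under a switching rule. The following is a minimal Python sketch of those two ideas only, not the authors' implementation; `TimeAugmentedEnv`, `run_episode`, the `boundaries` convention, and the `agents[k].act()` interface are all hypothetical, and a classic Gym-style `step()` returning `(obs, reward, done, info)` is assumed.

```python
# Hypothetical sketch of the two ideas in the abstract (not the paper's code).
import numpy as np

class TimeAugmentedEnv:
    """Wraps an episodic environment so observations also carry the
    normalized visit time, making time-varying tasks Markovian."""

    def __init__(self, env, horizon):
        self.env = env
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        obs = self.env.reset()
        return np.append(obs, self.t / self.horizon)

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.t += 1
        done = done or self.t >= self.horizon
        return np.append(obs, self.t / self.horizon), reward, done, info


def run_episode(env, agents, boundaries):
    """Switching control: sub-problem k starts at time boundaries[k] and is
    handled by its own specialised agent, e.g. boundaries = [0, 50] for two
    agents over a 100-step horizon. Each agent exposes a (hypothetical)
    act(obs) method."""
    obs, total, t, done = env.reset(), 0.0, 0, False
    while not done:
        k = int(np.searchsorted(boundaries, t, side="right")) - 1
        obs, reward, done, _ = env.step(agents[k].act(obs))
        total += reward
        t += 1
    return total
```

Under these assumptions, each agent only ever sees time indices from its own partition, which is what keeps the augmented state space tractable for training.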
| File | Access | Type | License | Size | Format |
|---|---|---|---|---|---|
| Giuseppi_preprint_Bellmans_2021.pdf (note: https://doi.org/10.1080/00207179.2021.1913516) | Open access | Preprint (manuscript submitted to the publisher, prior to peer review) | All rights reserved | 530.59 kB | Adobe PDF |
| Giuseppi_Bellman's_2021.pdf | Archive administrators only (contact the author) | Publisher's version (published version with the publisher's layout) | All rights reserved | 2.13 MB | Adobe PDF |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.