Pure-Past Action Masking / Varricchione, Giovanni; Alechina, Natasha; Dastani, Mehdi; De Giacomo, Giuseppe; Logan, Brian; Perelli, Giuseppe. - 38:19(2024), pp. 21646-21655. (National Conference of the American Association for Artificial Intelligence, Vancouver, Canada) [10.1609/aaai.v38i19.30163].
Pure-Past Action Masking
Varricchione, Giovanni; Alechina, Natasha; Dastani, Mehdi; De Giacomo, Giuseppe; Logan, Brian; Perelli, Giuseppe
2024
Abstract
We present Pure-Past Action Masking (PPAM), a lightweight approach to action masking for safe reinforcement learning. In PPAM, actions are disallowed (“masked”) according to specifications expressed in Pure-Past Linear Temporal Logic (PPLTL). PPAM can enforce non-Markovian constraints, i.e., constraints based on the history of the system rather than just the current state of the (possibly hidden) MDP. The features used in the safety constraint need not be the same as those used by the learning agent, allowing a clear separation of concerns between the safety constraints and the reward specification of the (learning) agent. We prove formally that an agent trained with PPAM can learn any optimal policy that satisfies the safety constraints, and that PPAM is as expressive as shields, another approach to enforcing non-Markovian constraints in RL. Finally, we provide empirical results showing how PPAM can guarantee constraint satisfaction in practice.
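To make the idea of masking actions from a pure-past specification concrete, here is a minimal Python sketch. It is an illustration only, not the authors' implementation. The tuple encoding of formulas, the class names `PPLTLMonitor` and `ActionMasker`, and the convention that an action is masked whenever its formula holds on the history so far are all assumptions made for this example. The sketch relies on a standard property of pure-past formulas: their truth value at the current step can be computed from the current labels and the subformula values at the previous step, so the full history never has to be stored.

```python
from dataclasses import dataclass, field

# Hypothetical formula encoding (not taken from the paper):
#   ("atom", name) | ("not", f) | ("and", f, g) | ("or", f, g)
#   ("Y", f)    yesterday: f held at the previous step
#   ("O", f)    once: f held at some step up to and including now
#   ("H", f)    historically: f held at every step up to and including now
#   ("S", f, g) since: g held at some step, and f has held ever since


@dataclass
class PPLTLMonitor:
    formula: tuple
    prev: dict = field(default_factory=dict)  # subformula -> value at previous step
    started: bool = False

    def step(self, labels):
        """Consume the set of atoms true at this step; return the formula's value."""
        now = {}
        result = self._eval(self.formula, labels, now)
        self.prev, self.started = now, True
        return result

    def _eval(self, f, labels, now):
        if f in now:
            return now[f]
        op = f[0]
        if op == "atom":
            v = f[1] in labels
        elif op == "not":
            v = not self._eval(f[1], labels, now)
        elif op in ("and", "or"):
            # Evaluate both sides so every subformula is recorded for the next step.
            a = self._eval(f[1], labels, now)
            b = self._eval(f[2], labels, now)
            v = (a and b) if op == "and" else (a or b)
        elif op == "Y":
            self._eval(f[1], labels, now)  # record the current value for the next step
            v = self.started and self.prev.get(f[1], False)
        elif op == "O":
            v = self._eval(f[1], labels, now) or self.prev.get(f, False)
        elif op == "H":
            v = self._eval(f[1], labels, now) and self.prev.get(f, True)
        elif op == "S":
            a = self._eval(f[1], labels, now)
            b = self._eval(f[2], labels, now)
            v = b or (a and self.prev.get(f, False))
        else:
            raise ValueError(f"unknown operator: {op!r}")
        now[f] = v
        return v


class ActionMasker:
    """Assumed convention: an action is masked while its formula holds."""

    def __init__(self, constraints):
        self.monitors = {a: PPLTLMonitor(f) for a, f in constraints.items()}

    def observe(self, labels):
        """Advance every monitor by one step; return the set of masked actions."""
        return {a for a, m in self.monitors.items() if m.step(labels)}


# Example constraints over safety features ("key", "hazard") that the learning
# agent itself need not observe:
#   "unlock" is masked until a key has been seen at some point in the past;
#   "move" is masked at the step right after a hazard was seen.
masker = ActionMasker({
    "unlock": ("not", ("O", ("atom", "key"))),
    "move": ("Y", ("atom", "hazard")),
})
actions = ["unlock", "move", "wait"]
masked = masker.observe(set())
allowed = [a for a in actions if a not in masked]  # actions the agent may choose from
```

Both example constraints are non-Markovian in the sense used in the abstract: whether an action is masked depends on what was observed earlier, not only on the current labels. An RL agent would restrict its choice to `allowed` at each step, calling `observe` once per environment step.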
| File | Size | Format |
|---|---|---|
| Varricchione_Pure-Past-Action_2024.pdf | 294.49 kB | Adobe PDF |

Access: open access
Note: https://ojs.aaai.org/index.php/AAAI/article/view/30163/32063 - DOI: https://doi.org/10.1609/aaai.v38i19.30163
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


