Pure-Past Action Masking

Varricchione, Giovanni;Alechina, Natasha;Dastani, Mehdi;De Giacomo, Giuseppe;Logan, Brian;Perelli, Giuseppe

2024

Abstract

We present Pure-Past Action Masking (PPAM), a lightweight approach to action masking for safe reinforcement learning. In PPAM, actions are disallowed (“masked”) according to specifications expressed in Pure-Past Linear Temporal Logic (PPLTL). PPAM can enforce non-Markovian constraints, i.e., constraints based on the history of the system, rather than just the current state of the (possibly hidden) MDP. The features used in the safety constraint need not be the same as those used by the learning agent, allowing a clear separation of concerns between the safety constraints and reward specifications of the (learning) agent. We prove formally that an agent trained with PPAM can learn any optimal policy that satisfies the safety constraints, and that PPAM is as expressive as shields, another approach to enforcing non-Markovian constraints in RL. Finally, we provide empirical results showing how PPAM can guarantee constraint satisfaction in practice.

Year of publication: 2024

Conference name: National Conference of the American Association for Artificial Intelligence

Keywords: temporal reasoning; reinforcement learning; temporal logics

Type: 04 Publication in conference proceedings::04b Conference paper in volume

Citation: Pure-Past Action Masking / Varricchione, Giovanni; Alechina, Natasha; Dastani, Mehdi; De Giacomo, Giuseppe; Logan, Brian; Perelli, Giuseppe. - 38:19(2024), pp. 21646-21655. (Paper presented at the National Conference of the American Association for Artificial Intelligence, held in Vancouver, Canada) [10.1609/aaai.v38i19.30163].

Files attached to this item

There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1707830

Warning: the data displayed have not been validated by the university.
