Self-attention as an attractor network: transient memories without backpropagation / D'Amico, Francesco; Negri, Matteo. - (2024), pp. 1-6. (2024 IEEE Workshop on Complexity in Engineering, COMPENG 2024) [10.1109/COMPENG60905.2024.10741429].
Self-attention as an attractor network: transient memories without backpropagation
D'Amico, Francesco (first author)
Negri, Matteo (second author)
2024
Abstract
Transformers are among the most successful architectures of modern neural networks. At their core lies the attention mechanism, which has recently attracted interest from the physics community because in certain cases it can be written as the derivative of an energy function: while the cross-attention layer can be written as a modern Hopfield network, the same is not possible for self-attention, which is used in GPT architectures and other autoregressive models. In this work we show that the self-attention layer can be obtained as the derivative of local energy terms that resemble a pseudo-likelihood. We leverage the analogy with pseudo-likelihood to design a recurrent model that can be trained without backpropagation: its dynamics exhibits transient states that are strongly correlated with both train and test examples. Overall, we present a novel framework that interprets self-attention as an attractor network, potentially paving the way for new physics-inspired theoretical approaches to understanding transformers.
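
To illustrate the kind of relation the abstract refers to, here is a minimal sketch of how a per-token, pseudo-likelihood-style local energy can have a self-attention-like update as its gradient. The symbols used here (inverse temperature \(\beta\), query/key maps \(W_Q, W_K\), tokens \(x_i\)) are illustrative assumptions and not necessarily the exact construction of the paper:

\[
E_i(x_i \mid x_{\neq i}) \;=\; -\frac{1}{\beta}\,\log \sum_{j \neq i} \exp\!\big(\beta\, q_i^{\top} k_j\big),
\qquad q_i = W_Q x_i,\quad k_j = W_K x_j,
\]
\[
-\frac{\partial E_i}{\partial q_i} \;=\; \sum_{j \neq i} \operatorname{softmax}_j\!\big(\beta\, q_i^{\top} k_{\cdot}\big)\, k_j .
\]

That is, the gradient of each local log-sum-exp term is a softmax-weighted sum over the other tokens, which has the same structure as a self-attention output (here with values identified with keys); iterating an update of this kind is what allows the layer to be read as an attractor network.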


