Self-attention as an attractor network: transient memories without backpropagation

D'Amico, Francesco (first author); Negri, Matteo (second author)

Abstract

Transformers are among the most successful architectures in modern neural networks. At their core is the so-called attention mechanism, which has recently attracted interest from the physics community because, in certain cases, it can be written as the derivative of an energy function: while the cross-attention layer can be written as a modern Hopfield network, the same is not possible for self-attention, which is used in GPT architectures and other autoregressive models. In this work we show that the self-attention layer can be obtained as the derivative of local energy terms that resemble a pseudo-likelihood. We leverage the analogy with pseudo-likelihood to design a recurrent model that can be trained without backpropagation: the dynamics exhibits transient states that are strongly correlated with both training and test examples. Overall, we present a novel framework that interprets self-attention as an attractor network, potentially paving the way for new physics-inspired theoretical approaches to understanding transformers.
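A minimal numerical sketch of the energy-derivative idea, under simplifying assumptions that are not the paper's exact construction (queries and keys tied through a single hypothetical matrix W_qk, keys reused as values, and a log-sum-exp local energy per token): the gradient of such a local energy with respect to one token's query reproduces a softmax-weighted, attention-like combination of the keys.

import numpy as np

rng = np.random.default_rng(0)
T, d = 5, 8                       # sequence length, embedding dimension
X = rng.normal(size=(T, d))       # token embeddings
W_qk = rng.normal(size=(d, d)) / np.sqrt(d)   # tied query/key map (illustrative assumption)
beta = 1.0 / np.sqrt(d)           # inverse temperature / attention scaling

i = 2                             # token whose local energy we inspect
q = X[i] @ W_qk                   # query of token i
K = X                             # keys (here simply the embeddings)

def local_energy(q, K, beta):
    # E_i(q) = -(1/beta) * log sum_j exp(beta * q . k_j)  (log-sum-exp local energy)
    scores = beta * (K @ q)
    return -(1.0 / beta) * np.log(np.exp(scores).sum())

# Analytic claim: -dE_i/dq = sum_j softmax_j(beta * q . k_j) * k_j,
# i.e. an attention-weighted average of the keys.
scores = beta * (K @ q)
weights = np.exp(scores - scores.max())
weights /= weights.sum()
attention_out = weights @ K

# Central-difference numerical gradient of the local energy, for comparison.
eps = 1e-6
num_grad = np.array([
    (local_energy(q + eps * e, K, beta) - local_energy(q - eps * e, K, beta)) / (2 * eps)
    for e in np.eye(d)
])

print(np.allclose(-num_grad, attention_out, atol=1e-5))   # True: gradient matches attention output

This only checks the elementary log-sum-exp identity behind the construction; the paper's local energies, their pseudo-likelihood interpretation, and the backpropagation-free training scheme go beyond this toy check.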
Year: 2024
Conference: 2024 IEEE Workshop on Complexity in Engineering, COMPENG 2024
Keywords: Associative memories; Transformers; Self-Attention
Publication type: 04 Conference proceedings publication::04b Conference paper in a volume
Citation: Self-attention as an attractor network: transient memories without backpropagation / D'Amico, Francesco; Negri, Matteo. - (2024), pp. 1-6. (2024 IEEE Workshop on Complexity in Engineering, COMPENG 2024) [10.1109/COMPENG60905.2024.10741429].

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1749976