
SimPoolFormer: A two-stream vision transformer for hyperspectral image classification / Roy, Swalpa Kumar; Jamali, Ali; Chanussot, Jocelyn; Ghamisi, Pedram; Ghaderpour, Ebrahim; Shahabi, Himan. - In: REMOTE SENSING APPLICATIONS. - ISSN 2352-9385. - 37:(2025). [10.1016/j.rsase.2025.101478]

SimPoolFormer: A two-stream vision transformer for hyperspectral image classification

Ghaderpour, Ebrahim; 2025

Abstract

The ability of vision transformers (ViTs) to accurately model global dependencies has transformed vision research. However, their drawbacks, including high computational costs, dependence on large labeled datasets, and a limited capacity to capture essential local features, have motivated the search for more efficient alternatives. Vision multilayer perceptron (MLP) architectures, in contrast, have shown excellent capability in image classification, performing comparably to or even better than widely used state-of-the-art ViTs and convolutional neural networks (CNNs). Vision MLPs have linear computational complexity, require less training data, and can capture long-range dependencies through transformer-like mechanisms at much lower computational cost. In this paper, a novel deep learning architecture, SimPoolFormer, is developed to address these shortcomings of vision transformers. SimPoolFormer is a two-stream attention-in-attention vision transformer built from two computationally efficient networks: it replaces the computationally intensive multi-head self-attention in ViT with SimPool for efficiency, while ResMLP, with its linear attention-based design, is adopted in a second stream to enhance hyperspectral image (HSI) classification. Results show that SimPoolFormer significantly outperforms several other deep learning models, including 1D-CNN, 2D-CNN, RNN, VGG-16, EfficientNet, ResNet-50, and ViT, on three complex HSI datasets: QUH-Tangdaowan, QUH-Qingyun, and QUH-Pingan. For example, in terms of average accuracy on the QUH-Qingyun dataset, SimPoolFormer improved HSI classification over 2D-CNN, VGG-16, EfficientNet, ViT, ResNet-50, RNN, and 1D-CNN by 0.98%, 3.81%, 4.16%, 7.94%, 9.45%, 12.25%, and 13.95%, respectively.
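As a rough illustration of the two components named in the abstract, the sketch below shows a SimPool-style attention pooling step (a global-average-pooled query attends over patch tokens) and a ResMLP-style block (cross-patch linear token mixing followed by a per-token channel MLP, each with a residual connection). This is a minimal NumPy sketch of the published ideas, not the authors' implementation: the weight matrices `Wq`, `Wk`, `W_patch`, `W1`, `W2` are hypothetical placeholders, and normalization layers and the GELU activation used in the actual networks are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def simpool(tokens, Wq, Wk):
    """SimPool-style pooling: tokens is (n, d) patch embeddings.
    A single query is formed by global average pooling, then attends
    over all patch tokens to produce one pooled descriptor."""
    q = tokens.mean(axis=0, keepdims=True) @ Wq        # GAP query, (1, d)
    k = tokens @ Wk                                    # keys, (n, d)
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]))      # attention weights, (1, n)
    return attn @ tokens                               # pooled output, (1, d)

def resmlp_block(x, W_patch, W1, W2):
    """ResMLP-style block: x is (n, d). Cross-patch linear mixing,
    then a per-token channel MLP; both with residual connections.
    (ReLU stands in here for the GELU used in ResMLP.)"""
    x = x + W_patch @ x                                # token mixing across patches
    x = x + np.maximum(x @ W1, 0.0) @ W2               # channel MLP
    return x
```

The pooled `(1, d)` descriptor plays the role a class token would play in a standard ViT head, while the ResMLP block keeps per-layer cost linear in the number of tokens' channel dimension.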
Deep learning; HSI; Hyperspectral data; MLP; Vision transformer; ViT
01 Journal publication::01a Journal article
Files attached to this item
File: Roy_SimPoolFormer_2025.pdf (archive administrators only)
Type: Publisher's version (published with the publisher's layout)
License: All rights reserved
Size: 5.73 MB
Format: Adobe PDF
Contact the author

Documents in IRIS are protected by copyright, and all rights are reserved unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1746244
Citations
  • Scopus: 16
  • Web of Science: 13