Efficient attention vision transformers for monocular depth estimation on resource-limited hardware

Schiavella, Claudio (Methodology); Cirillo, Lorenzo (Methodology); Papa, Lorenzo (Conceptualization); Russo, Paolo (Supervision); Amerini, Irene (Supervision)
2025

Abstract

Vision Transformers have achieved notable results in the current Deep Learning landscape, tackling complex and dense tasks such as Monocular Depth Estimation. However, in the transformer architecture the attention module introduces a computational cost that grows quadratically with the number of processed tokens. In dense Monocular Depth Estimation tasks, this inherently high computational complexity results in slow inference and poses significant challenges, particularly in resource-constrained onboard applications. To mitigate this issue, efficient attention modules have been developed. In this paper, we leverage these techniques to reduce the computational cost of networks designed for Monocular Depth Estimation and to reach an optimal trade-off between the quality of the results and inference speed. More specifically, the optimization is applied not only to the entire network but also independently to the encoder and the decoder, to assess the model's sensitivity to these modifications. Additionally, this paper introduces the use of the Pareto Frontier as an analytic method to identify the optimal trade-off between the two objectives of quality and inference time. The results indicate that several optimized networks achieve performance comparable to, and in some cases surpassing, their respective baselines, while significantly improving inference speed.
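
The quadratic cost mentioned in the abstract comes from the n x n attention score matrix. Below is a minimal PyTorch sketch, not taken from the paper, contrasting standard softmax attention with one common linear-attention variant of the kind used in efficient attention modules; the tensor shapes and the specific normalization are illustrative assumptions.

import torch

def softmax_attention(q, k, v):
    # Standard attention: the (n x n) score matrix makes the cost
    # quadratic in the number of tokens n.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ v

def linear_attention(q, k, v):
    # One common efficient-attention variant: normalize queries and keys
    # separately and reorder the matrix products, so the cost grows
    # linearly with the number of tokens.
    q = q.softmax(dim=-1)               # softmax over channels
    k = k.softmax(dim=-2)               # softmax over tokens
    context = k.transpose(-2, -1) @ v   # (d x d) context instead of (n x n) scores
    return q @ context

x = torch.randn(1, 1024, 64)            # (batch, tokens, channels)
print(softmax_attention(x, x, x).shape, linear_attention(x, x, x).shape)

Likewise, a hedged sketch of how a Pareto Frontier can be extracted over (quality, inference time) pairs as the abstract describes; the model names and numbers are hypothetical placeholders, not results from the paper.

def pareto_frontier(points):
    # Keep the points no other point dominates, where dominating means being
    # no worse in both objectives (lower error, lower time) and strictly
    # better in at least one.
    frontier = []
    for name, err, t in points:
        dominated = any(
            e2 <= err and t2 <= t and (e2 < err or t2 < t)
            for _, e2, t2 in points
        )
        if not dominated:
            frontier.append((name, err, t))
    return frontier

# Hypothetical (model, depth-estimation RMSE, inference time in ms) tuples.
candidates = [
    ("baseline",          0.35, 120.0),
    ("efficient-encoder", 0.36,  80.0),
    ("efficient-decoder", 0.34,  95.0),
    ("efficient-full",    0.38,  60.0),
]

for name, err, t in pareto_frontier(candidates):
    print(f"{name}: RMSE={err}, time={t} ms")
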
Computer vision; Edge devices; Efficient vision transformer; Monocular depth estimation
01 Journal publication::01a Journal article
Efficient attention vision transformers for monocular depth estimation on resource-limited hardware / Schiavella, Claudio; Cirillo, Lorenzo; Papa, Lorenzo; Russo, Paolo; Amerini, Irene. - In: SCIENTIFIC REPORTS. - ISSN 2045-2322. - 15:1(2025). [10.1038/s41598-025-06112-8]
Files attached to this item

File: Schiavella_Efficient attention_2025.pdf
Access: open access
Note: https://www.nature.com/articles/s41598-025-06112-8
Type: Publisher's version (published with the publisher's layout)
License: Creative Commons
Size: 4.52 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1742558