IoT and edge devices that capture data from their surroundings are becoming increasingly popular. However, onboard analysis of the acquired data is usually limited by the devices' computational capabilities. Consequently, the most recent and accurate deep learning architectures, such as Vision Transformers (ViT) and their hybrid (hViT) versions, are typically too computationally heavy for onboard inference. This work therefore investigates the impact of efficient ViT methodologies on the monocular depth estimation (MDE) task, which computes a depth map from an RGB image. MDE is a critical capability that allows autonomous and robotic systems to perceive the surrounding environment. In more detail, this work leverages solutions designed to reduce the computational cost of self-attention, the fundamental element on which ViTs are based, and applies them to the METER architecture, a lightweight model designed for the MDE task that can be further enhanced. The proposed efficient variants, Meta-METER and Pyra-METER, achieve average speed boosts of 41.4% and 34.4%, respectively, over the original model across a variety of edge devices, with limited degradation of estimation capabilities when tested on the indoor NYU dataset.
Optimize Vision Transformer Architecture via Efficient Attention Modules: A Study on the Monocular Depth Estimation Task / Schiavella, C.; Cirillo, L.; Papa, L.; Russo, P.; Amerini, I. - 14365 (2024), pp. 383-394. (Paper presented at the conference Proceedings of the 22nd International Conference on Image Analysis and Processing, ICIAP 2023, held in Italy) [10.1007/978-3-031-51023-6_32].
Optimize Vision Transformer Architecture via Efficient Attention Modules: A Study on the Monocular Depth Estimation Task
Schiavella C.; Cirillo L.; Papa L.; Russo P.; Amerini I.
2024