A Lightweight Model for Accurate Multi-View Hand Pose Recognition

Marini, Marco
2026

Abstract

This paper introduces a lightweight architecture for multi-view hand pose recognition based on multimodal fusion of images and landmarks. The proposed model employs a compact Convolutional Neural Network (CNN) to extract visual features from dual-view grayscale images, while a Multi-Layer Perceptron (MLP) processes the corresponding Leap Motion Controller 2 hand landmarks. The two modalities are fused to create an efficient yet discriminative representation. Compared to the Vision Transformer (ViT)+MLP baseline, which achieves an F1 score of 79.33 ± 0.09 % with 8.95 × 10^7 parameters, our CNN+MLP model reaches a higher recognition accuracy of 85.36 ± 0.08 % while requiring only 2.13 × 10^5 parameters, a roughly 420-fold reduction in model size. Moreover, a landmarks-only variant using the MLP achieves 85.22 ± 0.08 % accuracy with just 6.46 × 10^4 parameters. These results, obtained on the Multi-view Leap2 Hand Pose Dataset under a Leave-One-Subject-Out Cross-Validation protocol, demonstrate that accurate multi-view hand pose recognition can be achieved with dramatically fewer parameters, enabling efficient deployment in resource-constrained environments.
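The size comparison in the abstract can be checked with a few lines of arithmetic. The sketch below uses only the parameter counts reported above; the dictionary and function names are illustrative and do not come from the paper.

```python
# Parameter counts as reported in the abstract (illustrative names, not the authors' code).
REPORTED_PARAMS = {
    "ViT+MLP": 8.95e7,               # baseline
    "CNN+MLP": 2.13e5,               # proposed image + landmark fusion model
    "MLP (landmarks only)": 6.46e4,  # landmarks-only variant
}


def reduction_factor(baseline: str, candidate: str) -> float:
    """Return how many times fewer parameters `candidate` has than `baseline`."""
    return REPORTED_PARAMS[baseline] / REPORTED_PARAMS[candidate]


if __name__ == "__main__":
    for name in ("CNN+MLP", "MLP (landmarks only)"):
        factor = reduction_factor("ViT+MLP", name)
        print(f"{name}: about {factor:.0f}x fewer parameters than ViT+MLP")
```

Under these figures the CNN+MLP model is roughly 420 times smaller than the ViT+MLP baseline, and the landmarks-only MLP roughly 1385 times smaller.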
2026
the 18th International Conference on Agents and Artificial Intelligence
Multi-view Hand Pose Recognition; Lightweight Model; Leap Motion Controller 2; Multimodal data; Multimodal fusion; Deep Learning.
04 Publication in conference proceedings::04b Conference paper in volume
A Lightweight Model for Accurate Multi-View Hand Pose Recognition / Kadyrzhanov, Artur; Esteban-Romero, Sergio; Gil-Martín, Manuel; Marini, Marco. - (2026), pp. 2342-2349. (18th International Conference on Agents and Artificial Intelligence, Marbella, Spain) [10.5220/0014235600004052].

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1768221