A Lightweight Model for Accurate Multi-View Hand Pose Recognition / Kadyrzhanov, Artur; Esteban-Romero, Sergio; Gil-Martín, Manuel; Marini, Marco. - (2026), pp. 2342-2349. (18th International Conference on Agents and Artificial Intelligence, Marbella, Spain) [10.5220/0014235600004052].
A Lightweight Model for Accurate Multi-View Hand Pose Recognition
Marini, Marco
2026
Abstract
This paper introduces a lightweight architecture for multi-view hand pose recognition based on multimodal fusion of images and landmarks. The proposed model employs a compact Convolutional Neural Network (CNN) to extract visual features from dual-view grayscale images, while a Multi-Layer Perceptron (MLP) processes the corresponding Leap Motion Controller 2 hand landmarks. The two modalities are fused to create an efficient yet discriminative representation. Compared to the Vision Transformer (ViT)+MLP baseline, which achieves an F1 score of 79.33 ± 0.09 % with 8.95 × 10^7 parameters, our CNN+MLP model reaches a higher recognition accuracy of 85.36 ± 0.08 % while requiring only 2.13 × 10^5 parameters, a roughly 420-fold reduction in model size. Moreover, a landmarks-only variant using the MLP achieves 85.22 ± 0.08 % accuracy with just 6.46 × 10^4 parameters. These results, obtained on the Multi-view Leap2 Hand Pose Dataset under a Leave-One-Subject-Out Cross-Validation protocol, demonstrate that accurate multi-view hand pose recognition can be achieved with dramatically fewer parameters, enabling efficient deployment in resource-constrained environments.
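The Leave-One-Subject-Out Cross-Validation protocol mentioned in the abstract can be illustrated with a minimal sketch. It assumes only that every sample carries a subject identifier. The function name `leave_one_subject_out` and the toy subject labels are hypothetical and are not taken from the paper or its dataset.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def leave_one_subject_out(
    subject_ids: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, List[int], List[int]]]:
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    # Group sample indices by the subject they were recorded from.
    by_subject: Dict[Hashable, List[int]] = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)

    # Each subject is held out exactly once; all other subjects form the training set.
    for held_out, test_indices in by_subject.items():
        train_indices = [
            i
            for subject, indices in by_subject.items()
            if subject != held_out
            for i in indices
        ]
        yield held_out, train_indices, test_indices


# Hypothetical toy data: six samples recorded from three subjects.
subjects = ["s1", "s1", "s2", "s3", "s3", "s3"]
folds = list(leave_one_subject_out(subjects))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx, test_idx)
```

Because no subject appears in both the training and the test indices of a fold, each score measures generalization to an unseen person. A model would typically be trained once per fold and the per-fold scores aggregated; how the paper aggregates them is not stated in this record.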


