Towards Multi-View Hand Pose Recognition Using a Fusion of Image Embeddings and Leap 2 Landmarks

Esteban-Romero, Sergio; Lanzino, Romeo; Marini, Marco; Gil-Martín, Manuel

doi:10.5220/0013234300003890

This paper presents a novel approach to multi-view hand pose recognition that combines image embeddings with hand landmarks. The method integrates raw image data with structural hand landmarks derived from the Leap Motion Controller 2. A pretrained Vision Transformer (ViT) model was used to extract visual features from dual-view grayscale images, which were fused with the corresponding Leap 2 hand landmarks, creating a multimodal representation that encapsulates both visual and landmark data for each sample. These fused embeddings were then classified using a multi-layer perceptron to distinguish among 17 distinct hand poses from the Multi-view Leap2 Hand Pose Dataset, which includes data from 21 subjects. Using a Leave-One-Subject-Out Cross-Validation (LOSO-CV) strategy, we demonstrate that this fusion approach offers robust recognition performance (F1 score of 79.33 ± 0.09 %), particularly in scenarios where hand occlusions or challenging angles may limit the utility of single-modality data.
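
The abstract does not spell out how the embeddings and landmarks are fused or how the folds are built, so the following is only a minimal sketch under stated assumptions: fusion is taken to be plain concatenation of the two per-view embeddings with a flattened landmark vector, and each LOSO-CV fold holds out every sample of one subject. The names `fuse_features` and `loso_splits` are hypothetical, and the toy vector sizes are placeholders rather than the dimensions used in the paper.

```python
from typing import Iterator, List, Sequence, Tuple


def fuse_features(
    view_a: Sequence[float],
    view_b: Sequence[float],
    landmarks: Sequence[float],
) -> List[float]:
    """Concatenate two per-view image embeddings with a flattened landmark vector.

    Assumption: fusion by concatenation; the source only says the features are "fused".
    """
    return list(view_a) + list(view_b) + list(landmarks)


def loso_splits(
    subject_ids: Sequence[int],
) -> Iterator[Tuple[int, List[int], List[int]]]:
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Toy sample: 2-dimensional embeddings per view and 3 landmark values (placeholders).
    fused = fuse_features([0.1, 0.2], [0.3, 0.4], [1.0, 2.0, 3.0])
    print(len(fused))

    # Six samples recorded from three hypothetical subjects.
    for subject, train_idx, test_idx in loso_splits([0, 0, 1, 2, 2, 1]):
        print(subject, train_idx, test_idx)
```

With 21 subjects, a split of this kind would produce 21 folds, and the classifier in each fold is never evaluated on a subject it saw during training.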

Towards Multi-View Hand Pose Recognition Using a Fusion of Image Embeddings and Leap 2 Landmarks / Esteban-Romero, S., Lanzino, R., Marini, M., Gil-Martín, M. - 3:(2025), pp. 918-925. (17th International Conference on Agents and Artificial Intelligence, ICAART 2025, Porto, Portugal) [10.5220/0013234300003890].

2025

Publication year: 2025
Conference name: 17th International Conference on Agents and Artificial Intelligence, ICAART 2025
Keywords: Deep Learning; Leap Motion Controller 2; Multi-View Hand Pose Recognition; Multimodal Data; Multimodal Fusion
Type: 04 Publication in conference proceedings::04b Conference paper in volume

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1760643
