Hyperbolic Learning with Multimodal Large Language Models

Mandica, Paolo; Franco, Luca; Kallidromitis, Konstantinos; Petryk, Suzanne; Galasso, Fabio

doi:10.1007/978-3-031-73232-4

Hyperbolic embeddings have proven effective at capturing measures of uncertainty and hierarchical relationships across various deep-learning tasks, including image segmentation and active learning. However, their application in modern vision-language models (VLMs) has been limited. A notable exception is MERU, which leverages the hierarchical properties of hyperbolic space in the CLIP ViT-large model, consisting of hundreds of millions of parameters. In our work, we address the challenges of scaling multi-modal hyperbolic models by orders of magnitude in terms of parameters (billions) and training complexity using the BLIP-2 architecture. Although hyperbolic embeddings offer potential insights into uncertainty not present in Euclidean embeddings, our analysis reveals that scaling these models is particularly difficult. We propose a novel training strategy for a hyperbolic version of BLIP-2, which achieves performance comparable to its Euclidean counterpart while maintaining stability throughout the training process and providing a meaningful indication of uncertainty with each embedding.

Hyperbolic Learning with Multimodal Large Language Models / Mandica, Paolo; Franco, Luca; Kallidromitis, Konstantinos; Petryk, Suzanne; Galasso, Fabio. - (2024). (Paper presented at the European Conference on Computer Vision, held in Milan, Italy) [10.1007/978-3-031-73232-4].

Paolo Mandica (co-first author); Luca Franco (co-first author); Konstantinos Kallidromitis; Suzanne Petryk; Fabio Galasso (last author)

2024

Publication year: 2024
Conference name: European Conference on Computer Vision
Keywords: multimodal learning; large language models; llm; hyperbolic geometry; uncertainty estimation
Type: 04 Publication in conference proceedings::04b Conference paper in volume

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1724542
