Hyperbolic Learning with Multimodal Large Language Models

Paolo Mandica (co-first author); Luca Franco (co-first author); Fabio Galasso (last author)
2024

Abstract

Hyperbolic embeddings have proven effective at capturing measures of uncertainty and hierarchical relationships across various deep-learning tasks, including image segmentation and active learning. However, their application in modern vision-language models (VLMs) has been limited. A notable exception is MERU, which leverages the hierarchical properties of hyperbolic space in the CLIP ViT-large model, consisting of hundreds of millions of parameters. In our work, we address the challenges of scaling multi-modal hyperbolic models by orders of magnitude in terms of parameters (billions) and training complexity using the BLIP-2 architecture. Although hyperbolic embeddings offer potential insights into uncertainty not present in Euclidean embeddings, our analysis reveals that scaling these models is particularly difficult. We propose a novel training strategy for a hyperbolic version of BLIP-2, which achieves performance comparable to its Euclidean counterpart while maintaining stability throughout training and providing a meaningful indication of uncertainty with each embedding.
European Conference on Computer Vision
multimodal learning; large language models; llm; hyperbolic geometry; uncertainty estimation
04 Publication in conference proceedings::04b Conference paper in volume
Hyperbolic Learning with Multimodal Large Language Models / Mandica, Paolo; Franco, Luca; Kallidromitis, Konstantinos; Petryk, Suzanne; Galasso, Fabio. - (2024). (Paper presented at the European Conference on Computer Vision, held in Milan, Italy) [10.1007/978-3-031-73232-4].

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1724542