
Closing the gap in multimodal medical representation alignment / Grassucci, Eleonora; Cicchetti, Giordano; Comminiello, Danilo. - (2025). (IEEE International Workshop on Machine Learning for Signal Processing, MLSP, Istanbul, Turkey).

Closing the gap in multimodal medical representation alignment

Eleonora Grassucci
First
Methodology
;
Giordano Cicchetti
Second
Software
;
Danilo Comminiello
Last
Conceptualization
2025

Abstract

In multimodal learning, CLIP has emerged as the de facto approach for mapping different modalities into a shared latent space by bringing semantically similar representations closer while pushing apart dissimilar ones. However, CLIP-based contrastive losses exhibit unintended behaviors that negatively impact true semantic alignment, leading to sparse and fragmented latent spaces. This phenomenon, known as the modality gap, has been partially mitigated for standard text and image pairs but remains unstudied and unresolved in more complex multimodal settings, such as the medical domain. In this work, we study this phenomenon in the medical setting, revealing that the modality gap is also present in medical alignment, and we propose a modality-agnostic framework that closes this gap, ensuring that semantically related representations are more aligned, regardless of their source modality. Our method enhances alignment between radiology images and clinical text, improving cross-modal retrieval and image captioning.
2025
IEEE International Workshop on Machine Learning for Signal Processing, MLSP
Multimodal learning, representation learning, medical analysis
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item
There are no files associated with this item.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1764467
Warning: the data displayed have not been validated by the university.
