Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation

Pomo, C.; Attimonelli, M.; Danese, D.; Narducci, F.; Di Noia, T.

doi:10.1145/3746252.3761398

Multimodal Recommender Systems aim to improve recommendation accuracy by integrating heterogeneous content, such as images and textual metadata. While effective, it remains unclear whether their gains stem from true multimodal understanding or increased model complexity. This work investigates the role of multimodal item embeddings, emphasizing the semantic informativeness of the representations. Initial experiments reveal that embeddings from standard extractors (e.g., ResNet50, Sentence-Bert) enhance performance, but rely on modality-specific encoders and ad hoc fusion strategies that lack control over cross-modal alignment. To overcome these limitations, we leverage Large Vision-Language Models (LVLMs) to generate multimodal-by-design embeddings via structured prompts. This approach yields semantically aligned representations without requiring any fusion. Experiments across multiple settings show notable performance improvements. Furthermore, LVLMs embeddings offer a distinctive advantage: they can be decoded into structured textual descriptions, enabling direct assessment of their multimodal comprehension. When such descriptions are incorporated as side content into recommender systems, they improve recommendation performance, empirically validating the semantic alignment encoded in LVLMs outputs. Our study highlights the importance of semantically rich representations and positions LVLMs as a compelling foundation to build robust and meaningful multimodal representations in recommendation tasks.

Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation / Pomo, C.; Attimonelli, M.; Danese, D.; Narducci, F.; Di Noia, T.. - (2025), pp. 2377-2387. ( 34th ACM International Conference on Information and Knowledge Management, CIKM 2025 kor ) [10.1145/3746252.3761398].

Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation

Pomo C.^Co-primo;Attimonelli M.^Co-primo;Narducci F.;Di Noia T.

2025

Abstract

Multimodal Recommender Systems aim to improve recommendation accuracy by integrating heterogeneous content, such as images and textual metadata. While effective, it remains unclear whether their gains stem from true multimodal understanding or increased model complexity. This work investigates the role of multimodal item embeddings, emphasizing the semantic informativeness of the representations. Initial experiments reveal that embeddings from standard extractors (e.g., ResNet50, Sentence-Bert) enhance performance, but rely on modality-specific encoders and ad hoc fusion strategies that lack control over cross-modal alignment. To overcome these limitations, we leverage Large Vision-Language Models (LVLMs) to generate multimodal-by-design embeddings via structured prompts. This approach yields semantically aligned representations without requiring any fusion. Experiments across multiple settings show notable performance improvements. Furthermore, LVLMs embeddings offer a distinctive advantage: they can be decoded into structured textual descriptions, enabling direct assessment of their multimodal comprehension. When such descriptions are incorporated as side content into recommender systems, they improve recommendation performance, empirically validating the semantic alignment encoded in LVLMs outputs. Our study highlights the importance of semantically rich representations and positions LVLMs as a compelling foundation to build robust and meaningful multimodal representations in recommendation tasks.

Scheda breve

Scheda completa

	Anno di pubblicazione
	
				2025
			
	Nome convegno
	
				34th ACM International Conference on Information and Knowledge Management, CIKM 2025
			
	Parole chiave
	
				large vision-language models; multimodal recommender systems; multimodal representation learning
			
	Tipologia
	
				04 Pubblicazione in atti di convegno::04b Atto di convegno in volume
			
	Citazione
	
				Do Recommender Systems Really Leverage Multimodal Content? A Comprehensive Analysis on Multimodal Representations for Recommendation / Pomo, C.; Attimonelli, M.; Danese, D.; Narducci, F.; Di Noia, T.. - (2025), pp. 2377-2387. ( 34th ACM International Conference on Information and Knowledge Management, CIKM 2025 kor ) [10.1145/3746252.3761398].

File allegati a questo prodotto

Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/1758116

Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni

ND

0

ND

Catalogo dei prodotti della ricerca