Large-scale Benchmarks for Multimodal Recommendation with Ducho / Attimonelli, Matteo; Danese, Danilo; Di Fazio, Angela; Malitesta, Daniele; Pomo, Claudio; Di Noia, Tommaso. - In: EXPERT SYSTEMS WITH APPLICATIONS. - ISSN 0957-4174. - (2025). [10.1016/j.eswa.2025.130813]
Large-scale Benchmarks for Multimodal Recommendation with Ducho
Matteo Attimonelli (First Author)
2025
Abstract
With the advent of deep learning and, more recently, large models, recommender systems have greatly refined their ability to profile users’ preferences and interests, which are often complex to disentangle. This is especially true for recommendation algorithms that rely heavily on external side information, such as multimodal recommender systems. In domains like fashion, music, and movie recommendation, the multi-faceted features characterizing products and services may influence each customer on online platforms differently, paving the way for novel multimodal recommendation models that can learn from such multimodal content. According to the literature, the common multimodal recommendation pipeline involves (i) extracting multimodal features, (ii) refining their high-level representations to suit the recommendation task, (iii) optionally fusing all multimodal features, and (iv) predicting the user-item score. Although great effort has been devoted to designing optimal solutions for (ii)-(iv), to the best of our knowledge, very little attention has been paid to exploring procedures for (i) in a rigorous way. In this respect, the existing literature highlights the wide availability of multimodal datasets and the ever-growing number of large models for multimodal tasks, yet, at the same time, an unjustified adoption of a narrow set of standardized extraction solutions. As very recent works have begun to conduct empirical studies assessing the contribution of multimodality to recommendation, we follow and complement this research direction. To this end, this paper stands as the first attempt to offer a large-scale benchmark for multimodal recommender systems, with a specific focus on multimodal feature extractors. Specifically, we take advantage of three popular and recent frameworks, namely Ducho for multimodal feature extraction and MMRec and Elliot for reproducible recommendation, to offer a unified, ready-to-use experimental environment able to run extensive benchmarking analyses leveraging novel multimodal feature extractors. Results, extensively validated across different extractors, extractor hyper-parameters, domains, and modalities, provide important insights on how to train and tune the next generation of multimodal recommendation algorithms.
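
To make pipeline step (i) concrete, the following is a minimal, illustrative sketch of multimodal feature extraction for a single catalogue item, using off-the-shelf pre-trained backbones (a ResNet-50 for images and a MiniLM sentence encoder for text). It is not Ducho's actual configuration or API; the model choices and the extract_item_features helper are hypothetical placeholders, meant only to show how swapping extractors and their settings changes the features that feed steps (ii)-(iv).

    # Illustrative sketch of pipeline step (i): extracting visual and textual
    # item features with pre-trained backbones. Not Ducho's API; model names
    # and the helper below are hypothetical placeholders.
    import torch
    from PIL import Image
    from torchvision import models
    from transformers import AutoModel, AutoTokenizer

    # Visual extractor: pre-trained ResNet-50 with its classification head removed
    weights = models.ResNet50_Weights.IMAGENET1K_V2
    resnet = models.resnet50(weights=weights)
    visual_extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()
    preprocess = weights.transforms()

    # Textual extractor: pre-trained sentence encoder
    tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
    text_encoder = AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2").eval()

    @torch.no_grad()
    def extract_item_features(image_path: str, description: str):
        """Return (visual, textual) embeddings for one catalogue item."""
        # Visual features: pooled ResNet-50 activations, shape (2048,)
        img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
        visual = visual_extractor(img).flatten()
        # Textual features: mean-pooled token embeddings, shape (384,)
        enc = tok(description, return_tensors="pt", truncation=True)
        textual = text_encoder(**enc).last_hidden_state.mean(dim=1).squeeze(0)
        return visual, textual

    # Hypothetical usage; the resulting embeddings would then be refined, fused,
    # and scored by the downstream recommender (steps (ii)-(iv)):
    # visual, textual = extract_item_features("item_001.jpg", "Red leather ankle boots")

In a benchmarking setting such as the one the paper describes, the two backbones above would be treated as experimental variables, replaced by alternative extractors and re-run across domains and modalities.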


