
A Multimodal Visual Sentiment Analysis Framework Enhanced With Feature Pyramid Networks / Galletti, D.; Ponzi, V.; Russo, S. - 3984 (2025), pp. 55-63. (10th International Conference of Yearly Reports on Informatics, Mathematics, and Engineering, ICYRIME 2025, Czestochowa, Poland).

A Multimodal Visual Sentiment Analysis Framework Enhanced With Feature Pyramid Networks

Ponzi V. (second author): Methodology; Russo S. (last author): Supervision
2025

Abstract

Visual Sentiment Analysis aims to understand how images affect people in terms of evoked emotions. This paper presents a complete pipeline for comparing users’ emotional responses to images, enabling the analysis of potential discrepancies between machine-inferred and subjective affective states. The proposed framework consists of three main stages. The first stage employs a Convolutional Neural Network (CNN) enhanced with Feature Pyramid Network (FPN) layers to extract multi-scale visual features. Experimental results show that incorporating three additional FPN layers improves performance while introducing only a negligible increase in model complexity. In the second stage, a multimodal approach is adopted, where visual features are integrated with textual features derived from captions generated by an Image Captioning model. This fusion enriches the emotional context by combining visual and linguistic cues. In the final stage, a grounding mechanism is applied to align and merge sentiments from the different modalities into a unified representation. The algorithm’s output is then compared with the sentiment expressed by the user, enabling an analysis of the divergence between machine-inferred and human-perceived emotions.
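The final grounding stage described above merges per-modality sentiments and compares the result with the user's self-reported sentiment. As an illustration only (the paper does not specify the fusion rule), the sketch below assumes a simple convex combination of class probabilities from the visual (CNN+FPN) and textual (caption) branches, with a hypothetical weight `alpha`, and measures divergence from the user's one-hot sentiment with total variation distance:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(z - z.max())
    return e / e.sum()

LABELS = ["negative", "neutral", "positive"]

# Hypothetical sentiment logits over {negative, neutral, positive}
visual_logits = np.array([0.2, 0.5, 2.1])   # from the visual (CNN+FPN) branch
text_logits   = np.array([0.1, 0.9, 1.5])   # from the caption (text) branch

# Grounding modeled as a convex combination; alpha is an assumption
alpha = 0.6
fused = alpha * softmax(visual_logits) + (1 - alpha) * softmax(text_logits)
machine_label = LABELS[int(np.argmax(fused))]

# Divergence between machine-inferred and user-reported sentiment
user = np.array([0.0, 1.0, 0.0])               # user reported "neutral"
divergence = 0.5 * np.abs(fused - user).sum()  # total variation distance in [0, 1]

print(machine_label, round(divergence, 3))
```

Here the fused distribution still sums to one, so the divergence stays in [0, 1]; a value near 0 would indicate agreement between the machine-inferred and human-perceived emotion, a value near 1 strong disagreement.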
2025
10th International Conference of Yearly Reports on Informatics, Mathematics, and Engineering, ICYRIME 2025
Feature Pyramid Network; Multimodal Evaluation; Visual Sentiment Analysis
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this record
Galletti_Multimodal_2025.pdf
Open access
Note: https://ceur-ws.org/Vol-3984/p06.pdf
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 2.07 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1743424
Citazioni
  • Scopus 0