Galletti, D.; Ponzi, V.; Russo, S. (2025). A Multimodal Visual Sentiment Analysis Framework Enhanced With Feature Pyramid Networks. In: Proceedings of the 10th International Conference of Yearly Reports on Informatics, Mathematics, and Engineering (ICYRIME 2025), Czestochowa, Poland. CEUR Workshop Proceedings, Vol. 3984, pp. 55-63.
A Multimodal Visual Sentiment Analysis Framework Enhanced With Feature Pyramid Networks
Ponzi, V. (second author): Methodology
Russo, S. (last author): Supervision
2025
Abstract
Visual Sentiment Analysis aims to understand how images affect people in terms of evoked emotions. This paper presents a complete pipeline for comparing users' emotional responses to images, enabling the analysis of potential discrepancies between machine-inferred and subjective affective states. The proposed framework consists of three main stages. The first stage employs a Convolutional Neural Network (CNN) enhanced with Feature Pyramid Network (FPN) layers to extract multi-scale visual features. Experimental results show that incorporating three additional FPN layers improves performance while introducing only a negligible increase in model complexity. In the second stage, a multimodal approach is adopted, where visual features are integrated with textual features derived from captions generated by an Image Captioning model. This fusion enriches the emotional context by combining visual and linguistic cues. In the final stage, a grounding mechanism is applied to align and merge sentiments from the different modalities into a unified representation. The algorithm's output is then compared with the sentiment expressed by the user, enabling an analysis of the divergence between machine-inferred and human-perceived emotions.

| File | Description | Size | Format |
|---|---|---|---|
| Galletti_Multimodal_2025.pdf | Open access. Note: https://ceur-ws.org/Vol-3984/p06.pdf. Type: publisher's version (published with the publisher's layout). License: Creative Commons | 2.07 MB | Adobe PDF |
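The final stage described in the abstract, merging per-modality sentiments and comparing the result with the user's reported emotion, can be sketched as follows. This is an illustrative outline only, not the authors' implementation: the weighted late-fusion rule, the three-class label set, and all function names here are assumptions.

```python
# Illustrative sketch (not the paper's code): late fusion of visual and
# textual sentiment distributions, then comparison with the user's label.
# The class order [negative, neutral, positive] is an assumption.

LABELS = ["negative", "neutral", "positive"]

def fuse_sentiments(visual_probs, text_probs, alpha=0.5):
    """Merge the two per-modality distributions into one.

    alpha weights the visual modality; (1 - alpha) weights the textual one.
    The result is renormalized so it remains a probability distribution.
    """
    fused = [alpha * v + (1 - alpha) * t
             for v, t in zip(visual_probs, text_probs)]
    total = sum(fused)
    return [f / total for f in fused]

def divergence_from_user(fused_probs, user_label):
    """One minus the probability the model assigns to the user's sentiment.

    0.0 means full agreement with the user; values near 1.0 mean the
    model put almost no mass on the emotion the user actually reported.
    """
    return 1.0 - fused_probs[LABELS.index(user_label)]

# Hypothetical scores: the CNN+FPN head and the caption classifier would
# each produce a distribution over the sentiment classes.
visual = [0.1, 0.2, 0.7]
textual = [0.2, 0.3, 0.5]
fused = fuse_sentiments(visual, textual)
gap = divergence_from_user(fused, "positive")
```

With these example inputs the fused distribution stays dominated by the positive class, and `gap` quantifies how far the model's belief sits from the user's self-reported sentiment.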
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


