A Multimodal Visual Sentiment Analysis Framework Enhanced With Feature Pyramid Networks / Galletti, D.; Ponzi, V.; Russo, S. - 3984 (2025), pp. 55-63. (10th International Conference of Yearly Reports on Informatics, Mathematics, and Engineering, ICYRIME 2025).
A Multimodal Visual Sentiment Analysis Framework Enhanced With Feature Pyramid Networks
Ponzi V. (second author), Methodology
Russo S. (last author), Supervision
2025
Abstract
Visual Sentiment Analysis aims to understand how images affect people in terms of evoked emotions. This paper presents a complete pipeline for comparing users’ emotional responses to images, enabling the analysis of potential discrepancies between machine-inferred and subjective affective states. The proposed framework consists of three main stages. The first stage employs a Convolutional Neural Network (CNN) enhanced with Feature Pyramid Network (FPN) layers to extract multi-scale visual features. Experimental results show that incorporating three additional FPN layers improves performance while introducing only a negligible increase in model complexity. In the second stage, a multimodal approach is adopted, where visual features are integrated with textual features derived from captions generated by an Image Captioning model. This fusion enriches the emotional context by combining visual and linguistic cues. In the final stage, a grounding mechanism is applied to align and merge sentiments from the different modalities into a unified representation. The algorithm’s output is then compared with the sentiment expressed by the user, enabling an analysis of the divergence between machine-inferred and human-perceived emotions.
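
As a purely illustrative aid, the sketch below shows one way the first two stages described in the abstract could be wired together in PyTorch: a small CNN backbone with three FPN-style lateral/top-down layers whose pooled multi-scale features are fused with a caption embedding before sentiment classification. This is a minimal sketch under assumed dimensions and a three-class output, not the authors' implementation; the module names (TinyFPNEncoder, MultimodalSentimentNet) and all parameters are hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPNEncoder(nn.Module):
    """CNN backbone with three extra FPN-style lateral/top-down layers (illustrative)."""
    def __init__(self, dim=64):
        super().__init__()
        # Bottom-up pathway: three stages with increasing stride.
        self.c1 = nn.Sequential(nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU())
        self.c2 = nn.Sequential(nn.Conv2d(dim, dim * 2, 3, stride=2, padding=1), nn.ReLU())
        self.c3 = nn.Sequential(nn.Conv2d(dim * 2, dim * 4, 3, stride=2, padding=1), nn.ReLU())
        # Lateral 1x1 convolutions project every stage to a common channel width.
        self.l1 = nn.Conv2d(dim, dim, 1)
        self.l2 = nn.Conv2d(dim * 2, dim, 1)
        self.l3 = nn.Conv2d(dim * 4, dim, 1)

    def forward(self, x):
        f1 = self.c1(x)
        f2 = self.c2(f1)
        f3 = self.c3(f2)
        # Top-down pathway: upsample and add, as in a Feature Pyramid Network.
        p3 = self.l3(f3)
        p2 = self.l2(f2) + F.interpolate(p3, size=f2.shape[-2:], mode="nearest")
        p1 = self.l1(f1) + F.interpolate(p2, size=f1.shape[-2:], mode="nearest")
        # Global-pool each pyramid level and concatenate into one visual vector.
        pooled = [F.adaptive_avg_pool2d(p, 1).flatten(1) for p in (p1, p2, p3)]
        return torch.cat(pooled, dim=1)  # shape: (batch, 3 * dim)

class MultimodalSentimentNet(nn.Module):
    """Fuses the multi-scale visual vector with a caption embedding and predicts sentiment."""
    def __init__(self, dim=64, text_dim=128, n_classes=3):
        super().__init__()
        self.visual = TinyFPNEncoder(dim)
        self.fuse = nn.Sequential(
            nn.Linear(3 * dim + text_dim, 128), nn.ReLU(), nn.Linear(128, n_classes)
        )

    def forward(self, image, caption_emb):
        v = self.visual(image)
        return self.fuse(torch.cat([v, caption_emb], dim=1))

# Usage: caption_emb would come from the text side of an Image Captioning model.
model = MultimodalSentimentNet()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 128))
print(logits.shape)  # torch.Size([2, 3])

In the full pipeline described in the abstract, the third stage would then ground and merge the visual and textual sentiment cues into a unified prediction, which is finally compared against the sentiment reported by the user.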


