FoleyGRAM: Video-to-audio generation with GRAM-aligned multimodal encoders / Gramaccioni, Riccardo Fosco; Marinoni, Christian; Grassucci, Eleonora; Cicchetti, Giordano; Uncini, Aurelio; Comminiello, Danilo. (2025). International Joint Conference on Neural Networks (IJCNN), Rome, Italy.
FoleyGRAM: Video-to-audio generation with GRAM-aligned multimodal encoders
Riccardo Fosco Gramaccioni (co-first author; Methodology)
Christian Marinoni (co-first author; Software)
Eleonora Grassucci (second author; Conceptualization)
Giordano Cicchetti (Validation)
Aurelio Uncini (penultimate author; Supervision)
Danilo Comminiello (last author; Supervision)
2025
Abstract
In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system’s ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
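The abstract does not state how GRAM is computed. As a rough illustration only, the sketch below assumes that a Gramian alignment measure scores a set of modality embeddings by the volume of the parallelotope they span, that is, the square root of the determinant of their Gram matrix after L2 normalization; a smaller volume then indicates closer alignment. The function names (`normalize`, `determinant`, `gram_volume`) and the toy three-dimensional vectors are hypothetical and are not taken from the paper or its code.

```python
import math


def normalize(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]  # work on a copy
    n = len(m)
    det = 1.0
    for col in range(n):
        # Pick the row with the largest entry in this column as pivot.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0  # numerically singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def gram_volume(embeddings):
    """Volume spanned by the unit-normalized embeddings (assumed measure)."""
    unit = [normalize(v) for v in embeddings]
    # Gram matrix: pairwise dot products of the normalized embeddings.
    gram = [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]
    # Clamp tiny negative determinants caused by rounding.
    return math.sqrt(max(determinant(gram), 0.0))


# Hypothetical video, text, and audio embeddings for one clip.
video = [1.0, 0.0, 0.0]
text = [0.9, 0.1, 0.0]
audio = [0.9, 0.0, 0.1]
alignment_volume = gram_volume([video, text, audio])
```

Under this assumption, mutually orthogonal embeddings give the largest possible volume and collinear embeddings give a volume of zero, so a training objective would push the volume for matching video, text, and audio triplets toward zero.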


