FoleyGRAM: Video-to-audio generation with GRAM-aligned multimodal encoders / Gramaccioni, Riccardo Fosco; Marinoni, Christian; Grassucci, Eleonora; Cicchetti, Giordano; Uncini, Aurelio; Comminiello, Danilo. - (2025). (International Joint Conference on Neural Networks (IJCNN), Roma, Italia).

FoleyGRAM: Video-to-audio generation with GRAM-aligned multimodal encoders

Riccardo Fosco Gramaccioni (co-first author, Methodology); Christian Marinoni (co-first author, Software); Eleonora Grassucci (second author, Conceptualization); Giordano Cicchetti (Validation); Aurelio Uncini (penultimate author, Supervision); Danilo Comminiello (last author, Supervision)

2025

Abstract

In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system’s ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
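The abstract describes conditioning a diffusion model on embeddings aligned with GRAM across video, text, and audio. As a rough illustration of the underlying idea, the sketch below computes a Gramian alignment score over three modality embeddings, assuming the volume-of-parallelotope formulation of GRAM: the volume spanned by the L2-normalized video, text, and audio vectors shrinks toward zero as the modalities become aligned. Function names, tensor shapes, and the 512-dimensional embedding size are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: Gramian alignment score for video/text/audio embeddings.
import torch


def gram_volume(video_emb: torch.Tensor,
                text_emb: torch.Tensor,
                audio_emb: torch.Tensor) -> torch.Tensor:
    """Volume of the parallelotope spanned by per-sample modality embeddings.

    Each input has shape (batch, dim); a lower volume indicates that the
    three modality embeddings are closer to lying on a common direction,
    i.e. better aligned.
    """
    # Stack modalities into a (batch, 3, dim) matrix and L2-normalize rows.
    A = torch.stack([video_emb, text_emb, audio_emb], dim=1)
    A = torch.nn.functional.normalize(A, dim=-1)
    # Gram matrix G = A A^T, shape (batch, 3, 3); det(G) is the squared
    # volume of the parallelotope spanned by the three embedding vectors.
    G = A @ A.transpose(1, 2)
    return torch.sqrt(torch.det(G).clamp(min=0.0))


if __name__ == "__main__":
    # Example with random embeddings for a batch of 4 clips (512-dim assumed).
    v, t, a = (torch.randn(4, 512) for _ in range(3))
    print(gram_volume(v, t, a))  # one alignment score per sample
```

In a setup like the one the abstract outlines, such a score could serve as a training objective for aligning the encoders, after which the aligned video/text embeddings (together with a waveform envelope for timing) would condition the diffusion-based audio generator.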
2025
International Joint Conference on Neural Networks (IJCNN)
Keywords: semantically-aligned generation, video-to-audio synthesis, sound design, multimodal conditioning
Publication type: 04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this record
There are no files associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1764468
Warning: the displayed data have not been validated by the university.
