Research Products Catalog

FoleyGRAM. Video-to-audio generation with GRAM-aligned multimodal encoders / Gramaccioni, Riccardo Fosco; Marinoni, Christian; Grassucci, Eleonora; Cicchetti, Giordano; Uncini, Aurelio; Comminiello, Danilo. - (2025). ( International Joint Conference on Neural Networks (IJCNN 2025) Rome; Italy ) [10.1109/IJCNN64981.2025.11229067].

FoleyGRAM. Video-to-audio generation with GRAM-aligned multimodal encoders

Riccardo Fosco Gramaccioni (co-first author; Methodology); Christian Marinoni (co-first author; Software); Eleonora Grassucci (second author; Conceptualization); Giordano Cicchetti (Validation); Aurelio Uncini (penultimate author; Supervision); Danilo Comminiello (last author; Supervision)

2025

Abstract

In this work, we present FoleyGRAM, a novel approach to video-to-audio generation that emphasizes semantic conditioning through the use of aligned multimodal encoders. Building on prior advancements in video-to-audio generation, FoleyGRAM leverages the Gramian Representation Alignment Measure (GRAM) to align embeddings across video, text, and audio modalities, enabling precise semantic control over the audio generation process. The core of FoleyGRAM is a diffusion-based audio synthesis model conditioned on GRAM-aligned embeddings and waveform envelopes, ensuring both semantic richness and temporal alignment with the corresponding input video. We evaluate FoleyGRAM on the Greatest Hits dataset, a standard benchmark for video-to-audio models. Our experiments demonstrate that aligning multimodal encoders using GRAM enhances the system’s ability to semantically align generated audio with video content, advancing the state of the art in video-to-audio synthesis.
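As an illustration of the alignment idea mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes that a Gramian alignment measure scores a set of modality embeddings by the volume of the parallelotope spanned by their unit-normalized vectors, computed as the square root of the determinant of their Gram matrix, so that a smaller volume indicates closer alignment. The function name `gram_volume` and the variables `video_emb`, `text_emb`, and `audio_emb` are hypothetical, and random vectors stand in for real encoder outputs.

```python
import numpy as np


def gram_volume(embeddings):
    """Volume of the parallelotope spanned by the L2-normalized rows.

    Assumption for illustration: a smaller volume means the modality
    embeddings point in more similar directions (better alignment).
    """
    a = np.asarray(embeddings, dtype=np.float64)
    # Normalize each embedding to unit length.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    # Gram matrix: pairwise cosine similarities between modalities.
    gram = a @ a.T
    # Clamp tiny negative determinants caused by floating-point error.
    det = max(float(np.linalg.det(gram)), 0.0)
    return float(np.sqrt(det))


# Hypothetical stand-ins for video, text, and audio encoder outputs.
rng = np.random.default_rng(0)
video_emb = rng.normal(size=512)
text_emb = video_emb + 0.1 * rng.normal(size=512)   # close to the video embedding
audio_emb = video_emb + 0.1 * rng.normal(size=512)  # close to the video embedding
unrelated_emb = rng.normal(size=512)                # unrelated direction

aligned_volume = gram_volume([video_emb, text_emb, audio_emb])
misaligned_volume = gram_volume([video_emb, text_emb, unrelated_emb])
print(aligned_volume, misaligned_volume)
```

Under these assumptions, the triplet whose vectors nearly coincide yields a volume close to zero, while replacing one member with an unrelated vector yields a larger volume.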

Publication year: 2025
Conference name: International Joint Conference on Neural Networks (IJCNN 2025)
Keywords: semantically-aligned generation; video-to-audio synthesis; sound design; multimodal conditioning
Type: 04 Publication in conference proceedings::04b Conference paper in volume

Files attached to this product

File	Access	Type	License	Size	Format
Gramaccioni_FoleyGRAM_2025.pdf	Archive managers only	Post-print document (version after peer review and accepted for publication)	All rights reserved	6.25 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1764468
