Syncfusion: Multimodal Onset-Synchronized Video-to-Audio Foley Synthesis

R. F. Gramaccioni (co-first author); E. Postolache (second author); D. Comminiello
2024

Abstract

Sound design involves creatively selecting, recording, and editing sound effects for various media such as cinema, video games, and virtual/augmented reality. One of the most time-consuming steps in sound design is synchronizing audio with video. In some cases, environmental recordings from video shoots are available and can aid the process. However, in video games and animations no reference audio exists, and event timings must be annotated manually from the video. We propose a system that extracts repetitive action onsets from a video, which are then used, in conjunction with audio or textual embeddings, to condition a diffusion model trained to generate a new, synchronized sound effects audio track. In this way, we leave complete creative control to the sound designer while removing the burden of synchronization with video. Furthermore, editing the onset track or changing the conditioning embedding requires much less effort than editing the audio track itself, simplifying the sonification process. We provide sound examples, source code, and pretrained models to facilitate reproducibility.
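
To make the conditioning scheme described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch (not the authors' released code) of a denoiser conditioned on a per-frame onset track and a global text/audio embedding. All module names, dimensions, and the noise schedule are illustrative assumptions, not details taken from the paper.

# Hypothetical sketch: onset- and embedding-conditioned diffusion denoiser.
import torch
import torch.nn as nn

class OnsetConditionedDenoiser(nn.Module):
    """Toy denoiser: predicts the noise added to an audio latent, given a
    binary onset track (one value per frame) and a global conditioning
    embedding (e.g. a text or audio encoder output)."""
    def __init__(self, latent_dim=64, embed_dim=512, hidden=256):
        super().__init__()
        self.latent_proj = nn.Conv1d(latent_dim, hidden, kernel_size=3, padding=1)
        self.onset_proj = nn.Conv1d(1, hidden, kernel_size=3, padding=1)
        self.embed_proj = nn.Linear(embed_dim, hidden)
        self.time_proj = nn.Linear(1, hidden)
        self.out = nn.Conv1d(hidden, latent_dim, kernel_size=3, padding=1)

    def forward(self, noisy_latent, t, onsets, embedding):
        # noisy_latent: (B, latent_dim, T); onsets: (B, 1, T);
        # embedding: (B, embed_dim); t: (B,) diffusion time in [0, 1].
        h = self.latent_proj(noisy_latent) + self.onset_proj(onsets)
        h = h + self.embed_proj(embedding).unsqueeze(-1)        # broadcast over time
        h = h + self.time_proj(t[:, None].float()).unsqueeze(-1)
        return self.out(torch.relu(h))

# One illustrative training step (prediction target and schedule are simplified):
B, D, T = 2, 64, 400
model = OnsetConditionedDenoiser()
x0 = torch.randn(B, D, T)                       # clean audio latent
onsets = (torch.rand(B, 1, T) > 0.95).float()   # sparse onset track from video
emb = torch.randn(B, 512)                       # text/audio conditioning embedding
t = torch.rand(B)
noise = torch.randn_like(x0)
xt = (1 - t)[:, None, None] * x0 + t[:, None, None] * noise   # simple linear schedule
loss = nn.functional.mse_loss(model(xt, t, onsets, emb), noise)

At inference, the same onset track (edited by the sound designer if desired) and a different embedding can be reused to regenerate a synchronized effects track without re-annotating the video.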
2024
ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Sound effects synthesis, foley, diffusion models, audio-video synchronization, multimodal audio synthesis
04 Publication in conference proceedings::04b Conference paper in volume
Syncfusion: Multimodal Onset-Synchronized Video-to-Audio Foley Synthesis / Comunità, M.; Gramaccioni, R. F.; Postolache, E.; Rodolà, E.; Comminiello, D.; Reiss, J. D. - (2024), pp. 936-940. (Paper presented at ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), held in Seoul, Republic of Korea) [10.1109/ICASSP48485.2024.10447063].

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1725167