Environmental Sound Classification (ESC) is an increasingly important application in scenarios such as smart cities, autonomous systems, safety, and industrial monitoring. Traditional ESC methods mainly rely on features extracted from a single representation, usually spectrograms or MFCCs. Although deep learning-based CNN models have demonstrated excellent performance, they remain limited by this reliance on a single feature representation. This work therefore adopts a multi-representation strategy that fuses five kinds of audio features: spectrograms, phasograms, scalograms, wavelet phasograms, and MFCC-grams. Each representation captures different properties of the audio. The representations are combined in a structured manner through three fusion strategies (early, intermediate, and late fusion), using a novel EfficientNet-based model named EfficientAudioNet. The proposed strategies are evaluated on four benchmark datasets: a Construction Site machinery sounds dataset, the ESC-10 and ESC-50 environmental sound datasets, and the UrbanSound8K dataset. Experimental results demonstrate that multi-representation fusion, especially early fusion, significantly enhances classification performance. Overall, the proposed approach surpasses state-of-the-art accuracy on all the tested datasets.
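The abstract names three fusion strategies without saying how each is implemented. The sketch below shows one common reading of the terms and is not the authors' EfficientAudioNet code. It assumes early fusion stacks the five representations as input channels, intermediate fusion concatenates per-branch feature vectors, and late fusion averages per-branch class probabilities. All function and key names are hypothetical.

```python
from typing import Dict, List

Image = List[List[float]]  # one time-frequency image, rows x cols
Vector = List[float]

# The five representations listed in the abstract (key names are hypothetical).
REPRESENTATIONS = [
    "spectrogram",
    "phasogram",
    "scalogram",
    "wavelet_phasogram",
    "mfcc_gram",
]


def early_fusion(images: Dict[str, Image]) -> List[Image]:
    """Stack the images as channels of a single (channels, rows, cols) input.

    Assumption: every representation was resized to the same shape beforehand.
    """
    shapes = {(len(images[name]), len(images[name][0])) for name in REPRESENTATIONS}
    if len(shapes) != 1:
        raise ValueError("early fusion needs all representations in one shape")
    return [images[name] for name in REPRESENTATIONS]


def intermediate_fusion(features: Dict[str, Vector]) -> Vector:
    """Concatenate per-branch feature vectors for one shared classifier head."""
    fused: Vector = []
    for name in REPRESENTATIONS:
        fused.extend(features[name])
    return fused


def late_fusion(probabilities: Dict[str, Vector]) -> Vector:
    """Average the class probabilities predicted by each branch."""
    n_classes = len(probabilities[REPRESENTATIONS[0]])
    return [
        sum(probabilities[name][c] for name in REPRESENTATIONS) / len(REPRESENTATIONS)
        for c in range(n_classes)
    ]
```

Under these assumptions, early fusion needs only one network but requires every representation to share a shape, whereas intermediate and late fusion run one branch per representation and merge later in the pipeline.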

EfficientAudioNet. Enhancing environmental sound classification through data fusion of multiple audio representations / Scarpiniti, Michele; Hussain, Saud; Pu, Wangyi; Uncini, Aurelio; Lee, Yong-Cheol. - (2025), pp. 1-8. (2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy) [10.1109/ijcnn64981.2025.11227157].

EfficientAudioNet. Enhancing environmental sound classification through data fusion of multiple audio representations

Scarpiniti, Michele; Hussain, Saud; Pu, Wangyi; Uncini, Aurelio; Lee, Yong-Cheol
2025

2025 International Joint Conference on Neural Networks (IJCNN)
environmental sound classification (ESC); classification; spectrogram; scalogram; phasogram; MFCC; deep learning
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item

File: Scarpiniti_EfficientAudioNet_2025.pdf
Access: archive managers only
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 1.06 MB
Format: Adobe PDF

File: Scarpiniti_postprint_EfficientAudioNet_2025.pdf
Access: archive managers only
Note: Post-print version
Type: Post-print document (version following peer review and accepted for publication)
License: All rights reserved
Size: 377.73 kB
Format: Adobe PDF

File: Scarpiniti_copertina_EfficientAudioNet_2025.pdf
Access: archive managers only
Type: Other attached material
License: All rights reserved
Size: 13.95 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1755858