Environmental Sound Classification (ESC) is an increasingly important application in scenarios such as smart cities, autonomous systems, safety, and industrial monitoring. Traditional ESC methods mainly rely on features extracted from a single representation, usually spectrograms or MFCCs. Although deep learning-based CNN models have demonstrated excellent performance, they remain limited by this reliance on a single feature representation. This work therefore adopts a multi-representation strategy that fuses five kinds of audio features: spectrograms, phasograms, scalograms, wavelet phasograms, and MFCC-grams. Each representation captures different properties of the audio. The representations are combined in a structured manner by investigating three fusion strategies (early, intermediate, and late fusion) using a novel model based on EfficientNet, named EfficientAudioNet. The proposed strategies are evaluated on four benchmark datasets: a Construction Site machinery sounds dataset, the ESC-10 and ESC-50 environmental sound datasets, and the UrbanSound8K dataset. Experimental results demonstrate that multi-representation fusion, especially early fusion, significantly enhances classification performance. Overall, the proposed approach surpasses state-of-the-art accuracy on all the tested datasets.
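The three fusion strategies named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' EfficientAudioNet implementation: the abstract does not specify how each fusion is realized, so the code uses the common textbook forms (channel stacking for early fusion, feature concatenation for intermediate fusion, probability averaging for late fusion). All function names, the frame length, the hop size, and the random stand-ins for audio, features, and logits are assumptions made for illustration. Only two of the five representations (spectrogram and phasogram, taken here as the log-magnitude and phase of one STFT) are shown.

```python
import numpy as np

def stft(signal, frame_len=256, hop=128):
    """Naive STFT with a Hann window; returns a complex (freq, time) array."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)

def spectrogram_and_phasogram(signal):
    """Log-magnitude and phase of the same STFT (assumed definitions)."""
    z = stft(signal)
    return np.log1p(np.abs(z)), np.angle(z)

def early_fusion(representations):
    """Early fusion: stack representations as input channels, shape (C, F, T)."""
    return np.stack(representations, axis=0)

def intermediate_fusion(per_branch_features):
    """Intermediate fusion: concatenate feature vectors from separate branches."""
    return np.concatenate(per_branch_features, axis=-1)

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion(per_branch_logits):
    """Late fusion: average the class probabilities of independent branches."""
    probs = np.stack([softmax(l) for l in per_branch_logits], axis=0)
    return probs.mean(axis=0)

rng = np.random.default_rng(0)
audio = rng.standard_normal(4096)           # stand-in for a real audio clip

spec, phase = spectrogram_and_phasogram(audio)
fused_input = early_fusion([spec, phase])   # one network sees both channels

# Hypothetical per-branch outputs (random placeholders, not real model outputs).
branch_features = [rng.standard_normal(64) for _ in range(2)]
branch_logits = [rng.standard_normal(10) for _ in range(2)]

fused_features = intermediate_fusion(branch_features)
fused_probs = late_fusion(branch_logits)
```

In this sketch, early fusion hands a single network a multi-channel input, while intermediate and late fusion need one branch per representation and merge either their features or their predictions.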
EfficientAudioNet. Enhancing environmental sound classification through data fusion of multiple audio representations / Scarpiniti, Michele; Hussain, Saud; Pu, Wangyi; Uncini, Aurelio; Lee, Yong-Cheol. - (2025), pp. 1-8. (2025 International Joint Conference on Neural Networks (IJCNN), Rome, Italy) [10.1109/ijcnn64981.2025.11227157].
EfficientAudioNet. Enhancing environmental sound classification through data fusion of multiple audio representations
Scarpiniti, Michele; Hussain, Saud; Pu, Wangyi; Uncini, Aurelio; Lee, Yong-Cheol
2025
| File | Type | License | Size | Format | Access |
|---|---|---|---|---|---|
| Scarpiniti_EfficientAudioNet_2025.pdf | Publisher's version (published version with the publisher's layout) | All rights reserved | 1.06 MB | Adobe PDF | Archive managers only |
| Scarpiniti_postprint_EfficientAudioNet_2025.pdf | Post-print (version after peer review, accepted for publication) | All rights reserved | 377.73 kB | Adobe PDF | Archive managers only |
| Scarpiniti_copertina_EfficientAudioNet_2025.pdf | Other attached material | All rights reserved | 13.95 MB | Adobe PDF | Archive managers only |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.


