A two-stage neural network for speech signal reconstruction from Mel spectrograms / Villani, Filippo; Scarpiniti, Michele; Uncini, Aurelio. - (2025), pp. 267-278. - SMART INNOVATION, SYSTEMS AND TECHNOLOGIES. [10.1007/978-981-96-0994-9_25].

A two-stage neural network for speech signal reconstruction from Mel spectrograms

Villani, Filippo; Scarpiniti, Michele; Uncini, Aurelio
2025

Abstract

In this work, we propose a neural network approach for reconstructing speech from mel spectrograms, a crucial step for obtaining high-quality audio after speech signals have been processed in the time-frequency domain. Specifically, we propose a two-stage deep learning approach in which an overcomplete deep autoencoder (DAE) inverts the mel filter bank and the deep version of the Griffin-Lim algorithm (DeGLI) recovers the phase information. Both parts of the architecture are first pre-trained, and the whole system is then fine-tuned. Numerical results on the well-known TIMIT dataset demonstrate the effectiveness of the proposed approach, which obtains a PESQ of 3.996, a STOI of 0.994, and a mean opinion score of 4.15.
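
To make the first stage concrete, the following minimal sketch shows why mel filter bank inversion is a lossy problem. It is not the DAE from the chapter: the inversion used here is a crude normalized-transpose estimate chosen only for illustration. The abstract does not state the filter bank configuration, so the HTK-style mel formula, the 40 bands, the 257 frequency bins, and the 16 kHz sample rate are all assumptions, as are the function names.

```python
import math

def hz_to_mel(f_hz):
    # HTK-style mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def mel_filter_bank(n_mels, n_freqs, sample_rate):
    """Triangular filters as an n_mels x n_freqs list of lists."""
    f_max = sample_rate / 2.0
    mel_max = hz_to_mel(f_max)
    # n_mels + 2 edge frequencies, equally spaced on the mel scale.
    edges = [mel_to_hz(mel_max * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = [f_max * k / (n_freqs - 1) for k in range(n_freqs)]
    bank = []
    for m in range(n_mels):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        row = []
        for f in bin_hz:
            rising = (f - lo) / (mid - lo)
            falling = (hi - f) / (hi - mid)
            row.append(max(0.0, min(rising, falling)))
        bank.append(row)
    return bank

def apply_bank(bank, magnitude):
    """Mel frame: one weighted sum of the magnitude frame per filter."""
    return [sum(w * x for w, x in zip(row, magnitude)) for row in bank]

def approx_invert(bank, mel_frame):
    """Crude inversion (illustrative stand-in, not the chapter's DAE)."""
    areas = [sum(row) for row in bank]
    # Average magnitude inside each band (0.0 for an empty filter).
    band_avg = [e / a if a > 0.0 else 0.0 for e, a in zip(mel_frame, areas)]
    estimate = []
    for k in range(len(bank[0])):
        weight = sum(row[k] for row in bank)
        value = sum(row[k] * avg for row, avg in zip(bank, band_avg))
        # Bins not covered by any filter cannot be estimated at all.
        estimate.append(value / weight if weight > 0.0 else 0.0)
    return estimate

if __name__ == "__main__":
    bank = mel_filter_bank(n_mels=40, n_freqs=257, sample_rate=16000)
    peak = [0.0] * 257
    peak[200] = 1.0  # a single narrow spectral peak
    recovered = approx_invert(bank, apply_bank(bank, peak))
    print("peak bin after inversion:", recovered[200])
```

A flat spectrum passes through this round trip unchanged at the covered bins, while a narrow peak is smeared across its whole mel band, because 40 band energies cannot determine 257 bin magnitudes. A learned inverse such as the DAE is aimed at exactly that gap, and the second stage must still recover the phase, which this sketch ignores entirely.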
Advanced Neural Artificial Intelligence: Theories and Applications
9789819609932
9789819609949
Mel spectrogram inversion; phase reconstruction; speech enhancement; time-frequency representation; deep autoencoder
02 Publication in a volume::02a Chapter or Article
Files attached to this record

Villani_Two-stage_2025.pdf
Access: archive managers only
Note: Mel-Inversion_editoriale
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 286.63 kB
Format: Adobe PDF

Villani_postprint-Two-stage_2025.pdf
Access: archive managers only
Note: Mel-Inversion_Postprint
Type: Post-print document (version after peer review and accepted for publication)
License: All rights reserved
Size: 364.09 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1740169