Abstract: Alzheimer’s disease represents a growing global health concern, emphasizing the need for early diagnosis to mitigate neurocognitive decline. Speech analysis has emerged as a promising, non-invasive approach, yet limited data availability hinders the development of robust Machine Learning (ML) models. To address this challenge, this study exploits two capabilities of Large Language Models (LLMs): their ability to generate synthetic data and their capacity to extract complex linguistic features from speech. We employ GPT-4 to generate synthetic transcripts, thus expanding the ADReSS2020 dataset and enhancing its diversity while preserving semantic and structural coherence. Moreover, we propose a novel multilevel feature extraction framework that integrates Bidirectional Encoder Representations from Transformers (BERT) embeddings fine-tuned with linguistic features obtained through Computerized Language Analysis (CLAN). The study involved two experiments: the first to identify the optimal feature extraction strategy, and the second to evaluate the impact of synthetic data generated by GPT-4. In both experiments, the performance of five classifiers was evaluated to determine the most effective configuration. Our results demonstrated that fine-tuned BERT embeddings slightly improve classification performance compared to pre-trained models, highlighting the value of domain-specific fine-tuning. Although adding CLAN-like linguistic features yielded limited benefits, GPT-4-generated synthetic data demonstrated promising potential, particularly when combined with sentence embeddings. Random Forest, for example, improved in accuracy from 0.79 to 0.88 when trained on the augmented dataset. This study paves the way for the use of LLMs to expand the diversity of datasets and improve the robustness of ML models in clinical applications.
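The idea of a multilevel feature vector described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the two lexical measures (type-token ratio and mean word length) standing in for CLAN-style features, and the toy embedding values are all assumptions made for illustration. In practice the embedding would come from a BERT model and the linguistic features from CLAN.

```python
from typing import List


def linguistic_features(transcript: str) -> List[float]:
    """Two simple lexical measures, used here as stand-ins for CLAN-style features."""
    tokens = transcript.lower().split()
    if not tokens:
        return [0.0, 0.0]
    # Type-token ratio: distinct words divided by total words.
    type_token_ratio = len(set(tokens)) / len(tokens)
    # Mean word length in characters.
    mean_word_length = sum(len(t) for t in tokens) / len(tokens)
    return [type_token_ratio, mean_word_length]


def multilevel_vector(embedding: List[float], transcript: str) -> List[float]:
    """Concatenate a sentence embedding with the linguistic features."""
    return list(embedding) + linguistic_features(transcript)


# Hypothetical 4-dimensional embedding; BERT-base embeddings have 768 dimensions.
toy_embedding = [0.12, -0.40, 0.33, 0.05]
features = multilevel_vector(toy_embedding, "the boy is on the stool")
```

The resulting vector could then be passed to any standard classifier, such as the Random Forest mentioned in the abstract.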

Enhancing Alzheimer's Disease Detection Using LLM-Generated Synthetic Data and MultiLevel Embeddings / Sai Bhargav Mutala, Venkata; Pouriyeh, Seyedamin; Parizi, Reza M.; Yixin Xie, Chloe; Santopaolo, Alessandro; Basile, Ilaria; Sannino, Giovanna. - (2025). (Paper presented at the 2025 International Joint Conference on Neural Networks (IJCNN), held in Rome) [10.1109/IJCNN64981.2025.11229368].

Enhancing Alzheimer's Disease Detection Using LLM-Generated Synthetic Data and MultiLevel Embeddings

Alessandro Santopaolo
2025

2025 International Joint Conference on Neural Networks (IJCNN)
Alzheimer’s disease; large language models; data augmentation; machine learning; BERT
04 Publication in conference proceedings::04b Conference paper in volume

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1754397