Abstract: Alzheimer’s disease is a growing global health concern, which underscores the need for early diagnosis to mitigate neurocognitive decline. Speech analysis has emerged as a promising, non-invasive approach, yet limited data availability hinders the development of robust Machine Learning (ML) models. To address this challenge, this study exploits two capabilities of Large Language Models (LLMs): generating synthetic data and extracting complex linguistic features from speech. We employ GPT-4 to generate synthetic transcripts, expanding the ADReSS2020 dataset and enhancing its diversity while preserving semantic and structural coherence. Moreover, we propose a novel multilevel feature extraction framework that integrates fine-tuned Bidirectional Encoder Representations from Transformers (BERT) embeddings with linguistic features obtained through Computerized Language Analysis (CLAN). The study comprises two experiments: the first identifies the optimal feature extraction strategy, and the second evaluates the impact of synthetic data generated by GPT-4. In both experiments, five classifiers were evaluated to determine the most effective configuration. Our results show that fine-tuned BERT embeddings slightly improve classification performance compared to pre-trained models, highlighting the value of domain-specific fine-tuning. Although adding CLAN-like linguistic features yielded limited benefits, GPT-4-generated synthetic data showed promising potential, particularly when combined with sentence embeddings. With the augmented dataset, the accuracy of classifiers such as Random Forest increased from 0.79 to 0.88. This study paves the way for the use of LLMs to expand the diversity of datasets and improve the robustness of ML models in clinical applications.
Enhancing Alzheimer's Disease Detection Using LLM-Generated Synthetic Data and MultiLevel Embeddings / Sai Bhargav Mutala, Venkata; Pouriyeh, Seyedamin; Parizi, Reza M.; Yixin Xie, Chloe; Santopaolo, Alessandro; Basile, Ilaria; Sannino, Giovanna. - (2025). (Paper presented at the 2025 International Joint Conference on Neural Networks (IJCNN), held in Rome) [10.1109/IJCNN64981.2025.11229368].
Enhancing Alzheimer's Disease Detection Using LLM-Generated Synthetic Data and MultiLevel Embeddings
Alessandro Santopaolo
2025


