Enhancing Alzheimer's Disease Detection Using LLM-Generated Synthetic Data and MultiLevel Embeddings

Venkata Sai Bhargav Mutala; Seyedamin Pouriyeh; Reza M. Parizi; Chloe Yixin Xie; Alessandro Santopaolo; Ilaria Basile; Giovanna Sannino

doi:10.1109/IJCNN64981.2025.11229368

Abstract: Alzheimer’s disease represents a growing global health concern, emphasizing the need for early diagnosis to mitigate neurocognitive decline. Speech analysis has emerged as a promising, non-invasive approach, yet limited data availability hinders the development of robust Machine Learning (ML) models. To address this challenge, this study exploits the potential of Large Language Models (LLMs): both their ability to generate synthetic data and their capacity to extract complex linguistic features from speech. We employ GPT-4 to generate synthetic transcripts, thus expanding the ADReSS2020 dataset and enhancing its diversity while preserving semantic and structural coherence. Moreover, we propose a novel multilevel feature extraction framework that integrates Bidirectional Encoder Representations from Transformers (BERT) embeddings fine-tuned with linguistic features obtained through Computerized Language Analysis (CLAN). The study involved two experiments: the first to identify the optimal feature extraction strategy, and the second to evaluate the impact of synthetic data generated by GPT-4. In both experiments, the performance of five classifiers was evaluated to determine the most effective configuration. Our results demonstrated that fine-tuned BERT embeddings slightly improve classification performance compared to pre-trained models, highlighting the value of domain-specific fine-tuning. Although adding CLAN-like linguistic features yielded limited benefits, GPT-4-generated synthetic data demonstrated promising potential, particularly when combined with sentence embeddings. Classifiers such as Random Forest showed an improvement in accuracy, increasing from 0.79 to 0.88 when using the augmented dataset. This study paves the way for the use of LLMs to expand the diversity of datasets and improve the robustness of ML models in clinical applications.
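The abstract describes two ideas that a small example can make concrete: joining embedding-level and linguistic-level features into one vector, and enlarging the training data with synthetic samples. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names are hypothetical, the vectors are made-up toy values standing in for BERT embeddings and CLAN-like features, and a nearest-centroid rule stands in for the five classifiers the study compared. Keeping the evaluation split free of synthetic samples is also an assumption of this sketch; the abstract does not state how evaluation was split.

```python
from statistics import mean


def multilevel_features(embedding, linguistic):
    """Concatenate a sentence embedding with hand-crafted linguistic features."""
    return list(embedding) + list(linguistic)


def augment_training_set(real_train, synthetic):
    """Append synthetic samples to the training split only (assumption of this sketch)."""
    return list(real_train) + list(synthetic)


def fit_centroids(samples):
    """Stand-in classifier: compute one mean feature vector per label."""
    by_label = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {
        label: [mean(column) for column in zip(*rows)]
        for label, rows in by_label.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda label: sq_dist(centroids[label], features))


def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(predict(centroids, f) == y for f, y in samples) / len(samples)


# Toy placeholder vectors (invented for illustration, not data from the study).
real_train = [
    (multilevel_features([0.9, 0.1], [0.2]), "AD"),
    (multilevel_features([0.1, 0.8], [0.7]), "control"),
]
synthetic = [
    (multilevel_features([0.8, 0.2], [0.3]), "AD"),
    (multilevel_features([0.2, 0.9], [0.6]), "control"),
]
real_test = [
    (multilevel_features([0.85, 0.15], [0.25]), "AD"),
    (multilevel_features([0.15, 0.85], [0.65]), "control"),
]

baseline = fit_centroids(real_train)
augmented = fit_centroids(augment_training_set(real_train, synthetic))

print("baseline accuracy:", accuracy(baseline, real_test))
print("augmented accuracy:", accuracy(augmented, real_test))
```

With real data, the two accuracy figures would be compared per classifier, which is the kind of comparison the abstract reports for Random Forest on the original and augmented datasets.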

Enhancing Alzheimer's Disease Detection Using LLM-Generated Synthetic Data and MultiLevel Embeddings / Sai Bhargav Mutala, Venkata; Pouriyeh, Seyedamin; Parizi, Reza M.; Yixin Xie, Chloe; Santopaolo, Alessandro; Basile, Ilaria; Sannino, Giovanna. - (2025). (Paper presented at the 2025 International Joint Conference on Neural Networks (IJCNN), held in Rome) [10.1109/IJCNN64981.2025.11229368].



Year of publication: 2025
Conference name: 2025 International Joint Conference on Neural Networks (IJCNN)
Keywords: Alzheimer’s disease; large language models; data augmentation; machine learning; BERT
Type: 04 Publication in conference proceedings::04b Conference paper in volume
			


Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1754397
