The impact of imputation quality on machine learning classifiers for datasets with missing values / Shadbahr, T.; Roberts, M.; Stanczuk, J.; Gilbey, J.; Teare, P.; Dittmer, S.; Thorpe, M.; Torne, R. V.; Sala, E.; Lio, P.; Patel, M.; Preller, J.; Lio, P.; Walton, N.; Holzer, M.; Wassin, M.; Escudero Sanchez, L.; Babar, J.; Prosch, H.; Yang, G.; Langs, G.; Jefferson, E.; Korhonen, A.; Gkrania-Klotsas, E.; Weir-McCall, J. R.; Breger, A.; Selby, I.; Rudd, J. H. F.; Mirtti, T.; Rannikko, A. S.; Aston, J. A. D.; Tang, J.; Schonlieb, C.-B. - In: COMMUNICATIONS MEDICINE. - ISSN 2730-664X. - 3:1(2023). [10.1038/s43856-023-00356-z]

The impact of imputation quality on machine learning classifiers for datasets with missing values

Lio P.
2023

Abstract

Background: Classifying samples in incomplete datasets is a common aim for machine learning practitioners, but is non-trivial. Missing data is found in most real-world datasets and these missing values are typically imputed using established methods, followed by classification of the now complete samples. The focus of the machine learning researcher is to optimise the classifier's performance.

Methods: We utilise three simulated and three real-world clinical datasets with different feature types and missingness patterns. Initially, we evaluate how the downstream classifier performance depends on the choice of classifier and imputation methods. We employ ANOVA to quantitatively evaluate how the choice of missingness rate, imputation method, and classifier method influences the performance. Additionally, we compare commonly used methods for assessing imputation quality and introduce a class of discrepancy scores based on the sliced Wasserstein distance. We also assess the stability of the imputations and the interpretability of models built on the imputed data.

Results: The performance of the classifier is most affected by the percentage of missingness in the test data, with a considerable performance decline observed as the test missingness rate increases. We also show that the commonly used measures for assessing imputation quality tend to lead to imputed data which poorly matches the underlying data distribution, whereas our new class of discrepancy scores performs much better on this measure. Furthermore, we show that the interpretability of classifier models trained using poorly imputed data is compromised.

Conclusions: It is imperative to consider the quality of the imputation when performing downstream classification, as the effects on the classifier can be considerable.
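For illustration, below is a minimal sketch of a sliced Wasserstein discrepancy between complete data and its imputed counterpart, of the kind the abstract describes. It is not the authors' released implementation, and the names sliced_wasserstein, X_true, X_imputed and n_projections are illustrative assumptions. The score projects both samples onto random one-dimensional directions and averages the resulting 1-D Wasserstein distances.

    # Minimal sketch (illustrative, not the paper's code): estimate the
    # sliced Wasserstein discrepancy between complete and imputed data
    # by averaging 1-D Wasserstein distances over random projections.
    import numpy as np
    from scipy.stats import wasserstein_distance  # exact 1-D Wasserstein distance

    def sliced_wasserstein(X_true, X_imputed, n_projections=100, seed=0):
        """Monte Carlo estimate of the sliced Wasserstein discrepancy."""
        rng = np.random.default_rng(seed)
        d = X_true.shape[1]
        total = 0.0
        for _ in range(n_projections):
            # Draw a random direction uniformly on the unit sphere in R^d.
            theta = rng.standard_normal(d)
            theta /= np.linalg.norm(theta)
            # Compare the 1-D empirical distributions of the projections.
            total += wasserstein_distance(X_true @ theta, X_imputed @ theta)
        return total / n_projections

A lower score indicates that the imputed sample better matches the distribution of the complete data, which is the property the abstract argues the commonly used quality measures (often pointwise errors such as RMSE) fail to reward; the average over random directions is a Monte Carlo approximation of the integral over the unit sphere.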
Medicine
01 Journal publication::01a Journal article
Files attached to this record
File: Shadbahr_The-impact_2023.pdf (open access)
Note: https://www.nature.com/articles/s43856-023-00356-z.pdf
Type: Publisher's version (the published version with the publisher's layout)
License: Creative Commons
Size: 2.28 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1721241
Citations
  • PubMed Central: 5
  • Scopus: 20
  • Web of Science: 17