We provide the first systematic assessment of data leakage issues in the use of machine learning on panel data. Our organising framework clarifies why neglecting the cross-sectional and longitudinal structure of these data leads to hard-to-detect data leakage, inflated out-of-sample performance, and an inadvertent overestimation of the real-world usefulness and applicability of machine learning models. We then offer empirical guidelines for practitioners to ensure the correct implementation of supervised machine learning in panel data environments. An empirical application, using data from over 3000 U.S. counties spanning 2000 to 2019 and focused on income prediction, illustrates the practical relevance of these points across nearly 500 models for both classification and regression tasks.
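As a rough illustration of the leakage mechanism described above, the sketch below compares a row-level random split with a time-ordered split on a toy panel. It is not the authors' procedure. The helper names (`make_panel`, `random_split`, `temporal_split`, `trains_on_future`), the toy dimensions, and the 2014 cutoff are all assumptions made for this example.

```python
import random

def make_panel(n_units=5, years=range(2000, 2020)):
    """Toy panel: one row per (unit, year) pair; hypothetical data."""
    return [(unit, year) for unit in range(n_units) for year in years]

def random_split(rows, test_frac=0.25, seed=0):
    """Row-level random split that ignores the panel structure."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_frac)
    return shuffled[n_test:], shuffled[:n_test]

def temporal_split(rows, cutoff_year):
    """Train on years up to the cutoff, test on strictly later years."""
    train = [r for r in rows if r[1] <= cutoff_year]
    test = [r for r in rows if r[1] > cutoff_year]
    return train, test

def trains_on_future(train, test):
    """True if any training row is dated at or after the earliest test year."""
    first_test_year = min(year for _, year in test)
    return any(year >= first_test_year for _, year in train)

panel = make_panel()
rand_train, rand_test = random_split(panel)
time_train, time_test = temporal_split(panel, cutoff_year=2014)

# The random split mixes years across train and test; the temporal split does not.
print("random split trains on future years:", trains_on_future(rand_train, rand_test))
print("temporal split trains on future years:", trains_on_future(time_train, time_test))
```

The check only covers the longitudinal dimension. Whether the same units may appear in both sets depends on the prediction task, so cross-sectional separation is deliberately left out of this sketch.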

On the (Mis)Use of Machine Learning With Panel Data / Cerqua, A.; Letta, M.; Pinto, G. - In: OXFORD BULLETIN OF ECONOMICS AND STATISTICS. - ISSN 0305-9049. - (2025). [10.1111/obes.70019]

On the (Mis)Use of Machine Learning With Panel Data

Cerqua A.; Letta M.; Pinto G.
2025

Keywords: data leakage; machine learning; panel data; prediction policy problems
Type: 01 Journal publication::01a Journal article
Files attached to this record
File: Cerqua_Letta_Pinto (2025) On the Mis Use of Machine Learning With Panel Data.pdf
Access: open access
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 8.03 MB
Format: Adobe PDF


Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1747867