Mitigating the Burden of Redundant Datasets via Batch-Wise Unique Samples and Frequency-Aware Losses

Crisostomi, Donato; Caciolai, Andrea; Pedrani, Alessandro; Rottmann, Kay; Manzotti, Alessandro; Palumbo, Enrico; Bernardi, Davide

doi:10.18653/v1/2023.acl-industry.23

Datasets used to train deep learning models in industrial settings often exhibit skewed distributions with some samples repeated a large number of times. This paper presents a simple yet effective solution to reduce the increased burden of repeated computation on redundant datasets. Our approach eliminates duplicates at the batch level, without altering the data distribution observed by the model, making it model-agnostic and easy to implement as a plug-and-play module. We also provide a mathematical expression to estimate the reduction in training time that our approach provides. Through empirical evidence, we show that our approach significantly reduces training times on various models across datasets with varying redundancy factors, without impacting their performance on the Named Entity Recognition task, both on publicly available datasets and in real industrial settings. In the latter, the approach speeds training by up to 87%, and by 46% on average, with a drop in model performance of 0.2% relative at worst. We finally release a modular and reusable codebase to further advance research in this area.
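The batch-level idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' released codebase: the helper names `dedup_batch`, `frequency_weighted_loss`, and `toy_loss` are hypothetical, and the sketch assumes that samples are hashable and that the batch loss is a plain mean of per-sample losses. Under those assumptions, evaluating each unique sample once and weighting its loss by its count in the batch gives the same batch loss as evaluating every duplicate.

```python
from collections import Counter
from typing import Callable, Hashable, List, Sequence, Tuple


def dedup_batch(batch: Sequence[Hashable]) -> Tuple[List[Hashable], List[int]]:
    """Return the unique samples of a batch (first-seen order) and their counts."""
    counts = Counter(batch)          # Counter keeps first-insertion order
    unique = list(counts)
    return unique, [counts[s] for s in unique]


def frequency_weighted_loss(per_sample_losses: Sequence[float],
                            counts: Sequence[int]) -> float:
    """Mean loss over the original batch, computed from unique samples only."""
    total = sum(counts)
    return sum(l * c for l, c in zip(per_sample_losses, counts)) / total


def toy_loss(sample: str) -> float:
    """Hypothetical stand-in for a model's per-sample loss."""
    return float(len(sample))


# A redundant batch: 6 samples, only 3 distinct ones.
batch = ["play music", "stop", "play music", "play music", "stop", "call mom"]

unique, counts = dedup_batch(batch)

# Loss computed the expensive way (one evaluation per sample, duplicates included).
full = sum(toy_loss(s) for s in batch) / len(batch)

# Loss computed on unique samples only, re-weighted by frequency.
dedup = frequency_weighted_loss([toy_loss(s) for s in unique], counts)
```

In this toy batch the per-sample function is evaluated 3 times instead of 6, while `full` and `dedup` agree. The saving in a real setting depends on how much redundancy each batch actually contains.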

Mitigating the Burden of Redundant Datasets via Batch-Wise Unique Samples and Frequency-Aware Losses / Crisostomi, Donato; Caciolai, Andrea; Pedrani, Alessandro; Rottmann, Kay; Manzotti, Alessandro; Palumbo, Enrico; Bernardi, Davide. - (2023), pp. -247. (Paper presented at the 61st Annual Meeting of the Association for Computational Linguistics (ACL), held in Toronto) [10.18653/v1/2023.acl-industry.23].


Crisostomi, Donato (co-first author); Caciolai, Andrea; Pedrani, Alessandro; Rottmann, Kay; Manzotti, Alessandro; Palumbo, Enrico; Bernardi, Davide

2023



	Year of publication
	
				2023
			
	Conference name
	
				The 61st Annual Meeting of the Association for Computational Linguistics (ACL)
			
	Keywords
	
				Deep learning; deduplication; efficient training
			
	Type
	
				04 Publication in conference proceedings::04b Conference paper in volume
			


Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1698997
