Research Products Catalogue

What We Learned from Continually Training Minerva: A Case Study on Italian / Moroni, L., Bonomo, T., Gioffre, L., Xu, L., Fedele, D., Colosi, L., Bejgu, A.S., Scire, A., Navigli, R. - (2025), pp. 760-774. (The Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), Cagliari, Italy).

What We Learned from Continually Training Minerva: A Case Study on Italian

Luca Moroni; Tommaso Bonomo; Luca Gioffre; Lu Xu; Domenico Fedele; Leonardo Colosi; Andrei Stefan Bejgu; Alessandro Scire; Roberto Navigli

2025

Abstract

Modern Large Language Models (LLMs) are commonly trained through a multi-stage pipeline encompassing pretraining and supervised finetuning. While recent studies have extensively investigated the benefits of continual pretraining on high-quality data, these efforts have focused primarily on English. In this work, we explore the effectiveness of various data mixtures in a continual pretraining setting to enhance performance on Italian-language tasks. Leveraging Minerva-7B, a fully open-source LLM pretrained on a corpus composed of 50% Italian text, we define and evaluate three distinct data recipes (comprising mathematical, encyclopedic, and copyrighted content) spanning both Italian and English. We also investigate how extending the model’s context window during continual pretraining affects its ability to handle long-context tasks. To support our evaluation, we introduce INDAQA, a new benchmark for narrative question answering in Italian. Our results reveal that both data composition and increased context length substantially improve performance, offering valuable insights into continual pretraining strategies for less-represented languages within an open scientific framework.

Full record

Publication year: 2025

Conference name: the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)

Keywords: LLMs, Continual-Training, Evaluation, Culturality

Type: 04 Publication in conference proceedings::04b Conference paper in volume

Citation: What We Learned from Continually Training Minerva: A Case Study on Italian / Moroni, L., Bonomo, T., Gioffre, L., Xu, L., Fedele, D., Colosi, L., Bejgu, A.S., Scire, A., Navigli, R. - (2025), pp. 760-774. (The Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), Cagliari, Italy).

Files attached to this product

There are no files associated with this product.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1768951

Warning

The data displayed have not been validated by the university.
