Modern Large Language Models (LLMs) are commonly trained through a multi-stage pipeline encompassing pretraining and supervised fine-tuning. While recent studies have extensively investigated the benefits of continual pretraining on high-quality data, these efforts have focused primarily on English. In this work, we explore the effectiveness of various data mixtures in a continual pretraining setting to enhance performance on Italian-language tasks. Leveraging Minerva-7B, a fully open-source LLM pretrained on a corpus of which 50% is Italian, we define and evaluate three distinct data recipes (comprising mathematical, encyclopedic, and copyrighted content) spanning both Italian and English. We also investigate how extending the model’s context window during continual pretraining affects its ability to handle long-context tasks. To support our evaluation, we introduce INDAQA, a new benchmark for narrative question answering in Italian. Our results reveal that both data composition and increased context length substantially improve performance, offering valuable insights into continual pretraining strategies for less-represented languages within an open scientific framework.

What We Learned from Continually Training Minerva: A Case Study on Italian / Moroni, Luca; Bonomo, Tommaso; Gioffre, Luca; Xu, Lu; Fedele, Domenico; Colosi, Leonardo; Bejgu, Andrei Stefan; Scire, Alessandro; Navigli, Roberto. - (2025), pp. 760-774. (Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025), Cagliari, Italy).

What We Learned from Continually Training Minerva: A Case Study on Italian

Luca Moroni;Tommaso Bonomo;Luca Gioffre;Lu Xu;Leonardo Colosi;Andrei Stefan Bejgu;Alessandro Scire;Roberto Navigli
2025

the Eleventh Italian Conference on Computational Linguistics (CLiC-it 2025)
LLMs, Continual-Training, Evaluation, Culturality
04 Publication in conference proceedings::04b Conference paper in volume

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1768951