
Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data

Riccardo Orlando; Luca Moroni; Edoardo Barba; Simone Conia; Roberto Navigli
2024

Abstract

The growing interest in Large Language Models (LLMs) has accelerated research efforts to adapt these models for various languages. Despite this, pretraining LLMs from scratch for non-English languages remains underexplored. This is the case for Italian, where no truly open-source research has investigated the pretraining process. To address this gap, we introduce Minerva (https://nlp.uniroma1.it/minerva), the first family of LLMs trained entirely from scratch on native Italian texts. Our work is the first investigation into the challenges and opportunities of pretraining LLMs specifically for the Italian language, offering insights into vocabulary design, data composition, and model development. With Minerva, we demonstrate that building an LLM tailored to a specific language yields numerous practical benefits over adapting existing multilingual models, including greater control over the model’s vocabulary and the composition of its training data. We provide an overview of the design choices, pretraining methods, and evaluation metrics used to develop Minerva, which shows promising performance on Italian benchmarks and downstream tasks. Moreover, we share the lessons learned throughout Minerva’s development to support the academic and industrial communities in advancing non-English LLM research. We believe that Minerva serves as an important step towards closing the gap in high-quality, open-source LLMs for non-English languages.
2024
Italian Conference on Computational Linguistics
natural language processing; artificial intelligence; large language models; multilingual NLP
04 Publication in conference proceedings::04b Conference paper in volume
Minerva LLMs: The First Family of Large Language Models Trained from Scratch on Italian Data / Orlando, Riccardo; Moroni, Luca; Huguet Cabot, Pere-Lluís; Barba, Edoardo; Conia, Simone; Orlandini, Sergio; Fiameni, Giuseppe; Navigli, Roberto. - (2024). (Paper presented at the Italian Conference on Computational Linguistics, held in Pisa, Italy).
Files attached to this record
File: Orlando_Minerva_2024.pdf
Access: open access
Note: https://ceur-ws.org/Vol-3878/76_main_long.pdf
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 386 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1728121