
Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

Luca Moroni; Pere-Lluis Huguet Cabot; Andrei Stefan Bejgu; Edoardo Barba; Roberto Navigli
2025

Abstract

The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for English. While state-of-the-art LLMs can handle other languages, owing to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, which leads to inefficient encoding (high token "fertility") and slower inference. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for Italian, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following vocabulary adaptation, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on a variety of multiple-choice and generative tasks.
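The two central technical ideas in the abstract, token fertility and embedding mapping for vocabulary substitution, can be illustrated with short sketches. First, a minimal fertility measurement in Python, assuming the Hugging Face `transformers` library and the public `mistralai/Mistral-7B-v0.1` checkpoint; the sample sentence is an illustrative choice, not taken from the paper:

```python
# Illustrative sketch: token "fertility" = tokens produced per whitespace word.
# An English-centric tokenizer typically yields higher fertility on Italian
# text, i.e. less efficient encoding and slower inference.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

text = "La rapida volpe marrone salta sopra il cane pigro."  # assumed example
words = text.split()
tokens = tokenizer.tokenize(text)

fertility = len(tokens) / len(words)  # higher = less efficient encoding
print(f"{len(tokens)} tokens / {len(words)} words = fertility {fertility:.2f}")
```

Second, a hypothetical sketch of semantic-alignment vocabulary substitution in the spirit of SAVA. The abstract does not detail SAVA's exact procedure, so this shows only a generic least-squares mapping between a helper embedding space and the LLM's embedding space, fitted on tokens shared by both vocabularies and then applied to the new target-language tokens; all dimensions and the random data are placeholders:

```python
# Hypothetical sketch (not the paper's exact method): initialize embedding
# rows for new vocabulary items by learning a linear map W that sends helper
# embeddings of shared tokens onto the LLM's embeddings of those same tokens.
import numpy as np

rng = np.random.default_rng(0)
d_helper, d_llm = 300, 4096           # assumed embedding sizes
n_shared, n_new = 20_000, 12_000      # assumed overlap / new-token counts

H_shared = rng.normal(size=(n_shared, d_helper))  # helper embs, shared tokens
E_shared = rng.normal(size=(n_shared, d_llm))     # LLM embs, same tokens
H_new = rng.normal(size=(n_new, d_helper))        # helper embs, new tokens

# W minimizes ||H_shared @ W - E_shared||_F (ordinary least squares).
W, *_ = np.linalg.lstsq(H_shared, E_shared, rcond=None)
E_new = H_new @ W  # initialization for the new tokens' embedding rows
print(E_new.shape)  # (12000, 4096)
```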
2025
NAACL 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
LLMs; vocabulary adaptation; Italian
04 Conference proceedings publication::04b Conference paper in volume
Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation / Moroni, Luca; Puccetti, Giovanni; Huguet Cabot, Pere-Lluis; Bejgu, Andrei Stefan; Miaschi, Alessio; Barba, Edoardo; Dell'Orletta, Felice; Esuli, Andrea; Navigli, Roberto. - (2025), pp. 6646-6660. (Paper presented at NAACL 2025, the Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, held in Albuquerque, New Mexico, United States of America) [10.18653/v1/2025.findings-naacl.371].
Files attached to this product

File: Moroni_Optimizing-LLMs_2025.pdf
Access: open access
Note: https://aclanthology.org/2025.findings-naacl.371.pdf
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 905 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1739586