
Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

Luca Moroni; Pere-Lluis Huguet Cabot; Andrei Stefan Bejgu; Edoardo Barba; Roberto Navigli
2025

Abstract

The number of pretrained Large Language Models (LLMs) is increasing steadily, though the majority are designed predominantly for English. While state-of-the-art LLMs can handle other languages, owing to language contamination or some degree of multilingual pretraining data, they are not optimized for non-English languages, which leads to inefficient encoding (high token "fertility") and slower inference. In this work, we thoroughly compare a variety of vocabulary adaptation techniques for optimizing English LLMs for Italian, and put forward Semantic Alignment Vocabulary Adaptation (SAVA), a novel method that leverages neural mapping for vocabulary substitution. SAVA achieves competitive performance across multiple downstream tasks, enhancing grounded alignment strategies. We adapt two LLMs: Mistral-7B-v0.1, reducing token fertility by 25%, and Llama-3.1-8B, optimizing the vocabulary and reducing the number of parameters by 1 billion. We show that, following vocabulary adaptation, these models can recover their performance with a relatively limited stage of continual training on the target language. Finally, we test the capabilities of the adapted models on a variety of multiple-choice and generative tasks.
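The two central technical ideas in the abstract, token fertility and embedding mapping for vocabulary substitution, can be illustrated with short sketches. First, a minimal fertility measurement in Python, assuming the Hugging Face `transformers` library and the public `mistralai/Mistral-7B-v0.1` checkpoint; the sample sentence is an illustrative choice, not taken from the paper:

```python
# Illustrative sketch: token "fertility" = tokens produced per whitespace word.
# An English-centric tokenizer typically yields higher fertility on Italian
# text, i.e. less efficient encoding and slower inference.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

text = "La rapida volpe marrone salta sopra il cane pigro."  # assumed example
words = text.split()
tokens = tokenizer.tokenize(text)

fertility = len(tokens) / len(words)  # higher = less efficient encoding
print(f"{len(tokens)} tokens / {len(words)} words = fertility {fertility:.2f}")
```

Second, a hypothetical sketch of semantic-alignment vocabulary substitution in the spirit of SAVA. The abstract does not detail SAVA's exact procedure, so this shows only a generic least-squares mapping between a helper embedding space and the LLM's embedding space, fitted on tokens shared by both vocabularies and then applied to the new target-language tokens; all dimensions and the random data are placeholders:

```python
# Hypothetical sketch (not the paper's exact method): initialize embedding
# rows for new vocabulary items by learning a linear map W that sends helper
# embeddings of shared tokens onto the LLM's embeddings of those same tokens.
import numpy as np

rng = np.random.default_rng(0)
d_helper, d_llm = 300, 4096           # assumed embedding sizes
n_shared, n_new = 20_000, 12_000      # assumed overlap / new-token counts

H_shared = rng.normal(size=(n_shared, d_helper))  # helper embs, shared tokens
E_shared = rng.normal(size=(n_shared, d_llm))     # LLM embs, same tokens
H_new = rng.normal(size=(n_new, d_helper))        # helper embs, new tokens

# W minimizes ||H_shared @ W - E_shared||_F (ordinary least squares).
W, *_ = np.linalg.lstsq(H_shared, E_shared, rcond=None)
E_new = H_new @ W  # initialization for the new tokens' embedding rows
print(E_new.shape)  # (12000, 4096)
```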
2025
NAACL 2025 Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics
LLMs; vocabulary adaptation; Italian
04 Conference proceedings publication::04b Conference paper in volume
Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation / Moroni, Luca; Puccetti, Giovanni; Huguet Cabot, Pere-Lluis; Bejgu, Andrei Stefan; Miaschi, Alessio; Barba, Edoardo; Dell'Orletta, Felice; Esuli, Andrea; Navigli, Roberto. - (2025), pp. 6646-6660. (Paper presented at NAACL 2025, the Annual Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics, held in Albuquerque, New Mexico, United States of America) [10.18653/v1/2025.findings-naacl.371].
Files attached to this product

File: Moroni_Optimizing-LLMs_2025.pdf
Access: open access
Note: https://aclanthology.org/2025.findings-naacl.371.pdf
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 905 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1739586