ITA-Bench: Towards a More Comprehensive Evaluation for Italian LLMs / Moroni, Luca; Conia, Simone; Martelli, Federico; Navigli, Roberto. - (2024). (Paper presented at the Italian Conference on Computational Linguistics (CLiC-it 2024), held in Pisa, Italy).
ITA-Bench: Towards a More Comprehensive Evaluation for Italian LLMs
Luca Moroni (co-first author; Resources)
Simone Conia (co-first author; Resources)
Federico Martelli (penultimate author; Resources)
Roberto Navigli (last author; Supervision)
2024
Abstract
Recent Large Language Models (LLMs) have shown impressive performance in addressing complex aspects of human language. These models have also demonstrated significant capabilities in processing and generating Italian text, achieving state-of-the-art results on current benchmarks for the Italian language. However, the number and quality of such benchmarks are still insufficient. A case in point is the “Open Ita LLM Leaderboard”, which supports only three benchmarks despite being one of the most popular evaluation suites for Italian-language LLMs. In this paper, we analyze the current limitations of existing evaluation suites and propose two ways of addressing this gap: i) a new suite of automatically-translated benchmarks, drawn from the most popular English benchmarks; and ii) the adaptation of existing manual datasets so that they can be used to complement the evaluation of Italian LLMs. We discuss the pros and cons of both approaches, and we release our data to foster further research on the evaluation of Italian-language LLMs.
| File | Size | Format |
|---|---|---|
| Moroni_ITA-Bench_2024.pdf | 319.99 kB | Adobe PDF |

Access: open access
Note: https://ceur-ws.org/Vol-3878/66_main_long.pdf
Type: publisher's version (published version with the publisher's layout)
License: Creative Commons
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.