
Multi-LMentry: Can Multilingual LLMs Solve Elementary Tasks Across Languages? / Moroni, Luca; Aula-Blasco, Javier; Conia, Simone; Baucells, Irene; Perez, Naiara; Paniagua Suárez, Silvia; Sallés, Anna; Ostendorff, Malte; Falcão, Júlia; Son, Guijin; Gonzalez-Agirre, Aitor; Navigli, Roberto; Villegas, Marta. - (2025), pp. 34126-34157. (Conference on Empirical Methods in Natural Language Processing, Suzhou, China) [10.18653/v1/2025.emnlp-main.1731].

Multi-LMentry: Can Multilingual LLMs Solve Elementary Tasks Across Languages?

Moroni, Luca; Conia, Simone; Ostendorff, Malte; Navigli, Roberto; Villegas, Marta
2025

Abstract

As large language models (LLMs) continue to improve, their evaluation increasingly centers on complex, high-level tasks, often at the expense of systematically assessing fundamental capabilities. To address this gap, recent work proposed LMentry, a compact benchmark comprising tasks that are trivial for humans but remain surprisingly difficult for LLMs. However, LMentry is limited to English, leaving its insights linguistically narrow. In this paper, we present Multi-LMentry, a ground-up recreation of LMentry that enables systematic evaluation of LLMs on basic reasoning and understanding tasks across nine diverse languages. Multi-LMentry includes English and expands to Basque, Brazilian Portuguese, Catalan, Galician, German, Italian, Korean, and Spanish, emphasizing the importance of cross-lingual and low-resource settings. To validate that Multi-LMentry is still trivial for humans, we demonstrate that L2 speakers with only elementary proficiency achieve near-perfect scores in a low-resource language, namely, Basque. Through extensive experiments, we reveal that state-of-the-art open-weight multilingual LLMs still fall short of human performance on elementary tasks in many languages. Our results expose new failure modes that remain hidden in monolingual evaluation, underscoring the need for rigorous, language-diverse "unit tests" of core model abilities.
Conference on Empirical Methods in Natural Language Processing
LLM; Evaluation; Elementary Tasks
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item

File: Moroni_Can-multilingual_2025.pdf
Access: open access
Note: DOI: 10.18653/v1/2025.emnlp-main.1731
Type: Publisher's version (published version with the publisher's layout)
License: All rights reserved
Size: 471.04 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1760685