Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering / Molfese, Francesco Maria; Moroni, Luca; Gioffré, Luca; Scirè, Alessandro; Conia, Simone; Navigli, Roberto. - (2025), pp. 18477-18494. (Association for Computational Linguistics, Vienna, Austria) [10.18653/v1/2025.findings-acl.950].

Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering

Molfese Francesco Maria; Moroni Luca; Gioffré Luca; Scirè Alessandro; Conia Simone; Navigli Roberto
2025

Abstract

One of the most widely used tasks for evaluating Large Language Models (LLMs) is Multiple-Choice Question Answering (MCQA). While open-ended question answering tasks are more challenging to evaluate, MCQA tasks are, in principle, easier to assess, as the model's answer is thought to be simple to extract and is compared directly to a set of predefined choices. However, recent studies have started to question the reliability of MCQA evaluation, showing that multiple factors can significantly impact the reported performance of LLMs, especially when the model generates free-form text before selecting one of the answer choices. In this work, we shed light on the inconsistencies of MCQA evaluation strategies, which can lead to inaccurate and misleading model comparisons. We systematically analyze whether existing answer extraction methods are aligned with human judgment, and how they are influenced by answer constraints in the prompt across different domains. Our experiments demonstrate that traditional evaluation strategies often underestimate LLM capabilities, while LLM-based answer extractors are prone to systematic errors. Moreover, we reveal a fundamental trade-off between including format constraints in the prompt to simplify answer extraction and allowing models to generate free-form text to improve reasoning. Our findings call for standardized evaluation methodologies and highlight the need for more reliable and consistent MCQA evaluation practices.
2025
Association for Computational Linguistics
LLM, Evaluation, MCQA
04 Conference proceedings publication::04b Conference paper in a volume
Files attached to this record

Molfese_Right-Answer_2025.pdf
Open access
Note: DOI: 10.18653/v1/2025.findings-acl.950
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 725.28 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1744349