LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA

Bonomo, Tommaso; Gioffre, Luca; Navigli, Roberto

doi:10.18653/v1/2025.emnlp-main.1729

Question Answering (QA) on narrative text poses a unique challenge to current systems, requiring a deep understanding of long, complex documents. However, the reliability of NarrativeQA, the most widely used benchmark in this domain, is hindered by noisy documents and flawed QA pairs. In this work, we introduce LiteraryQA, a high-quality subset of NarrativeQA focused on literary works. Using a human- and LLM-validated pipeline, we identify and correct low-quality QA samples while removing extraneous text from source documents. We then carry out a meta-evaluation of automatic metrics to clarify how systems should be evaluated on LiteraryQA. This analysis reveals that all n-gram-based metrics have a low system-level correlation with human judgment, while LLM-as-a-Judge evaluations, even with small open-weight models, can strongly agree with the ranking identified by humans. Finally, we benchmark a set of long-context LLMs on LiteraryQA. We release our code and data at https://github.com/sapienzaNLP/LiteraryQA.

LiteraryQA: Towards Effective Evaluation of Long-document Narrative QA / Bonomo, Tommaso; Gioffre, Luca; Navigli, Roberto. - Volume 1: Long Papers (2025), pp. 34074-34095. (Paper presented at the conference Empirical Methods in Natural Language Processing, held in Suzhou, China) [10.18653/v1/2025.emnlp-main.1729].

Publication year: 2025
Conference name: Empirical Methods in Natural Language Processing
Keywords: question answering; benchmark; narrative; evaluation methodologies
Type: 04 Publication in conference proceedings::04b Conference paper in volume

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1755709
