Nibbling at the Hard Core of Word Sense Disambiguation / Maru, Marco; Conia, Simone; Bevilacqua, Michele; Navigli, Roberto. - 1:(2022), pp. 4724-4737. (Paper presented at the Association for Computational Linguistics conference, held in Dublin, Ireland) [10.18653/v1/2022.acl-long.324].

Nibbling at the Hard Core of Word Sense Disambiguation

Conia, Simone; Navigli, Roberto (last author)
2022

Abstract

With state-of-the-art systems having finally attained estimated human performance, Word Sense Disambiguation (WSD) has now joined the array of Natural Language Processing tasks that have seemingly been solved, thanks to the vast amounts of knowledge encoded into Transformer-based pre-trained language models. And yet, if we look below the surface of raw figures, it is easy to realize that current approaches still make trivial mistakes that a human would never make. In this work, we provide evidence showing why the F1 score metric should not simply be taken at face value and present an exhaustive analysis of the errors that seven of the most representative state-of-the-art systems for English all-words WSD make on traditional evaluation benchmarks. In addition, we produce and release a collection of test sets featuring (a) an amended version of the standard evaluation benchmark that fixes its lexical and semantic inaccuracies, (b) 42D, a challenge set devised to assess the resilience of systems with respect to least frequent word senses and senses not seen at training time, and (c) hardEN, a challenge set made up solely of instances which none of the investigated state-of-the-art systems can solve. We make all of the test sets and model predictions available to the research community at https://github.com/SapienzaNLP/wsd-hard-benchmark.
2022
Association for Computational Linguistics
word sense disambiguation; semantics; natural language processing; benchmark
04 Publication in conference proceedings::04b Conference paper in volume
Files attached to this item
File Size Format
Muru_Nibbling_2022.pdf

open access

Note: Link to the publication: https://aclanthology.org/2022.acl-long.324/
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 358.46 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1639906
Citations
  • PMC N/A
  • Scopus 18
  • Web of Science (ISI) 5