XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization

Raganato, Alessandro; Pasini, Tommaso; Camacho-Collados, Jose; Pilehvar, Mohammad Taher

doi:10.18653/v1/2020.emnlp-main.584

The ability to correctly model distinct meanings of a word is crucial for the effectiveness of semantic representation techniques. However, most existing evaluation benchmarks for assessing this criterion are tied to sense inventories (usually WordNet), restricting their usage to a small subset of knowledge-based representation techniques. The Word-in-Context dataset (WiC) addresses the dependence on sense inventories by reformulating the standard disambiguation task as a binary classification problem; however, it is limited to the English language. We put forward a large multilingual benchmark, XL-WiC, featuring gold standards in 12 new languages from varied language families and with different degrees of resource availability, opening room for evaluation scenarios such as zero-shot cross-lingual transfer. We perform a series of experiments to determine the reliability of the datasets and to set performance baselines for several recent contextualized multilingual models. Experimental results show that even when no tagged instances are available for a target language, models trained solely on the English data can attain competitive performance in the task of distinguishing different meanings of a word, even for distant languages. XL-WiC is available at https://pilehvar.github.io/xlwic/.

XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization / Raganato, Alessandro; Pasini, Tommaso; Camacho-Collados, Jose; Pilehvar, Mohammad Taher. - (2020), pp. 7193-7206. (Paper presented at the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), held online) [10.18653/v1/2020.emnlp-main.584].

Year of publication: 2020
Conference name: the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Keywords: word sense disambiguation; neural networks; deep learning; multilinguality
Type: 04 Publication in conference proceedings::04b Conference paper in volume

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1550435
