Sense-Annotated Corpora for Word Sense Disambiguation in Multiple Languages and Domains / Scarlini, B., Pasini, T., Navigli, R. - (2020), pp. 5905-5911. (12th Language Resources and Evaluation Conference, Marseille, France).

Sense-Annotated Corpora for Word Sense Disambiguation in Multiple Languages and Domains

Bianca Scarlini (first author); Tommaso Pasini (second author); Roberto Navigli (last author)

2020

Abstract

The knowledge acquisition bottleneck severely hampers the creation of sense-annotated data for Word Sense Disambiguation (WSD). Sense-annotated data are scarce for English and almost absent for other languages. This limits the reach of deep-learning approaches, which today underpin most NLP tasks and require large amounts of data. We mitigate this issue and encourage further research in multilingual WSD by releasing to the NLP community five large datasets annotated with word senses in five languages, namely English, French, Italian, German and Spanish, and five distinct datasets in English, each for a different semantic domain. We show that supervised WSD models trained on our data attain higher performance than when trained on other automatically created corpora. We release all our data, containing more than 15 million annotated instances in five languages, at http://trainomatic.org/onesec.

Full record

Year of publication: 2020
Conference name: 12th Language Resources and Evaluation Conference
Keywords: word sense disambiguation; semantics; multilinguality
Type: 04 Publication in conference proceedings::04b Conference paper in volume
Citation: Sense-Annotated Corpora for Word Sense Disambiguation in Multiple Languages and Domains / Scarlini, B., Pasini, T., Navigli, R. - (2020), pp. 5905-5911. (12th Language Resources and Evaluation Conference, Marseille, France).
Belongs to type: 04b Conference paper in volume

Files attached to this record

File	Size	Format
Scarlini_Sense-Annotated_2020.pdf (open access; type: publisher's version, published with the publisher's layout; license: Creative Commons)	299.52 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1431886