Just “OneSeC” for producing multilingual Sense-Annotated Data

Scarlini, Bianca; Pasini, Tommaso; Navigli, Roberto

doi:10.18653/v1/P19-1069

The well-known problem of knowledge acquisition is one of the biggest issues in Word Sense Disambiguation (WSD), where annotated data are still scarce in English and almost absent in other languages. In this paper we formulate the assumption of One Sense per Wikipedia Category and present OneSeC, a language-independent method for the automatic extraction of hundreds of thousands of sentences in which a target word is tagged with its meaning. Our automatically-generated data consistently lead a supervised WSD model to state-of-the-art performance when compared with other automatic and semi-automatic methods. Moreover, our approach outperforms its competitors on multilingual and domain-specific settings, where it beats the existing state of the art on all languages and most domains. All the training data are available for research purposes at http://trainomatic.org/onesec.

Just “OneSeC” for producing multilingual Sense-Annotated Data / Scarlini, B., Pasini, T., Navigli, R. - (2019), pp. 699-709. (57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy) [10.18653/v1/P19-1069].

	Publication year
		2019
	Conference name
		57th Annual Meeting of the Association for Computational Linguistics, ACL 2019
	Keywords
		natural language processing; word sense disambiguation; multilinguality
	Type
		04 Publication in conference proceedings::04b Conference paper in volume
	Citation
		Just “OneSeC” for producing multilingual Sense-Annotated Data / Scarlini, B., Pasini, T., Navigli, R. - (2019), pp. 699-709. (57th Annual Meeting of the Association for Computational Linguistics, ACL 2019, Florence, Italy) [10.18653/v1/P19-1069].
	Belongs to type
		04b Conference paper in volume

Files attached to this record

File	Size	Format
Scarlini_Just_2019.pdf (open access; type: publisher's version, published with the publisher's layout; license: Creative Commons)	417.55 kB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1349845
