This paper presents and analyzes an annotated corpus of definitions, created to train an algorithm for the automatic extraction of definitions and hypernyms fromWeb documents. As an additional resource, we also include a corpus of non-definitions with syntactic patterns similar to those of definition sentences, e.g.: "An android is a robot" vs. "Snowcap is unmistakable". Domain and style independence is obtained thanks to the annotation of a sample of the Wikipedia corpus and to a novel pattern generalization algorithm based on word-class lattices (WCL). A lattice is a directed acyclic graph (DAG), a subclass of nondeterministic finite state automata (NFA). The lattice structure has the purpose of preserving the salient differences among distinct sequences, while eliminating redundant information. The WCL algorithm will be integrated into an improved version of the GlossExtractor Web application (Velardi et al., 2008). This paper is mostly concerned with a description of the corpus, the annotation strategy, and a linguistic analysis of the data. A summary of the WCL algorithm is also provided for the sake of completeness.

An Annotated Dataset for Extracting Definitions and Hypernyms from the Web / NAVIGLI, ROBERTO; VELARDI, Paola; J., RUIZ MARTINEZ. - (2010), pp. 3716-3722. ((Intervento presentato al convegno 7th international conference on Language Resources and Evaluation (LREC 2010) tenutosi a Valletta; Malta.

An Annotated Dataset for Extracting Definitions and Hypernyms from the Web

NAVIGLI, ROBERTO;VELARDI, Paola;
2010

Abstract

This paper presents and analyzes an annotated corpus of definitions, created to train an algorithm for the automatic extraction of definitions and hypernyms fromWeb documents. As an additional resource, we also include a corpus of non-definitions with syntactic patterns similar to those of definition sentences, e.g.: "An android is a robot" vs. "Snowcap is unmistakable". Domain and style independence is obtained thanks to the annotation of a sample of the Wikipedia corpus and to a novel pattern generalization algorithm based on word-class lattices (WCL). A lattice is a directed acyclic graph (DAG), a subclass of nondeterministic finite state automata (NFA). The lattice structure has the purpose of preserving the salient differences among distinct sequences, while eliminating redundant information. The WCL algorithm will be integrated into an improved version of the GlossExtractor Web application (Velardi et al., 2008). This paper is mostly concerned with a description of the corpus, the annotation strategy, and a linguistic analysis of the data. A summary of the WCL algorithm is also provided for the sake of completeness.
File allegati a questo prodotto
File Dimensione Formato  
VE_2010_11573-186275.pdf

solo gestori archivio

Tipologia: Versione editoriale (versione pubblicata con il layout dell'editore)
Licenza: Tutti i diritti riservati (All rights reserved)
Dimensione 416.72 kB
Formato Adobe PDF
416.72 kB Adobe PDF   Visualizza/Apri   Richiedi una copia

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: http://hdl.handle.net/11573/186275
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus 34
  • ???jsp.display-item.citation.isi??? 2
social impact