The ability to extract hypernymy information on a large scale is becoming increasingly important in natural language processing, an area of the artificial intelligence which deals with the processing and understanding of natural language. While initial studies extracted this type of information from textual corpora by means of lexico-syntactic patterns, over time researchers moved to alternative, more structured sources of knowledge, such as Wikipedia. After the first attempts to extract is-a information fromWikipedia categories, a full line of research gave birth to numerous knowledge bases containing information which, however, is either incomplete or irremediably bound to English. To this end we put forward MultiWiBi, the first approach to the construction of a multilingual bitaxonomy which exploits the inner connection between Wikipedia pages and Wikipedia categories to induce a wide-coverage and fine-grained integrated taxonomy. A series of experiments show state-of-the-art results against all the available taxonomic resources available in the literature, also with respect to two novel measures of comparison. Another dimension where existing resources usually fall short is their degree of multilingualism. While knowledge is typically language agnostic, currently resources are able to extract relevant information only in languages providing highquality tools. In contrast, MultiWiBi does not leave any language behind: we show how to taxonomize Wikipedia in an arbitrary language and in a way that is fully independent of additional resources. At the core of our approach lies, in fact, the idea that the English version of Wikipedia can be linguistically exploited as a pivot to project the taxonomic information extracted from English to any other Wikipedia language in order to have a bitaxonomy in a second, arbitrary language; as a result, not only concepts which have an English equivalent are covered, but also those concepts which are not lexicalized in the source language. We also present the impact of having the taxonomized encyclopedic knowledge offered by MultiWiBi embedded into a semantic model of predicates (SPred) which crucially leverages Wikipedia to generalize collections of related noun phrases to infer a probability distribution over expected semantic classes. We applied SPred to a word sense disambiguation task and show that, when MultiWiBi is plugged in to replace an internal component, SPred’s generalization power increases as well as its precision and recall. Finally, we also published MultiWiBi as linked data, a paradigm which fosters interoperability and interconnection among resources and tools through the publication of data on the Web, and developed a public interface which lets the users navigate through MultiWiBi’s taxonomic structure in a graphical, captivating manner.
Learning of a multilingual bitaxonomy of Wikipedia and its application to semantic predicates / Flati, Tiziano. - ELETTRONICO. - (2015 Oct 05).
Learning of a multilingual bitaxonomy of Wikipedia and its application to semantic predicates
FLATI, TIZIANO
05/10/2015
Abstract
The ability to extract hypernymy information on a large scale is becoming increasingly important in natural language processing, an area of the artificial intelligence which deals with the processing and understanding of natural language. While initial studies extracted this type of information from textual corpora by means of lexico-syntactic patterns, over time researchers moved to alternative, more structured sources of knowledge, such as Wikipedia. After the first attempts to extract is-a information fromWikipedia categories, a full line of research gave birth to numerous knowledge bases containing information which, however, is either incomplete or irremediably bound to English. To this end we put forward MultiWiBi, the first approach to the construction of a multilingual bitaxonomy which exploits the inner connection between Wikipedia pages and Wikipedia categories to induce a wide-coverage and fine-grained integrated taxonomy. A series of experiments show state-of-the-art results against all the available taxonomic resources available in the literature, also with respect to two novel measures of comparison. Another dimension where existing resources usually fall short is their degree of multilingualism. While knowledge is typically language agnostic, currently resources are able to extract relevant information only in languages providing highquality tools. In contrast, MultiWiBi does not leave any language behind: we show how to taxonomize Wikipedia in an arbitrary language and in a way that is fully independent of additional resources. At the core of our approach lies, in fact, the idea that the English version of Wikipedia can be linguistically exploited as a pivot to project the taxonomic information extracted from English to any other Wikipedia language in order to have a bitaxonomy in a second, arbitrary language; as a result, not only concepts which have an English equivalent are covered, but also those concepts which are not lexicalized in the source language. We also present the impact of having the taxonomized encyclopedic knowledge offered by MultiWiBi embedded into a semantic model of predicates (SPred) which crucially leverages Wikipedia to generalize collections of related noun phrases to infer a probability distribution over expected semantic classes. We applied SPred to a word sense disambiguation task and show that, when MultiWiBi is plugged in to replace an internal component, SPred’s generalization power increases as well as its precision and recall. Finally, we also published MultiWiBi as linked data, a paradigm which fosters interoperability and interconnection among resources and tools through the publication of data on the Web, and developed a public interface which lets the users navigate through MultiWiBi’s taxonomic structure in a graphical, captivating manner.File | Dimensione | Formato | |
---|---|---|---|
Tesi dottorato Flati
accesso aperto
Tipologia:
Tesi di dottorato
Licenza:
Creative commons
Dimensione
3.76 MB
Formato
Adobe PDF
|
3.76 MB | Adobe PDF |
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.