
Neural-grounded semantic representations and word sense disambiguation: a mutually beneficial relationship

Iacobacci, Ignacio Javier
28/02/2019

Abstract

Language, in both its written and oral forms, is the foundation of life in society. All languages share the same basic kinds of rules and representations, and understanding those rules is the objective of Natural Language Processing (NLP), the computational discipline devoted to analyzing and generating language. Building complex computational systems that process human language and are able to interact and collaborate with us is the holy grail of the field.

Semantic representations are the foundation on which many successful NLP applications depend. Their main purpose is to extract and highlight the most important semantic features of textual data. While different approaches have been proposed over the years, embeddings have lately become the dominant paradigm for the vector representation of linguistic items, and many state-of-the-art NLP systems rely on them. Embeddings are semantic spaces that carry valuable syntactic and semantic information; the name covers a family of neural feature-learning techniques that represent words as low-dimensional continuous vectors. These spaces also preserve the structure of language, encoding diverse lexical and semantic relations as relation-specific vector offsets. With the increasing amount of available text and ever greater computing power, techniques that exploit large volumes of unstructured data, such as word embeddings, have become the prevailing approach to the semantic representation of natural language.

Despite their enormous success, however, common word-embedding approaches come with two inherent flaws: they cannot handle ambiguity, since the senses of polysemous words are conflated into a single vector; and most of them rely solely on statistical information about word occurrences, leaving aside the rich knowledge available in structured resources. To tackle the problem of polysemy, a fundamental NLP task, Word Sense Disambiguation (WSD), seems particularly suitable. This task, still an open problem in the discipline, aims at identifying the correct meaning of a word given its context; concretely, it links each word occurrence to a sense from a predefined inventory. The most successful WSD approaches combine unstructured data, manually annotated datasets and semantic resources.

In the present thesis we address the issue of ambiguity in semantic representations from a multimodal perspective. Firstly, we introduce and investigate new neural approaches for building better word and sense embeddings that rely on both statistical data and prior semantic knowledge. We employ various WSD techniques to link word occurrences in large raw corpora to their correct meanings, and then use the resulting data as training input for learning the embeddings. We demonstrate the quality of these representations by evaluating them on standard semantic similarity frameworks, reporting state-of-the-art performance on multiple datasets. Secondly, we show how these representations can be used to build better WSD systems: we introduce a new way of leveraging word representations that outperforms current WSD approaches in both supervised and unsupervised configurations, and we show that our WSD framework, based solely on embeddings, surpasses WSD approaches based on standard features. Thirdly, we propose two new techniques for leveraging sense-annotated data, incorporating further semantic features and improving performance over our initial approaches. We close the loop by showing that our semantic representations, enhanced with WSD, are in turn suitable for improving WSD itself.
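To make the relation-specific vector offsets mentioned in the abstract concrete, the following minimal sketch completes a word analogy by vector arithmetic. The four-dimensional vectors are hand-picked toy values standing in for real pretrained embeddings, which are learned from corpora and have hundreds of dimensions; the numbers are purely illustrative assumptions, not the thesis's embeddings.

    import numpy as np

    # Toy vectors standing in for pretrained word embeddings
    # (hypothetical values chosen so the analogy works out).
    emb = {
        "king":  np.array([0.8, 0.7, 0.1, 0.9]),
        "queen": np.array([0.8, 0.7, 0.9, 0.1]),
        "man":   np.array([0.2, 0.1, 0.1, 0.9]),
        "woman": np.array([0.2, 0.1, 0.9, 0.1]),
        "apple": np.array([0.1, 0.9, 0.5, 0.5]),  # distractor word
    }

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # A relation-specific offset: king - man encodes (roughly) the same
    # relation as queen - woman, so completing the analogy reduces to
    # simple vector arithmetic followed by a nearest-neighbor lookup.
    target = emb["king"] - emb["man"] + emb["woman"]
    candidates = (w for w in emb if w not in {"king", "man", "woman"})
    print(max(candidates, key=lambda w: cosine(emb[w], target)))  # -> queen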
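The abstract describes WSD as linking each word occurrence to a sense from a predefined inventory. A classic knowledge-based baseline that makes this concrete is the simplified Lesk algorithm, sketched below over WordNet via NLTK. This is a standard textbook method shown only to illustrate the task; it is not one of the thesis's WSD systems.

    import nltk
    from nltk.corpus import wordnet as wn

    nltk.download("wordnet", quiet=True)  # one-time download of the sense inventory

    def simplified_lesk(word, context):
        """Pick the WordNet sense whose gloss overlaps most with the context."""
        context_words = set(context.lower().split())
        best, best_overlap = None, -1
        for sense in wn.synsets(word):
            gloss_words = set(sense.definition().lower().split())
            overlap = len(gloss_words & context_words)  # crude: no stopword removal
            if overlap > best_overlap:
                best, best_overlap = sense, overlap
        return best

    sense = simplified_lesk("bank", "he deposited money in his account at the bank")
    print(sense.name(), "-", sense.definition())  # expect the financial sense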
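Finally, the annotate-then-train recipe described in the abstract's first contribution can be sketched end to end: run WSD over a corpus, replace ambiguous words with sense identifiers, train an ordinary skip-gram model over the tagged tokens, and disambiguate new occurrences by comparing averaged context vectors against the sense vectors. The toy corpus, the word%sense tag format and all hyperparameters below are illustrative assumptions, not the thesis's actual pipeline or data.

    import numpy as np
    from gensim.models import Word2Vec

    # A toy corpus that a WSD step has already sense-annotated: ambiguous
    # words carry made-up sense identifiers, unambiguous words stay plain.
    corpus = [
        "he deposited cash at the bank%finance".split(),
        "the bank%finance approved the loan".split(),
        "she sat on the bank%river of the stream".split(),
        "fish swim near the bank%river at dawn".split(),
    ] * 50  # repeat the toy sentences so the model has enough examples

    # Ordinary skip-gram over sense-tagged tokens: each sense now gets its
    # own vector, so the two meanings of "bank" are no longer conflated.
    model = Word2Vec(corpus, vector_size=25, window=3, min_count=1,
                     sg=1, epochs=30, seed=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def disambiguate(context, senses):
        """Nearest-sense WSD: choose the sense closest to the averaged context."""
        ctx = np.mean([model.wv[w] for w in context if w in model.wv], axis=0)
        return max(senses, key=lambda s: cosine(model.wv[s], ctx))

    senses = ["bank%finance", "bank%river"]
    print(disambiguate(["loan", "cash"], senses))    # expected: bank%finance
    print(disambiguate(["fish", "stream"], senses))  # expected: bank%river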
Files attached to this item
Tesi_dottorato_Iacobacci.pdf (Adobe PDF, 1.08 MB)
Type: Doctoral thesis
License: Creative Commons
Access: open access

Use this identifier to cite or link to this item: https://hdl.handle.net/11573/1304526