Research Products Catalog

Towards comprehensive and efficient information extraction across languages / Tedeschi, Simone. - (2025 Jan 24).

Towards comprehensive and efficient information extraction across languages

TEDESCHI, SIMONE

24/01/2025

Abstract

The exponential growth of textual data shared online has created an urgent need for methods that can effectively extract, structure, and interpret information from vast and varied texts. Information Extraction (IE), a key area within Natural Language Processing (NLP), addresses this need by transforming unstructured text into structured formats, enabling automated text analytics and decision-making. However, existing IE systems face substantial challenges in scalability and generalization. These challenges include limited labeled data for low-resource languages, computational demands that restrict accessibility to well-resourced institutions, and a predominant focus on popular entities. Additionally, most IE tasks are entity-centric (e.g., Named Entity Recognition (NER), Entity Disambiguation, and Relation Extraction), thus overlooking the broader contextual richness present in many texts. This thesis aims to advance the field of IE by tackling these critical issues through novel resources, methodologies, and theoretical approaches that foster a multilingual, scalable, and semantically enriched IE framework. To bridge the multilingual gap, we leverage a combination of neural and knowledge-based approaches and create multilingual datasets for NER and Relation Extraction, ensuring that IE systems can operate effectively across diverse linguistic settings. On the computational front, we propose optimizations designed to reduce the resource requirements of IE models, especially in the context of Entity Disambiguation, enabling broader adoption of NLP technologies by reducing dependence on high-performance hardware and extensive labeled datasets. Additionally, this work challenges traditional IE frameworks by expanding the focus beyond named entities to encompass abstract concepts, idiomatic expressions, and tail entities, which are essential for a more nuanced and comprehensive understanding of texts.
Through these contributions, this research aims to establish a robust foundation for multilingual, resource-efficient IE systems that can meet the evolving demands of global text analytics across varied languages, domains, and cultural contexts. Finally, to further encourage the use and development of multilingual IE systems, we publicly release all the artifacts (datasets and models) introduced in this thesis.

Defense date

24 Jan 2025

Belongs to type:

07a Doctoral Thesis

Files attached to this product

File	Size	Format
Tesi_dottorato_Tedeschi.pdf (open access; note: complete thesis; type: doctoral thesis; license: All rights reserved)	4.96 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/1733001
