A significant portion of the information collected by enterprises and organizations resides in text documents and is thus inherently unstructured. Turning it into a structured form is the aim of Information Extraction (IE). Depending on the approach, the output of an IE process can fill forms, populate relational tables, or be presented through an ontology. This last approach, known in the literature as Ontology Based Information Extraction (OBIE), is particularly interesting, since ontologies can facilitate integration with other corporate and external data and enable data management and governance at an abstract, conceptual level. However, although OBIE has been the subject of several investigations, how to exploit the reasoning abilities offered by an ontology to improve the extraction process has not yet been specifically studied. This thesis is intended to be a first step in that direction. Starting from our experience in implementing OBIE systems with open-source technologies, and with the intent of addressing the weaknesses we encountered, we propose a formal framework for OBIE, called Ontology Based Document Spanning (OBDS). We devise our proposal by revisiting the Ontology Based Data Access (OBDA) paradigm, a sophisticated form of semantic data integration over relational databases, and by building on the work on Document Spanners, a recent formal study of rule-based information extraction that follows database principles. The reasoning service of main interest in OBDS, as usual in ontology-based data management approaches, is Query Answering (QA). We analyze this service in different settings and propose algorithms for QA in the spirit of OBDA. In this setting, we show how the ontology plays a major role by mediating the extraction of information from text.
To demonstrate the applicability of our approach in practice, we present Mastro System-T, an OBDS tool that we have implemented using robust industrial technologies and tested on large document datasets. Finally, we give a formal treatment of the Entity Resolution (ER) problem, which recurs in the OBIE context, as it does in information integration approaches in general.
Ontology-based information extraction experiences, framework, algorithms and tools / Scafoglieri, Federico. - (2021 Sep 17).
Ontology-based information extraction experiences, framework, algorithms and tools
Scafoglieri, Federico
17/09/2021
File | Size | Format | Details |
---|---|---|---|
Tesi_dottorato_Scafoglieri.pdf | 4.76 MB | Adobe PDF | Open access. Note: complete thesis. Type: doctoral thesis. License: Creative Commons. |
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.