In this paper, we discuss the application of the concept of data quality to big data, highlighting how complex it is to define in a general way. Data quality is already a multidimensional concept, difficult to characterize in precise definitions even in the case of well-structured data. Big data add two further dimensions of complexity: (i) they are highly source specific, for which we adopt the UNECE classification, and (ii) they are highly unstructured and schema-less, often with gold standards that are either missing or very difficult to access. After providing a tutorial on data quality in traditional contexts, we analyze big data by providing insights into the UNECE classification; then, for each type of data source, we choose a specific instance of that type (notably deep Web data, sensor-generated data, and Twitter/short texts) and discuss how quality dimensions can be defined in these cases. The overall aim of the paper is therefore to identify further research directions in the area of big data quality, while also providing an up-to-date state of the art on data quality. © 2015, The Author(s).

On the Meaningfulness of “Big Data Quality” (Invited Paper) / Firmani, Donatella; Mecella, Massimo; Scannapieco, Monica; Batini, Carlo. - In: DATA SCIENCE AND ENGINEERING. - ISSN 2364-1185. - 1:1(2016), pp. 6-20. [10.1007/s41019-015-0004-7]

On the Meaningfulness of “Big Data Quality” (Invited Paper)

Firmani, Donatella; Mecella, Massimo; Scannapieco, Monica; Batini, Carlo
2016

Data quality; Big data; Quality dimensions; Information quality
01 Journal publication::01a Journal article
Files attached to this item
File: Firmani_On-The-Meaningfulness_2016.pdf
Access: open access
Type: Publisher's version (published version with the publisher's layout)
License: Creative Commons
Size: 770.32 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11573/908569
Citations
  • PMC: n/a
  • Scopus: 76
  • ISI: 41