On probabilistic record linkage: New methods compared to the Fellegi-Sunter approach / Zardetto, D.; Scannapieco, M.; Valentino, L.; Catarci, Tiziana. - Print. - (2011), pp. 21-32. (Paper presented at the 19th Italian Symposium on Advanced Database Systems, SEBD 2011, held in Maratea, Italy, 26-29 June 2011).
On probabilistic record linkage: New methods compared to the Fellegi-Sunter approach
CATARCI, Tiziana
2011
Abstract
Record Linkage (RL) aims to identify pairs of records that come from different sources and represent the same real-world entity. Several methods have been proposed to address RL problems, and many independent software implementations of the traditional methods exist. However, none of the available systems seems to be both fully automated and highly effective. In this paper we describe and test a new RL system that combines both properties: MAERLIN. MAERLIN implements a novel suite of RL methods based on mixture models. These methods allow the system to obtain accurate and reliable results without relying on domain knowledge, and therefore without compromising automation. The system adopts a two-component Beta mixture model and finds Maximum Likelihood estimates of the mixture parameters by means of an original Perturbative Fitting technique. It then obtains a probabilistic clustering of record pairs into Matches and Unmatches by finding optimal classification rules under arbitrary matching constraints through a purpose-built Evolutionary Algorithm. We first provide an overview of the MAERLIN system. We then describe the RELAIS toolkit, which includes a state-of-the-art implementation of the traditional Fellegi-Sunter method for probabilistic RL. Finally, we provide an extensive experimental analysis comparing MAERLIN to RELAIS, with several experiments on challenging real-world RL instances from distinct application domains. The results show that the methods underlying MAERLIN are effective and robust, and the comparative evaluation also reveals interesting findings.
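To make the idea of a two-component Beta mixture more concrete, here is a minimal sketch in Python. It is not MAERLIN's implementation: the abstract does not say which quantity is modeled, so the sketch assumes each record pair is summarized by a similarity score in the open interval (0, 1). The function names and all parameter values are hypothetical. The parameters are fixed by hand rather than estimated, so neither the Perturbative Fitting technique nor the Evolutionary Algorithm is reproduced. The sketch only shows how, once mixture parameters are available, Bayes' rule yields the posterior probability that a pair is a Match.

```python
import math


def beta_pdf(x, a, b):
    """Density of a Beta(a, b) distribution at x, for 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


def match_posterior(x, p_match, match_params, unmatch_params):
    """Posterior probability that a pair with score x belongs to the Match component.

    p_match        -- mixing weight (prior probability) of the Match component
    match_params   -- (a, b) shape parameters of the Match Beta component
    unmatch_params -- (a, b) shape parameters of the Unmatch Beta component
    """
    m = p_match * beta_pdf(x, *match_params)
    u = (1 - p_match) * beta_pdf(x, *unmatch_params)
    return m / (m + u)


# Hypothetical, hand-picked parameters (assumptions for illustration only):
# Matches are rare and concentrate near high scores, Unmatches near low scores.
P_MATCH = 0.05
MATCH = (8.0, 2.0)
UNMATCH = (2.0, 8.0)

if __name__ == "__main__":
    for score in (0.2, 0.5, 0.9):
        print(score, match_posterior(score, P_MATCH, MATCH, UNMATCH))
```

In this sketch a pair would be labeled a Match when its posterior exceeds some threshold. The abstract describes a richer step, in which classification rules are optimized under arbitrary matching constraints, and that step is not modeled here.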