Performance evaluation of blocking methods for duplicate record detection

Other Title(s)

تقييم أداء أساليب التجزئة في اكتشاف السجلات المكررة

Dissertant

al-Dumur, Raid Mahmud

Thesis advisor

Aqil, Misbah M.

Committee Members

Shilbayah, Nidal F.
al-Umari, Ahmad H.

University

Middle East University

Faculty

Faculty of Information Technology

Department

Department of Computer Information Systems

University Country

Jordan

Degree

Master

Degree Date

2010

English Abstract

Duplicate record detection is the process of identifying pairs of records in one or more datasets that refer to the same real-world entity (e.g. a patient or customer), where the individual records may be erroneous or incomplete.

In addition, no unique identifying key exists for these entities that would allow them to be identified directly as duplicates.

A main challenge when detecting duplicate records is the complexity of the detection process: potentially, each record in a dataset has to be compared with all records in the same dataset or in another dataset, so the number of record pair comparisons grows quadratically with the number of records to be compared.
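The quadratic growth can be illustrated with a short calculation. The following is a minimal sketch, not taken from the thesis; the function names `dedup_comparisons` and `linkage_comparisons` are hypothetical and chosen only for illustration.

```python
def dedup_comparisons(n):
    """Number of unordered record pairs within one dataset of n records."""
    return n * (n - 1) // 2


def linkage_comparisons(n, m):
    """Number of record pairs between two datasets of n and m records."""
    return n * m


# Print the comparison count for increasingly large single datasets.
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} records -> {dedup_comparisons(n):>13,} comparisons")
```

Growing a dataset tenfold, from 1,000 to 10,000 records, raises the count from 499,500 to 49,995,000 comparisons, which is roughly a hundredfold increase.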

A large variety of methods, collectively known as blocking methods, have been developed to deal with this quadratic complexity problem.

Blocking methods reduce the number of potential record pair comparisons by partitioning the datasets into a set of mutually exclusive blocks or clusters using a blocking key (i.e. a single record attribute or a combination of attributes).

All records sharing the same blocking key value will be placed in the same block and only records within a block will be compared.
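The partitioning described above can be sketched as follows. This is an illustrative example under stated assumptions, not the framework implemented in the thesis: the record fields (`surname`, `postcode`), the sample data, and the blocking key definition (first three letters of the surname plus the postcode) are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: surname prefix combined with postcode.
    return record["surname"][:3].lower() + record["postcode"]


def standard_blocking(records, key_fn):
    """Group record ids into blocks by their blocking key value."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[key_fn(rec)].append(rec_id)
    return blocks


def candidate_pairs(blocks):
    """Generate record pairs only from records that share a block."""
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs


# Made-up sample records, for illustration only.
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smith", "postcode": "2600"},
    3: {"surname": "Smyth", "postcode": "2600"},
    4: {"surname": "Jones", "postcode": "2611"},
    5: {"surname": "Jones", "postcode": "2611"},
}

blocks = standard_blocking(records, blocking_key)
pairs = candidate_pairs(blocks)
total = len(records) * (len(records) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {total}")
```

With this sample data, only the pairs (1, 2) and (4, 5) are compared, instead of all 10 possible pairs. Record 3 ("Smyth") lands in a block of its own and is never compared with records 1 and 2, which shows how an error in a blocking key attribute can cause a true duplicate to be missed.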

In this thesis, we experimentally compare and evaluate two recently developed blocking methods, sorted blocks and the standard suffix array with its improvements, against two older methods, standard blocking and sorted neighborhood blocking, within a common framework, with regard to the quality of the candidate record pairs they generate.

The experimental results on a synthetic dataset show that the sorted neighborhood blocking method outperforms standard blocking, and that sorted blocks slightly outperforms it in terms of accuracy.

Our experimental results also show that the accuracy of the improved suffix array method is much higher than that of the standard suffix array, and that standard blocking can be dramatically improved using the standard suffix array and its improvements.

Main Subjects

Information Technology and Computer Science

No. of Pages

99

Table of Contents

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One : Introduction.

Chapter Two : Literature review.

Chapter Three : Duplicate detection framework design.

Chapter Four : Experimental evaluation.

Chapter Five : Conclusions and future work.

References.

American Psychological Association (APA)

al-Dumur, Raid Mahmud. (2010). Performance evaluation of blocking methods for duplicate record detection. (Master's thesis). Middle East University, Jordan
https://search.emarefa.net/detail/BIM-694872

Modern Language Association (MLA)

al-Dumur, Raid Mahmud. Performance evaluation of blocking methods for duplicate record detection. (Master's thesis). Middle East University. (2010).
https://search.emarefa.net/detail/BIM-694872

American Medical Association (AMA)

al-Dumur, Raid Mahmud. (2010). Performance evaluation of blocking methods for duplicate record detection. (Master's thesis). Middle East University, Jordan
https://search.emarefa.net/detail/BIM-694872

Language

English

Data Type

Arab Theses

Record ID

BIM-694872