Performance evaluation of blocking methods for duplicate record detection
Other Title(s)
تقييم أداء أساليب التجزئة في اكتشاف السجلات المكررة
Dissertant
Thesis advisor
Committee Members
Shilbayah, Nidal F.
al-Umari, Ahmad H.
University
Middle East University
Faculty
Faculty of Information Technology
Department
Department of Computer Information Systems
University Country
Jordan
Degree
Master
Degree Date
2010
English Abstract
Duplicate record detection is the process of identifying pairs of records in one or more datasets that refer to the same real-world entity (e.g. a patient or a customer), where the individual records might be erroneous or incomplete.
In addition, there is no unique identifying key for these entities that would allow their records to be identified directly as duplicates.
A main challenge in detecting duplicate records is the complexity of the detection process: potentially, each record in a dataset has to be compared with every other record in the same dataset or in another dataset, so the number of record pair comparisons grows quadratically with the number of records to be compared.
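As a rough illustration of this quadratic growth, the comparison counts can be written out as follows. The symbols n and m (the numbers of records in the datasets) and the names C_dedup and C_link are introduced here for illustration only and are not taken from the thesis.

```latex
\[
C_{\text{dedup}}(n) = \binom{n}{2} = \frac{n(n-1)}{2},
\qquad
C_{\text{link}}(n, m) = n \cdot m
\]
```

For example, deduplicating a single dataset of n = 10,000 records without any blocking would require 49,995,000 pairwise comparisons.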
A large variety of methods, collectively known as blocking methods, have been developed to deal with this quadratic complexity problem.
Blocking methods reduce the number of potential record pair comparisons by partitioning the datasets into a set of mutually exclusive blocks or clusters using a blocking key (i.e. a single record attribute or a combination of attributes).
All records sharing the same blocking key value will be placed in the same block and only records within a block will be compared.
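The standard blocking idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the framework evaluated in the thesis: the record fields (`surname`, `postcode`), the blocking key definition, and the function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical blocking key: first three letters of the surname
    # combined with the postcode (an assumption, not the thesis's key).
    return record["surname"][:3].lower() + record["postcode"]


def candidate_pairs(records, key_func):
    """Group record ids by blocking key value and return the record
    pairs that share a block; only these pairs would be compared."""
    blocks = defaultdict(list)
    for rec_id, record in records.items():
        blocks[key_func(record)].append(rec_id)

    pairs = set()
    for ids in blocks.values():
        # Every unordered pair of records inside the same block.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up sample records for illustration only.
records = {
    1: {"surname": "Smith", "postcode": "2600"},
    2: {"surname": "Smith", "postcode": "2600"},
    3: {"surname": "Smyth", "postcode": "2600"},
    4: {"surname": "Jones", "postcode": "2617"},
    5: {"surname": "Jones", "postcode": "2617"},
}

pairs = candidate_pairs(records, blocking_key)
total_pairs = len(records) * (len(records) - 1) // 2
print(sorted(pairs), "out of", total_pairs, "possible pairs")
```

In this toy data, record 3 ("Smyth") lands in a different block from records 1 and 2, so a true duplicate with a typing variation in the key would never be compared. Sensitivity to errors in the blocking key value is the kind of weakness that alternative blocking methods aim to reduce.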
In this thesis, we experimentally compare and evaluate two recently developed blocking methods, sorted blocks and the standard suffix array together with its improvements, against two older methods, standard blocking and sorted neighborhood blocking, within a common framework, with regard to the quality of the candidate record pairs they generate.
The experimental results on a synthetic dataset show that the sorted neighborhood blocking method outperforms standard blocking, and that sorted blocks in turn slightly outperforms sorted neighborhood blocking in terms of accuracy.
Our experimental results also show that the accuracy of the improved suffix array method is much higher than that of the standard suffix array, and that standard blocking can be dramatically improved by using the standard suffix array and its improvements.
Main Subjects
Information Technology and Computer Science
No. of Pages
99
Table of Contents
Table of contents.
Abstract.
Abstract in Arabic.
Chapter One : Introduction.
Chapter Two : Literature review.
Chapter Three : Duplicate detection framework design.
Chapter Four : Experimental evaluation.
Chapter Five : Conclusions and future work.
References.
American Psychological Association (APA)
al-Dumur, Raid Mahmud. (2010). Performance evaluation of blocking methods for duplicate record detection (Master's thesis). Middle East University, Jordan.
https://search.emarefa.net/detail/BIM-694872
Modern Language Association (MLA)
al-Dumur, Raid Mahmud. Performance evaluation of blocking methods for duplicate record detection. Master's thesis. Middle East University, 2010.
https://search.emarefa.net/detail/BIM-694872
American Medical Association (AMA)
al-Dumur, Raid Mahmud. (2010). Performance evaluation of blocking methods for duplicate record detection (Master's thesis). Middle East University, Jordan.
https://search.emarefa.net/detail/BIM-694872
Language
English
Data Type
Arab Theses
Record ID
BIM-694872