Performance evaluation of blocking methods for duplicate record detection

Other titles

تقييم أداء أساليب التجزئة في اكتشاف السجلات المكررة

Thesis author

al-Dumur, Raid Mahmud

Thesis supervisor

Aqil, Misbah M.

Committee members

Shilbayah, Nidal F.
al-Umari, Ahmad H.

University

Middle East University

Faculty

Faculty of Information Technology

Academic department

Department of Computer Information Systems

University country

Jordan

Degree

Master's

Degree date

2010

English abstract

Duplicate record detection is the process of identifying pairs of records in one or more datasets that refer to the same real-world entity (e.g., a patient or customer), where the records describing these entities might be erroneous and incomplete.

In addition, there is no unique identifying key for these entities that would allow duplicates to be identified directly.

A main challenge when detecting duplicate records is the complexity of the detection process: potentially, each record in a dataset has to be compared with all records in the same dataset or in another dataset, so the number of record pair comparisons grows quadratically with the number of records to be compared.
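To make the quadratic growth concrete, the following minimal Python sketch counts the comparisons needed without blocking. It assumes the usual counting: n(n-1)/2 unordered pairs when deduplicating one dataset of n records, and n·m pairs when linking two datasets of n and m records. The function names are illustrative and do not come from the thesis.

```python
def dedup_comparisons(n: int) -> int:
    """Unordered record pairs within a single dataset of n records."""
    return n * (n - 1) // 2


def linkage_comparisons(n: int, m: int) -> int:
    """Record pairs between two datasets of n and m records."""
    return n * m


if __name__ == "__main__":
    # A tenfold increase in records gives roughly a hundredfold
    # increase in comparisons.
    for n in (1_000, 10_000, 100_000):
        print(n, dedup_comparisons(n))
```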

A large variety of methods, collectively known as blocking methods, have been developed to deal with this quadratic complexity problem.

Blocking methods reduce the number of potential record pair comparisons by partitioning the datasets into a set of mutually exclusive blocks or clusters using a blocking key (i.e., a single record attribute or a combination of attributes).

All records sharing the same blocking key value will be placed in the same block and only records within a block will be compared.
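The blocking step described above can be sketched as follows. This is a minimal illustration of standard blocking under stated assumptions, not the framework used in the thesis: the record fields (`surname`, `postcode`), the choice of blocking key, and the function names are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    # Hypothetical key: first three letters of the surname plus the postcode.
    return record["surname"][:3].lower() + record["postcode"]


def candidate_pairs(records: dict) -> set:
    """Group record ids by blocking key and pair up ids within each block."""
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)

    pairs = set()
    for ids in blocks.values():
        # Only records that share a block are ever compared.
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Made-up sample records for illustration only.
records = {
    1: {"surname": "Smith", "postcode": "11118"},
    2: {"surname": "Smith", "postcode": "11118"},
    3: {"surname": "Smyth", "postcode": "11118"},
    4: {"surname": "Jones", "postcode": "11942"},
}

if __name__ == "__main__":
    print(candidate_pairs(records))
```

With these four sample records, exhaustive comparison would examine six pairs, while this key leaves a single candidate pair. The spelling variant "Smyth" lands in a different block, which shows why the choice of blocking key affects the quality of the candidate pairs.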

In this thesis, we experimentally compare and evaluate two recently developed blocking methods, sorted blocks and the standard suffix array with its improvements, against two older methods, standard blocking and sorted neighborhood blocking, within a common framework, with regard to the quality of the candidate record pairs they generate.

The experimental results on a synthetic dataset show that the sorted neighborhood blocking method outperforms standard blocking and that sorted blocks slightly outperforms it in terms of accuracy.

Our experimental results also show that the accuracy of the improved suffix array method is much higher than that of the standard suffix array, and that standard blocking can be dramatically improved using the standard suffix array and its improvements.

Main subjects

Information Technology and Computer Science

Number of pages

99

Table of contents

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One : Introduction.

Chapter Two : Literature review.

Chapter Three : Duplicate detection framework design.

Chapter Four : Experimental evaluation.

Chapter Five : Conclusions and future work.

References.

American Psychological Association (APA) citation style

al-Dumur, Raid Mahmud. (2010). Performance evaluation of blocking methods for duplicate record detection. (Master's thesis). Middle East University, Jordan
https://search.emarefa.net/detail/BIM-694872

Modern Language Association (MLA) citation style

al-Dumur, Raid Mahmud. Performance evaluation of blocking methods for duplicate record detection. (Master's thesis). Middle East University. (2010).
https://search.emarefa.net/detail/BIM-694872

American Medical Association (AMA) citation style

al-Dumur, Raid Mahmud. (2010). Performance evaluation of blocking methods for duplicate record detection. (Master's thesis). Middle East University, Jordan
https://search.emarefa.net/detail/BIM-694872

Text language

English

Data type

University theses

Record number

BIM-694872