Performance evaluation of blocking methods for duplicate record detection

Other titles

تقييم أداء أساليب التجزئة في اكتشاف السجلات المكررة

University

Middle East University

Faculty

Faculty of Information Technology

Academic department

Department of Computer Information Systems

University country

Jordan

Degree

Master's

Degree date

2010

English abstract

Duplicate record detection is the process of identifying pairs of records in one or more datasets that refer to the same real-world entity (e.g., a patient or customer), where the records for these entities may be erroneous and incomplete.

In addition, there is no unique identifying key for these entities that would allow them to be identified directly as duplicates.

A main challenge when detecting duplicate records is the complexity of the detection process: potentially, each record in a dataset has to be compared with all records in the same dataset or in another dataset, so the number of record pair comparisons grows quadratically with the number of records to be compared.
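To make the quadratic growth concrete, the following minimal sketch counts the comparisons needed without any blocking. The function names are illustrative and do not come from the thesis; the counts are the standard pair-counting formulas, n(n-1)/2 within one dataset and n*m across two datasets.

```python
def dedup_comparison_count(n: int) -> int:
    """Unordered record pairs when one dataset of n records is compared with itself."""
    return n * (n - 1) // 2


def linkage_comparison_count(n: int, m: int) -> int:
    """Record pairs when every record of one dataset is compared with every record of another."""
    return n * m


# Growing the dataset by a factor of 10 grows the comparisons by roughly a factor of 100.
for size in (1_000, 10_000, 100_000):
    print(size, dedup_comparison_count(size))
```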

A large variety of methods, collectively known as blocking methods, have been developed to deal with this quadratic complexity problem.

Blocking methods reduce the number of potential record pair comparisons by partitioning the datasets into a set of mutually exclusive blocks or clusters using a blocking key (i.e., a single record attribute or a combination of attributes).

All records sharing the same blocking key value will be placed in the same block and only records within a block will be compared.
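The partitioning described above can be sketched as follows. This is a minimal illustration of standard blocking, not the framework evaluated in the thesis: the record fields (`surname`, `postcode`) and the blocking key definition are assumptions chosen only for the example.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record: dict) -> str:
    # Hypothetical key: first three letters of the surname plus the postcode.
    return record["surname"][:3].lower() + record["postcode"]


def candidate_pairs(records: list[dict]) -> set[tuple[int, int]]:
    """Return index pairs of records that share a blocking key value."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)

    pairs = set()
    for members in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(members, 2))
    return pairs


records = [
    {"surname": "Smith", "postcode": "2600"},
    {"surname": "Smithe", "postcode": "2600"},
    {"surname": "Smyth", "postcode": "2600"},
    {"surname": "Jones", "postcode": "2000"},
]
print(candidate_pairs(records))
```

With this assumed key, only the first two records fall into the same block, so a single candidate pair is generated instead of the six pairs a full comparison would need. The record "Smyth" lands in a different block and is never compared with "Smith", which illustrates how sensitive this kind of blocking is to errors in the key attributes.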

In this thesis, we experimentally compare and evaluate two recently developed blocking methods, sorted blocks and the standard suffix array with its improvements, against two older methods, standard blocking and sorted neighborhood blocking, within a common framework with regard to the quality of the candidate record pairs they generate.

The experimental results on a synthetic dataset show that the sorted neighborhood blocking method outperforms standard blocking, and that sorted blocks slightly outperforms it in terms of accuracy.

Our experimental results also show that the accuracy of the improved suffix array method is much higher than that of the standard suffix array, and that standard blocking can be dramatically improved by using the standard suffix array and its improvements.

Main subjects

Information Technology and Computer Science

Number of pages

Table of contents

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One: Introduction.

Chapter Two: Literature review.

Chapter Three: Duplicate detection framework design.

Chapter Four: Experimental evaluation.

Chapter Five: Conclusions and future work.

References.

American Psychological Association (APA) citation style

al-Dumur, Raid Mahmud. (2010). Performance evaluation of blocking methods for duplicate record detection. (Master's thesis). Middle East University, Jordan
https://search.emarefa.net/detail/BIM-694872

Modern Language Association (MLA) citation style

al-Dumur, Raid Mahmud. Performance evaluation of blocking methods for duplicate record detection. (Master's thesis). Middle East University. (2010).
https://search.emarefa.net/detail/BIM-694872

Text language

English

Data type

Theses

Record number

BIM-694872
