Performance evaluation of blocking methods for duplicate record detection
Other titles
تقييم أداء أساليب التجزئة في اكتشاف السجلات المكررة
Thesis author
Thesis supervisor
Committee members
Shilbayah, Nidal F.
al-Umari, Ahmad H.
University
Middle East University
Faculty
Faculty of Information Technology
Academic department
Department of Computer Information Systems
University country
Jordan
Degree
Master's
Degree date
2010
English abstract
Duplicate record detection is the process of identifying pairs of records in one or more datasets that refer to the same real-world entity (e.g. a patient or customer), where the individual records may be erroneous or incomplete.
In addition, there is no unique identifying key for these entities that would allow duplicates to be identified directly.
A main challenge in detecting duplicate records is the complexity of the detection process: potentially, each record in a dataset has to be compared with all records in the same dataset or in another dataset, so the number of record pair comparisons grows quadratically with the number of records to be compared.
A large variety of methods, collectively known as blocking methods, have been developed to deal with this quadratic complexity problem.
Blocking methods reduce the number of potential record pair comparisons by partitioning the datasets into a set of mutually exclusive blocks or clusters using a blocking key (i.e. a single record attribute or a combination of attributes).
All records sharing the same blocking key value are placed in the same block, and only records within a block are compared.
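The partitioning idea described above can be illustrated with a minimal Python sketch. It is not taken from the thesis: the function name `standard_blocking`, the sample records, and the choice of the first two letters of the surname as the blocking key are all assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def standard_blocking(records, blocking_key):
    """Return candidate record-id pairs that share a blocking key value.

    `records` maps a record id to a dict of attributes; `blocking_key`
    is a function from a record to its key value (both hypothetical).
    """
    # Partition record ids into mutually exclusive blocks by key value.
    blocks = defaultdict(list)
    for rec_id, rec in records.items():
        blocks[blocking_key(rec)].append(rec_id)

    # Only records inside the same block become candidate pairs.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Illustrative data, not from the thesis.
records = {
    1: {"surname": "smith", "city": "amman"},
    2: {"surname": "smyth", "city": "amman"},
    3: {"surname": "jones", "city": "irbid"},
    4: {"surname": "smith", "city": "amman"},
}


def surname_prefix(rec):
    # Assumed blocking key: the first two letters of the surname.
    return rec["surname"][:2]


candidate_pairs = standard_blocking(records, surname_prefix)

# Without blocking, every record is compared with every other one.
all_pairs = len(records) * (len(records) - 1) // 2
```

With these four records, blocking yields three candidate pairs instead of the six produced by comparing every record with every other one. The trade-off is that true duplicates whose key values differ (for example because of a typo in the first letters of the surname) fall into different blocks and are never compared.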
In this thesis, we experimentally compare and evaluate two recently developed blocking methods, sorted blocks and the standard suffix array with its improvements, against two older methods, standard blocking and sorted neighborhood blocking, within a common framework, with regard to the quality of the candidate record pairs they generate.
The experimental results on a synthetic dataset show that the sorted neighborhood blocking method outperforms standard blocking and that sorted blocks slightly outperforms it in terms of accuracy.
Our experimental results also show that the accuracy of the improved suffix array method is much higher than that of the standard suffix array, and that standard blocking can be dramatically improved using the standard suffix array and its improvements.
Main subjects
Information technology and computer science
Number of pages
99
Table of contents
Table of contents.
Abstract.
Abstract in Arabic.
Chapter One : Introduction.
Chapter Two : Literature review.
Chapter Three : Duplicate detection framework design.
Chapter Four : Experimental evaluation.
Chapter Five : Conclusions and future work.
References.
American Psychological Association (APA) citation style
al-Dumur, Raid Mahmud. (2010). Performance evaluation of blocking methods for duplicate record detection. (Master's thesis). Middle East University, Jordan
https://search.emarefa.net/detail/BIM-694872
Modern Language Association (MLA) citation style
al-Dumur, Raid Mahmud. Performance evaluation of blocking methods for duplicate record detection. (Master's thesis). Middle East University. (2010).
https://search.emarefa.net/detail/BIM-694872
American Medical Association (AMA) citation style
al-Dumur, Raid Mahmud. (2010). Performance evaluation of blocking methods for duplicate record detection. (Master's thesis). Middle East University, Jordan
https://search.emarefa.net/detail/BIM-694872
Text language
English
Data type
Theses and dissertations
Record number
BIM-694872