A comparative study of duplicate record detection techniques

Other titles

دراسة مقارنة لطرق اكتشاف السجلات المكررة

Thesis author

Aqil, Usamah Hilmi

Thesis supervisor

Aqil, Misbah Jumah

Committee members

Uwayyid, Husayn H.
Slait, Azzam Talal

University

Middle East University

Faculty

Faculty of Information Technology

Academic department

Department of Computer Science

University country

Jordan

Degree

Master's

Degree date

2012

English abstract

Duplicate record detection is the process of identifying pairs of records in one or more datasets that correspond to the same real-world entity (e.g., a patient or a customer).

Although many techniques have been proposed over the years for detecting approximate duplicate records in a database, very few studies compare the effectiveness of these techniques.

This study compares two decision models proposed for detecting duplicate records in a given dataset, the Rule-based technique and the Probabilistic-based technique, and evaluates their advantages and disadvantages.

Another aim is to design a generic framework to solve the problem of duplicate record detection.

Finally, the study evaluates the performance of the major decision models used in the record matching stage.

Current duplicate record detection techniques fall into two main categories: techniques that rely on domain knowledge or distance metrics, and techniques that rely on training data.

This study concentrates on comparing the Rule-based technique from the first category with the Probabilistic-based technique from the second category.
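To make the first category concrete, the following is a minimal sketch of a rule-based matcher, not the rules used in the thesis. The field names (`given_name`, `surname`, `date_of_birth`, `postcode`), the 0.85 threshold, and the choice of `difflib` as the string metric are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; difflib is an illustrative choice of metric."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_duplicate(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
    """Hypothetical hand-written rule: both names are similar AND
    either the date of birth or the postcode agrees exactly."""
    names_close = (
        similarity(r1["given_name"], r2["given_name"]) >= threshold
        and similarity(r1["surname"], r2["surname"]) >= threshold
    )
    key_agrees = (
        r1["date_of_birth"] == r2["date_of_birth"]
        or r1["postcode"] == r2["postcode"]
    )
    return names_close and key_agrees


# Made-up example records (not taken from any real dataset).
a = {"given_name": "Jonathan", "surname": "Smith",
     "date_of_birth": "1980-02-11", "postcode": "2600"}
b = {"given_name": "Jonathon", "surname": "Smith",
     "date_of_birth": "1980-02-11", "postcode": "2601"}
c = {"given_name": "Maria", "surname": "Lopez",
     "date_of_birth": "1975-07-30", "postcode": "2600"}

print(is_duplicate(a, b), is_duplicate(a, c))
```

A rule like this encodes domain knowledge directly, which is why it needs no training data but must be rewritten for each new kind of record.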

For the Probabilistic-based technique, instead of relying on training data, we employed the Expectation Maximization (EM) algorithm to find maximum likelihood estimates of parameters in the probabilistic models.
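As an illustration of how EM can estimate such parameters without labeled pairs, here is a minimal sketch under stated assumptions: each candidate record pair is reduced to a binary agreement pattern (1 if a field agrees, 0 otherwise), and fields are conditionally independent given match status, as in the classical Fellegi-Sunter model. The record does not describe the thesis's actual model, so the function name, initial values, and toy data below are hypothetical.

```python
from typing import List, Sequence, Tuple


def em_match_parameters(
    patterns: Sequence[Sequence[int]],
    n_iter: int = 50,
    p: float = 0.1,        # assumed initial proportion of matching pairs
    m_init: float = 0.9,   # assumed initial P(field agrees | match)
    u_init: float = 0.1,   # assumed initial P(field agrees | non-match)
) -> Tuple[float, List[float], List[float]]:
    """Estimate p, m and u by EM from unlabeled binary agreement patterns."""
    n, k = len(patterns), len(patterns[0])
    m, u = [m_init] * k, [u_init] * k
    eps = 1e-6  # keeps probabilities away from exactly 0 and 1

    def clamp(x: float) -> float:
        return min(max(x, eps), 1.0 - eps)

    for _ in range(n_iter):
        # E-step: posterior probability that each pair is a match.
        g = []
        for gamma in patterns:
            pm, pu = p, 1.0 - p
            for i, agrees in enumerate(gamma):
                pm *= m[i] if agrees else 1.0 - m[i]
                pu *= u[i] if agrees else 1.0 - u[i]
            g.append(pm / (pm + pu))

        # M-step: re-estimate parameters from the soft assignments.
        total = sum(g)
        p = clamp(total / n)
        for i in range(k):
            agree_m = sum(gj * gam[i] for gj, gam in zip(g, patterns))
            agree_u = sum((1.0 - gj) * gam[i] for gj, gam in zip(g, patterns))
            m[i] = clamp(agree_m / total)
            u[i] = clamp(agree_u / (n - total))
    return p, m, u


# Toy agreement patterns over three fields (made up for illustration).
toy = (
    [[1, 1, 1]] * 8 + [[1, 1, 0]] * 2      # pairs that look like matches
    + [[0, 0, 0]] * 70 + [[1, 0, 0]] * 10  # pairs that look like non-matches
    + [[0, 0, 1]] * 10
)
p_hat, m_hat, u_hat = em_match_parameters(toy)
print(p_hat, m_hat, u_hat)
```

Once `m` and `u` are estimated, each pair can be scored by summing per-field log-likelihood ratios and compared against a threshold, which plays the role that hand-written rules play in the Rule-based technique.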

Experimental results were obtained on synthetic datasets, called FEBRL, that contain patient data of different sizes and error characteristics.

These results show that the Probabilistic-based technique employing the EM algorithm yields better results than the Rule-based technique.

Main subjects

Information Technology and Computer Science

Number of pages

76

Table of contents

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One : Introduction.

Chapter Two : Literature review and related studies.

Chapter Three : Duplicate detection framework design and implementation.

Chapter Four : Experimental evaluation.

Chapter Five : Conclusion and future work.

References.

American Psychological Association (APA) citation style

Aqil, Usamah Hilmi. (2012). A comparative study of duplicate record detection techniques. (Master's thesis). Middle East University, Jordan
https://search.emarefa.net/detail/BIM-698655

Modern Language Association (MLA) citation style

Aqil, Usamah Hilmi. A comparative study of duplicate record detection techniques. (Master's thesis). Middle East University. (2012).
https://search.emarefa.net/detail/BIM-698655

American Medical Association (AMA) citation style

Aqil, Usamah Hilmi. (2012). A comparative study of duplicate record detection techniques. (Master's thesis). Middle East University, Jordan
https://search.emarefa.net/detail/BIM-698655

Text language

English

Data type

Theses

Record number

BIM-698655