Prediction of missing data technique to improve big data classification

Thesis author

Husayn, Huda

Thesis advisor

al-Harub, Aishah

University

Isra University

Faculty

Faculty of Information Technology

Department

Department of Software Engineering

University country

Jordan

Degree

Master's

Degree date

2020

English abstract

Designing early prediction systems for diabetes based on machine learning models is an emerging research area that is growing steadily as the number of diabetes cases increases worldwide.

Missing values in medical datasets in general, and in diabetes datasets in particular, are a problem for machine learning models and case studies.

An imputation method is therefore needed to estimate the missing values; this is a preprocessing step that should be carried out before the cases in the dataset are classified.
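
As a point of reference for what such a preprocessing step looks like, the following is a minimal sketch of column-mean imputation, a common statistical baseline. The abstract does not name the statistical techniques it compares against, so this is an illustrative assumption; the function name `impute_column_means` and the toy rows are hypothetical.

```python
from statistics import mean

def impute_column_means(rows):
    """Replace each missing entry (None) with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    col_means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        # Assumption: a column with no observed values falls back to 0.0.
        col_means.append(mean(observed) if observed else 0.0)
    return [[col_means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]

# Toy data (hypothetical): two features, two missing cells.
rows = [[1.0, 10.0], [None, 20.0], [3.0, None]]
filled = impute_column_means(rows)
```

The filled rows can then be passed to any classifier, which is the order of operations the abstract describes.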

In this study, a new imputation algorithm based on the Firefly Algorithm (FA) is proposed, called Imputation Algorithm based Firefly Algorithm (IFA).

To evaluate the proposed IFA algorithm, a classifier is needed as the fitness function: it returns the classification accuracy on the generated (imputed) dataset, which is the quantity to be maximized.
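
The idea of searching for imputed values with a firefly-style optimizer whose fitness is classification accuracy can be sketched as follows. This is not the thesis's IFA implementation: it uses the textbook firefly update (attractiveness `beta0 * exp(-gamma * r^2)` plus a random step), and a leave-one-out 1-nearest-neighbour accuracy stands in for the classifiers named in the abstract. All function names, parameter values, and the toy data are assumptions made for illustration.

```python
import math
import random

def loo_1nn_accuracy(X, y):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier (the fitness)."""
    correct = 0
    for i, xi in enumerate(X):
        nearest = min((k for k in range(len(X)) if k != i),
                      key=lambda k: math.dist(xi, X[k]))
        correct += y[nearest] == y[i]
    return correct / len(X)

def fill(X, holes, values):
    """Return a copy of X with each (row, col) hole set to its candidate value."""
    filled = [row[:] for row in X]
    for (r, c), v in zip(holes, values):
        filled[r][c] = v
    return filled

def firefly_impute(X, y, n_fireflies=8, n_iter=20,
                   beta0=1.0, gamma=0.01, alpha=0.2, seed=0):
    rng = random.Random(seed)
    holes = [(r, c) for r, row in enumerate(X)
             for c, v in enumerate(row) if v is None]
    # Search bounds for each hole: observed min/max of its column.
    bounds = []
    for _, c in holes:
        observed = [row[c] for row in X if row[c] is not None]
        bounds.append((min(observed), max(observed)))
    # Each firefly is one candidate vector of values for all missing cells.
    swarm = [[rng.uniform(lo, hi) for lo, hi in bounds]
             for _ in range(n_fireflies)]
    light = [loo_1nn_accuracy(fill(X, holes, s), y) for s in swarm]
    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if light[j] > light[i]:  # move i toward the brighter firefly j
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    swarm[i] = [
                        min(hi, max(lo, a + beta * (b - a)
                                    + alpha * (rng.random() - 0.5) * (hi - lo)))
                        for a, b, (lo, hi) in zip(swarm[i], swarm[j], bounds)
                    ]
                    light[i] = loo_1nn_accuracy(fill(X, holes, swarm[i]), y)
    best = max(range(n_fireflies), key=lambda i: light[i])
    return fill(X, holes, swarm[best]), light[best]

# Toy data (hypothetical): two classes, two missing cells marked None.
X = [[1.0, 2.0], [1.2, None], [8.0, 9.0], [None, 9.5], [0.9, 1.8], [8.3, 9.1]]
y = [0, 0, 1, 1, 0, 1]
filled, accuracy = firefly_impute(X, y)
```

Because a firefly moves only when a strictly brighter one exists, the brightest candidate is never displaced, so the best fitness in the swarm does not decrease between iterations.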

Therefore, the accuracy is obtained using three different classifiers: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Naïve Bayesian Classifier (NBC).

The Pima Indian Diabetes Disease (PIDD) dataset is the main dataset used in this study for estimating the missing values and evaluating IFA.

The proposed algorithm is evaluated using two types of experiments. The first experiment validates the generated datasets using k-fold cross-validation (K=5).
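
A minimal sketch of how a 5-fold split partitions sample indices is given here; the helper name `k_fold_indices`, the shuffling seed, and the sample count of 20 are illustrative assumptions, not details taken from the thesis.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each sample is in the test set exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

splits = list(k_fold_indices(20, k=5))
```

The reported accuracy for one run would then be the mean of the accuracies measured on the five test folds.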

In the second experiment, validation is done using holdout validation, where the generated dataset is divided into a training set (65%) and a testing set (35%).
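
The 65%/35% holdout division can be sketched in the same way; the helper name `holdout_split`, the seed, and the sample count of 100 are hypothetical.

```python
import random

def holdout_split(n_samples, train_fraction=0.65, seed=0):
    """Shuffle sample indices once and cut them into training and testing lists."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = round(n_samples * train_fraction)
    return idx[:cut], idx[cut:]

train_idx, test_idx = holdout_split(100)
```

Unlike k-fold cross-validation, each sample is used either for training or for testing, never both, so the estimate depends more strongly on the particular shuffle.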

The results showed that IFA-SVM ranked best based on the average of ten runs, while IFA-NBC ranked worst.

Moreover, IFA with all classifiers achieved better accuracies than the four popular techniques it was compared with, which shows that, in this study, an optimization algorithm used as an imputation algorithm outperforms the statistical methods.

In conclusion, the FA algorithm can be used for estimating missing values in PIDD and in medical datasets in general.

Main subjects

Information Technology and Computer Science

Number of pages

100

Table of contents

Table of contents.

Abstract.

Chapter One: Introduction.

Chapter Two: Background and related works.

Chapter Three: Proposed algorithm.

Chapter Four: Results analysis.

Chapter Five: Conclusion and future works.

References.

American Psychological Association (APA) citation style

Husayn, Huda. (2020). Prediction of missing data technique to improve big data classification. (Master's thesis). Isra University, Jordan
https://search.emarefa.net/detail/BIM-985125

Modern Language Association (MLA) citation style

Husayn, Huda. Prediction of missing data technique to improve big data classification. (Master's thesis). Isra University. (2020).
https://search.emarefa.net/detail/BIM-985125

American Medical Association (AMA) citation style

Husayn, Huda. (2020). Prediction of missing data technique to improve big data classification. (Master's thesis). Isra University, Jordan
https://search.emarefa.net/detail/BIM-985125

Text language

English

Data type

Theses

Record ID

BIM-985125