Prediction of missing data technique to improve big data classification

Dissertant

Husayn, Huda

Thesis advisor

al-Harub, Aishah

University

Isra University

Faculty

Faculty of Information Technology

Department

Department of Software Engineering

University Country

Jordan

Degree

Master

Degree Date

2020

English Abstract

Designing machine-learning-based early prediction systems for diabetes is an emerging research area, and it is growing as the number of diabetes cases increases worldwide.

Missing values in medical datasets in general, and in diabetes datasets in particular, are a problem for machine learning models and case studies.

An imputation method is needed to estimate the missing values; it is a preprocessing step that should be carried out before the cases in the dataset are classified.

In this study, a new imputation algorithm based on the Firefly Algorithm (FA) is proposed, called the Imputation Algorithm based on Firefly Algorithm (IFA).

To evaluate the proposed IFA, a classifier is used as the fitness function: it returns the classification accuracy on the generated (imputed) dataset, and this accuracy is the quantity to be maximized.
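The idea of using classification accuracy as the firefly fitness can be illustrated with a minimal sketch. It is not the thesis implementation: the helper names (`loo_1nn_accuracy`, `fill`, `firefly_impute`), the parameter values, the toy data, and the use of leave-one-out 1-nearest-neighbour accuracy as a stand-in for the KNN, SVM, and NBC classifiers are all assumptions made for illustration. Each firefly holds one candidate value per missing cell, and dimmer fireflies move toward brighter ones using the standard FA update.

```python
import math
import random

def loo_1nn_accuracy(rows, labels):
    """Leave-one-out 1-nearest-neighbour accuracy (assumed stand-in fitness)."""
    correct = 0
    for i, x in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: math.dist(x, rows[j]),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)

def fill(rows, missing, candidate):
    """Return a copy of rows with each missing cell set to a candidate value."""
    filled = [list(r) for r in rows]
    for (r, c), value in zip(missing, candidate):
        filled[r][c] = value
    return filled

def firefly_impute(rows, labels, bounds, n_fireflies=8, n_iter=20,
                   beta0=1.0, gamma=1.0, alpha=0.1, seed=0):
    """Search for imputed values that maximize the classifier-based fitness."""
    rng = random.Random(seed)
    lo, hi = bounds
    # Missing cells are marked with None (an assumption of this sketch).
    missing = [(r, c) for r, row in enumerate(rows)
               for c, v in enumerate(row) if v is None]

    def fitness(candidate):
        return loo_1nn_accuracy(fill(rows, missing, candidate), labels)

    swarm = [[rng.uniform(lo, hi) for _ in missing] for _ in range(n_fireflies)]
    light = [fitness(s) for s in swarm]
    best_i = max(range(n_fireflies), key=light.__getitem__)
    best, best_fit = list(swarm[best_i]), light[best_i]

    for _ in range(n_iter):
        for i in range(n_fireflies):
            for j in range(n_fireflies):
                if light[j] > light[i]:
                    # Attractiveness decays with squared distance between fireflies.
                    r2 = sum((a - b) ** 2 for a, b in zip(swarm[i], swarm[j]))
                    beta = beta0 * math.exp(-gamma * r2)
                    swarm[i] = [
                        min(hi, max(lo, a + beta * (b - a)
                                    + alpha * (rng.random() - 0.5)))
                        for a, b in zip(swarm[i], swarm[j])
                    ]
                    light[i] = fitness(swarm[i])
                    if light[i] > best_fit:
                        best, best_fit = list(swarm[i]), light[i]
    return fill(rows, missing, best), best_fit

# Toy data scaled to [0, 1]; None marks a missing value.
rows = [[0.1, 0.2], [0.2, None], [0.15, 0.1],
        [0.9, 0.8], [None, 0.9], [0.8, 0.85]]
labels = [0, 0, 0, 1, 1, 1]
completed, score = firefly_impute(rows, labels, bounds=(0.0, 1.0))
```

Here `completed` is the dataset with the missing cells filled in and `score` is the fitness of the best candidate found. Swapping `loo_1nn_accuracy` for another classifier's accuracy changes the fitness without touching the search loop.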

Accuracy is therefore obtained using three different classifiers: K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Naïve Bayesian Classifier (NBC).

The Pima Indian Diabetes Disease (PIDD) dataset is the main dataset used in this study for estimating the missing values and evaluating IFA.

The proposed algorithm is evaluated using two types of experiments. The first experiment validates the generated datasets using k-fold cross-validation (K = 5).

In the second experiment, validation is done using holdout validation, where the generated dataset is divided into a training set (65%) and a testing set (35%).
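The two validation schemes can be sketched as index splits. This is an illustrative sketch only: the function names, the shuffling, the seed, and the rounding rule are assumptions, and the sample count of 768 is the commonly cited size of the Pima dataset rather than a figure taken from this record.

```python
import random

def holdout_split(n_samples, train_fraction=0.65, seed=0):
    """Shuffle sample indices and split them into train and test parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = round(n_samples * train_fraction)
    return idx[:cut], idx[cut:]

def k_fold_splits(n_samples, k=5, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]

n = 768  # assumed dataset size, not stated in this record
train_idx, test_idx = holdout_split(n)   # 65% / 35% holdout
folds = list(k_fold_splits(n, k=5))      # 5-fold cross-validation
```

With k-fold, the reported accuracy would be the mean over the five test folds; with holdout, it is the accuracy on the single 35% test part.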

The results showed that IFA-SVM ranked best based on the average of ten runs, while IFA-NBC ranked worst.

Moreover, IFA with all three classifiers achieved higher accuracy than the four popular techniques it was compared with, which indicates that, in this study, an optimization algorithm used as an imputation algorithm performs better than the statistical methods.

In conclusion, FA can be used to estimate missing values in PIDD and in medical datasets in general.

Main Subjects

Information Technology and Computer Science

No. of Pages

100

Table of contents.

Abstract.

Chapter One : Introduction.

Chapter Two : Background and related works.

Chapter Three : Proposed algorithm.

Chapter Four : Results analysis.

Chapter Five : Conclusion and future works.

References.

American Psychological Association (APA)

Husayn, Huda. (2020). Prediction of missing data technique to improve big data classification (Master's thesis). Isra University, Jordan.
https://search.emarefa.net/detail/BIM-985125

Modern Language Association (MLA)

Husayn, Huda. Prediction of missing data technique to improve big data classification. Master's thesis. Isra University, 2020.
https://search.emarefa.net/detail/BIM-985125

Language

English

Data Type

Arab Theses

Record ID

BIM-985125
