Document classification method based on contents using an improved multinomial naïve Bayes model

Other Title(s)

طريقة تصنيف الوثيقة استنادا إلى محتوياتها باستخدام تحسين نموذج متعدد الحدود نيف بايز

Dissertant

al-Bayati, Junaina Jamil Najm al-Din

Thesis advisor

al-Husayni, Muhammad Abbas Fadil

Committee Members

al-Jarrah, Muzaffar
Kanan, Ghassan Ghazi

University

Middle East University

Faculty

Faculty of Information Technology

Department

Computer Science Department

University Country

Jordan

Degree

Master

Degree Date

2015

English Abstract

Arabic documents are now found in most everyday applications. To be more useful and easier to exploit, they need to be organized and categorized by topic. Text classification is one approach to organizing Arabic documents: it determines which topic a given text belongs to. Numerous studies have been conducted in this field to improve the performance of document classification, particularly for Arabic documents. Arabic is a rich and highly complex inflectional language, which makes simple, standard approaches difficult to apply.

This research aims to improve the performance of the multinomial naive Bayes (MNB) classifier using three different approaches: first by adding n-grams only, second by applying TF-IDF only, and third by using both n-grams and TF-IDF. The improved classifiers were evaluated on the recall, precision and F-measure obtained by each classifier when applied to an Arabic data set that covers six classes and comprises about 1500 distinct Arabic documents.
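
The three configurations described above can be illustrated with a minimal pure-Python sketch. It is not the thesis implementation: the class name ToyMultinomialNB, the helper functions, the smoothed IDF formula, the Laplace smoothing value and the English toy tokens are all assumptions made for illustration, and the documents are assumed to be already tokenized, stop-word filtered and stemmed. Using TF-IDF weights in place of raw counts inside MNB is a common heuristic, which is what the sketch does.

```python
import math
from collections import Counter


def extract_features(tokens, use_bigrams):
    """Return unigram features, plus adjacent-word bigrams when requested."""
    feats = list(tokens)
    if use_bigrams:
        feats += [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]
    return feats


def idf_table(docs_feats):
    """Smoothed inverse document frequency (assumed form): log((1+N)/(1+df)) + 1."""
    n = len(docs_feats)
    df = Counter(f for feats in docs_feats for f in set(feats))
    return {f: math.log((1 + n) / (1 + d)) + 1.0 for f, d in df.items()}


def weigh(feats, idf):
    """Map features to weights: raw counts, or count * idf when idf is given."""
    counts = Counter(feats)
    if idf is None:
        return dict(counts)
    return {f: c * idf.get(f, 0.0) for f, c in counts.items()}


class ToyMultinomialNB:
    """Hypothetical multinomial naive Bayes with optional bigrams and TF-IDF."""

    def __init__(self, use_bigrams=False, use_tfidf=False, alpha=1.0):
        self.use_bigrams = use_bigrams
        self.use_tfidf = use_tfidf
        self.alpha = alpha  # Laplace smoothing constant (assumed value)

    def fit(self, docs, labels):
        feats = [extract_features(d, self.use_bigrams) for d in docs]
        self.idf = idf_table(feats) if self.use_tfidf else None
        self.vocab = {f for fs in feats for f in fs}
        self.class_docs = Counter(labels)
        self.n_docs = len(docs)
        # Per-class sum of feature weights (counts or TF-IDF values).
        self.weights = {y: Counter() for y in self.class_docs}
        for fs, y in zip(feats, labels):
            for f, w in weigh(fs, self.idf).items():
                self.weights[y][f] += w
        self.totals = {y: sum(c.values()) for y, c in self.weights.items()}
        return self

    def predict(self, doc):
        x = weigh(extract_features(doc, self.use_bigrams), self.idf)
        best, best_score = None, -math.inf
        for y in self.class_docs:
            # Log prior plus weighted log likelihood of each known feature.
            score = math.log(self.class_docs[y] / self.n_docs)
            denom = self.totals[y] + self.alpha * len(self.vocab)
            for f, w in x.items():
                if f in self.vocab:  # features unseen in training are ignored
                    score += w * math.log((self.weights[y][f] + self.alpha) / denom)
            if score > best_score:
                best, best_score = y, score
        return best


# Made-up toy data standing in for preprocessed documents.
train_docs = [
    ["team", "won", "match"],
    ["player", "scored", "goal"],
    ["market", "prices", "rose"],
    ["bank", "raised", "rates"],
]
train_labels = ["sport", "sport", "economy", "economy"]

# The three configurations: bigrams only, TF-IDF only, and both.
for name, bg, tf in [("bigram", True, False), ("tfidf", False, True), ("both", True, True)]:
    model = ToyMultinomialNB(use_bigrams=bg, use_tfidf=tf).fit(train_docs, train_labels)
    print(name, model.predict(["team", "scored", "goal"]))
```

On a real corpus the same three configurations would be trained and tested on separate splits, so that their scores are directly comparable.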

The average F-measure over all classes was 81.46% when applying bigrams, 88.88% when applying TF-IDF, and 89.70% when applying the combination of bigrams and TF-IDF.
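
As an illustration of how an average F-measure over all classes can be computed, the following sketch derives precision, recall and F-measure per class and then takes their unweighted mean. The abstract does not say whether the average was unweighted (macro) or weighted by class size, so the macro average is an assumption here, and the function names and the toy labels are hypothetical.

```python
def per_class_scores(y_true, y_pred):
    """Return {class: (precision, recall, f_measure)} from two label lists."""
    scores = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F-measure is the harmonic mean of precision and recall.
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        scores[c] = (precision, recall, f_measure)
    return scores


def macro_f_measure(y_true, y_pred):
    """Unweighted mean of the per-class F-measures (assumed averaging scheme)."""
    scores = per_class_scores(y_true, y_pred)
    return sum(f for _, _, f in scores.values()) / len(scores)


# Made-up labels for two classes, only to exercise the functions.
y_true = ["a", "a", "b", "b"]
y_pred = ["a", "b", "b", "b"]
print(per_class_scores(y_true, y_pred))
print(macro_f_measure(y_true, y_pred))
```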

The differences in F-measure among the proposed classifiers show that the classifier enhanced with both TF-IDF and bigrams achieved the highest values and is the most effective of the three.

The classifier enhanced with TF-IDF only ranked second in effectiveness, followed by the classifier enhanced with bigrams only.

Keywords: Multinomial Naïve Bayes, TF-IDF (Term Frequency-Inverse Document Frequency), N-gram, Arabic Data Set, Tokenization, Stemming, Stop Word Removal.

Main Subjects

Information Technology and Computer Science

No. of Pages

74

Table of Contents

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One : Introduction.

Chapter Two : Literature review.

Chapter Three : Methodology and proposed models.

Chapter Four : The results.

Chapter Five : The evaluation.

Chapter Six : Conclusion and future work.

References.

American Psychological Association (APA)

al-Bayati, Junaina Jamil Najm al-Din. (2015). Document classification method based on contents using an improved multinomial naïve Bayes model. (Master's thesis). Middle East University, Jordan
https://search.emarefa.net/detail/BIM-698776

Modern Language Association (MLA)

al-Bayati, Junaina Jamil Najm al-Din. Document classification method based on contents using an improved multinomial naïve Bayes model. (Master's thesis). Middle East University. (2015).
https://search.emarefa.net/detail/BIM-698776

American Medical Association (AMA)

al-Bayati, Junaina Jamil Najm al-Din. (2015). Document classification method based on contents using an improved multinomial naïve Bayes model. (Master's thesis). Middle East University, Jordan
https://search.emarefa.net/detail/BIM-698776

Language

English

Data Type

Arab Theses

Record ID

BIM-698776