Parallel text classification applied to large scale Arabic text
Other Title(s)
تصنيف النصوص العربية ذات النطاق الواسع على التوازي
Dissertant
al-Qarut, Bushra Umar Ali
Thesis advisor
University
Islamic University
Faculty
Faculty of Information Technology
Department
Information Technology
University Country
Palestine (Gaza Strip)
Degree
Master
Degree Date
2017
English Abstract
Arabic text classification is becoming a focus of research for many researchers in the Arabic text mining field, especially with the rapid growth of Arabic content on the web.
In this research, Naïve Bayes (NB) and Logistic Regression (LR) are used for Arabic text classification in parallel.
When these algorithms are run sequentially, they have high cost and low performance.
Naïve Bayes requires substantial computation and time when it is applied to datasets that are large in both size and feature dimensionality.
Logistic regression, on the other hand, relies on iterative computations that are costly in time and memory.
In addition, neither algorithm gives satisfactory accuracy or efficiency on large Arabic datasets, given that the complex morphology of the Arabic language adds to the computing cost.
Therefore, in order to overcome the above limitations, these algorithms must be redesigned and implemented in parallel.
In this research, we design and implement parallelized Naïve Bayes and Logistic Regression algorithms for large-scale Arabic text classification.
Large-scale Arabic text corpora are collected and created.
The text is then preprocessed into a representation suitable for classification in two phases: sequential text preprocessing, followed by parallel term weighting with TF-IDF.
The parallelized NB and LR algorithms are designed on the MapReduce model and executed on Apache Spark, which processes big data in memory.
Various experiments are conducted on a standalone machine and on computer clusters of 2, 4, 8, and 16 nodes.
The results of these experiments are collected and analysed.
We found that applying stemming reduces the size of the dataset documents and affects classification accuracy, with root stemming giving more accurate results than light (light1) stemming.
When fast results are needed, NB is suitable and returns high accuracy rates of around 99% for large-scale documents with high dimensionality.
LR also gives accurate results but takes longer than NB.
LR gives 93% accuracy on the AlBokhary corpus, compared with 89% for NB on the same corpus.
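The abstract describes parallel TF-IDF term weighting and a Naïve Bayes classifier designed on the MapReduce model. The following is a minimal sketch of that general idea in plain Python, not the thesis's Spark implementation: each partition is mapped to partial counts, and the partial counts are merged in a reduce step. The toy corpus, the function names, and the choice to train NB on raw term counts with Laplace smoothing are all assumptions made for illustration.

```python
import math
from collections import Counter
from functools import reduce

# Hypothetical toy corpus: (label, tokens) pairs, already preprocessed
# and split into two partitions that could be processed independently.
partitions = [
    [("sports", ["goal", "match", "team"]), ("economy", ["market", "bank"])],
    [("sports", ["team", "win", "match"]), ("economy", ["bank", "trade", "market"])],
]

def map_doc_freq(partition):
    """Map step: count how many documents in this partition contain each term."""
    df = Counter()
    for _, tokens in partition:
        df.update(set(tokens))
    return df

def map_class_counts(partition):
    """Map step: per-class document counts and per-class term counts."""
    docs, terms = Counter(), {}
    for label, tokens in partition:
        docs[label] += 1
        terms.setdefault(label, Counter()).update(tokens)
    return docs, terms

def merge_class_counts(a, b):
    """Reduce step: merge two partial (docs, terms) results."""
    docs = a[0] + b[0]
    labels = set(a[1]) | set(b[1])
    terms = {c: a[1].get(c, Counter()) + b[1].get(c, Counter()) for c in labels}
    return docs, terms

n_docs = sum(len(p) for p in partitions)

# Document frequencies: map over partitions, then reduce by summing counters.
doc_freq = reduce(lambda x, y: x + y, map(map_doc_freq, partitions))
idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf(tokens):
    """TF-IDF weights for one document (relative term frequency times IDF)."""
    tf = Counter(tokens)
    return {t: (c / len(tokens)) * idf.get(t, 0.0) for t, c in tf.items()}

# Naive Bayes sufficient statistics: map over partitions, then merge.
class_docs, class_terms = reduce(merge_class_counts, map(map_class_counts, partitions))
vocab = set(doc_freq)

def predict(tokens):
    """Multinomial Naive Bayes on raw counts with Laplace (add-one) smoothing."""
    best_label, best_score = None, -math.inf
    for label in class_docs:
        total = sum(class_terms[label].values())
        score = math.log(class_docs[label] / n_docs)
        for t in tokens:
            score += math.log((class_terms[label][t] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Because both the document-frequency counts and the per-class counts are simple sums, the partial results can be merged in any order, which is what makes this kind of training a natural fit for a map and reduce formulation. Logistic regression is different in that each training iteration needs its own pass over the data.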
Main Subjects
Information Technology and Computer Science
No. of Pages
75
Table of Contents
Table of contents.
Abstract.
Abstract in Arabic.
Chapter One : Introduction.
Chapter Two : Theoretical and technical foundation.
Chapter Three : Related works.
Chapter Four : Parallel classification of Arabic text using NB and LR.
Chapter Five : Experimental results and approach evaluation.
Chapter Six : Conclusion and future work.
References.
American Psychological Association (APA)
al-Qarut, Bushra Umar Ali. (2017). Parallel text classification applied to large scale Arabic text. (Master's thesis). Islamic University, Palestine (Gaza Strip).
https://search.emarefa.net/detail/BIM-905421
Modern Language Association (MLA)
al-Qarut, Bushra Umar Ali. Parallel text classification applied to large scale Arabic text. (Master's thesis). Islamic University. (2017).
https://search.emarefa.net/detail/BIM-905421
American Medical Association (AMA)
al-Qarut, Bushra Umar Ali. (2017). Parallel text classification applied to large scale Arabic text. (Master's thesis). Islamic University, Palestine (Gaza Strip).
https://search.emarefa.net/detail/BIM-905421
Language
English
Data Type
Arab Theses
Record ID
BIM-905421