Parallel text classification applied to large scale Arabic text

Other Title(s)

تصنيف النصوص العربية ذات النطاق الواسع على التوازي

Dissertant

al-Qarut, Bushra Umar Ali

Thesis advisor

Barakah, Ribhi Sulayman

University

Islamic University

Faculty

Faculty of Information Technology

Department

Information Technology

University Country

Palestine (Gaza Strip)

Degree

Master

Degree Date

2017

English Abstract

Arabic text classification is attracting growing attention from researchers in Arabic text mining, especially given the rapid growth of Arabic content on the web.

In this research, Naïve Bayes (NB) and Logistic Regression (LR) are used for Arabic text classification in parallel.

When these algorithms are run sequentially, they are costly and perform poorly.

Naïve Bayes requires substantial computation and time when applied to datasets that are large in both size and feature dimensionality.

Logistic regression, on the other hand, relies on iterative computations that consume considerable time and memory.
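The iterative cost mentioned above can be illustrated with a minimal sketch, assuming plain batch gradient descent on the log-loss. The abstract does not state which optimizer the thesis uses, so the function names (`partition_gradient`, `fit`), the learning rate, and the iteration count are all hypothetical. Each iteration makes a full pass over every partition, and the per-partition gradients are summed in a reduce step.

```python
import math
from functools import reduce

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def partition_gradient(partition, w):
    """Map step: sum the log-loss gradients of all examples in one partition."""
    grad = [0.0] * len(w)
    for x, y in partition:
        err = sigmoid(sum(wi * xi for wi, xi in zip(w, x))) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad

def add(a, b):
    """Reduce step: element-wise sum of two gradient vectors."""
    return [ai + bi for ai, bi in zip(a, b)]

def fit(partitions, dim, lr=0.5, iterations=200):
    """Batch gradient descent; lr and iterations are illustrative values only."""
    w = [0.0] * dim
    n = sum(len(p) for p in partitions)
    for _ in range(iterations):
        # Every iteration touches all partitions, which is the repeated cost.
        grad = reduce(add, (partition_gradient(p, w) for p in partitions))
        w = [wi - lr * gi / n for wi, gi in zip(w, grad)]
    return w

# Toy data: each example is ([bias, feature], label), split over two partitions.
partitions = [
    [([1.0, 0.0], 0), ([1.0, 1.0], 0)],
    [([1.0, 3.0], 1), ([1.0, 4.0], 1)],
]
w = fit(partitions, dim=2)
```

In a cluster setting the map step would run on each node and only the small gradient vectors would be exchanged per iteration.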

Also, neither algorithm achieves satisfactory accuracy and efficiency on large Arabic datasets, particularly because the complex morphology of Arabic adds to the computing cost.

Therefore, in order to overcome the above limitations, these algorithms must be redesigned and implemented in parallel.

In this research, we design and implement parallelized Naïve Bayes and Logistic Regression algorithms for large-scale Arabic text classification.

Large-scale Arabic text corpora are collected and created.

The text is then preprocessed into a representation suitable for classification in two phases: sequential text preprocessing, followed by term weighting with TF-IDF in parallel.
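The TF-IDF weighting phase can be sketched as a map step per document followed by a reduce step over document frequencies. This is a minimal illustration under assumptions: the abstract does not give the exact TF-IDF variant, so the formula below (relative term frequency times the natural log of N over document frequency) and the names `map_doc` and `tf_idf` are hypothetical, and the tokens are placeholders standing in for preprocessed Arabic terms.

```python
import math
from collections import Counter
from functools import reduce

def map_doc(tokens):
    """Map step: term frequencies of one document and its one-per-document term set."""
    tf = Counter(tokens)
    df = Counter(tf.keys())  # each distinct term counts once for this document
    return tf, df

def tf_idf(docs):
    # The map step is independent per document, so it could run in parallel.
    mapped = [map_doc(d) for d in docs]
    # Reduce step: add up document frequencies across all documents.
    df = reduce(lambda a, b: a + b, (m[1] for m in mapped), Counter())
    n = len(docs)
    weights = []
    for tf, _ in mapped:
        total = sum(tf.values())
        weights.append({t: (c / total) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

# Placeholder tokens; real input would be stemmed Arabic terms.
docs = [["a", "b", "a"], ["b", "c"]]
w = tf_idf(docs)
```

A term that appears in every document, such as `"b"` here, receives weight zero under this variant because its inverse document frequency is log(1).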

The parallelized NB and LR algorithms are designed based on the MapReduce model and executed with Apache Spark, which processes big data in memory.
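How Naïve Bayes training fits the MapReduce model can be shown with a small sketch in plain Python rather than Spark. It assumes a multinomial model with add-one smoothing, which the abstract does not confirm. The function names and the toy labels are hypothetical. Each partition is counted independently in the map step, and the counters are added in the reduce step.

```python
import math
from collections import Counter
from functools import reduce

def map_partition(partition):
    """Map step: count documents per class and term occurrences per (class, term)."""
    class_counts, term_counts = Counter(), Counter()
    for label, tokens in partition:
        class_counts[label] += 1
        for t in tokens:
            term_counts[(label, t)] += 1
    return class_counts, term_counts

def merge(a, b):
    """Reduce step: add the counters produced by two partitions."""
    return a[0] + b[0], a[1] + b[1]

def train(partitions):
    class_counts, term_counts = reduce(merge, map(map_partition, partitions))
    vocab = {t for (_, t) in term_counts}
    totals = Counter()
    for (label, _), c in term_counts.items():
        totals[label] += c
    return class_counts, term_counts, totals, vocab

def predict(model, tokens):
    class_counts, term_counts, totals, vocab = model
    n = sum(class_counts.values())

    def score(label):
        # Log prior plus add-one smoothed log likelihoods (assumed smoothing).
        s = math.log(class_counts[label] / n)
        for t in tokens:
            if t in vocab:
                s += math.log((term_counts[(label, t)] + 1) / (totals[label] + len(vocab)))
        return s

    return max(class_counts, key=score)

# Toy labelled documents split over two partitions (placeholder tokens).
partitions = [
    [("sport", ["goal", "match"]), ("economy", ["market", "price"])],
    [("sport", ["match", "team"]), ("economy", ["price", "bank"])],
]
model = train(partitions)
```

Because training reduces to summing counts, a single pass over the data is enough, which is consistent with NB being the faster of the two algorithms.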

Various experiments are conducted on a standalone machine and on computer clusters of 2, 4, 8, and 16 nodes.

The results of these experiments are collected and analysed.

We found that applying stemming reduced the size of the dataset documents and affected classification accuracy, with root stemming giving more accurate results than light (light1) stemming.

When fast results are needed, NB is suitable and returns high accuracy rates of around 99% for large-scale documents with high dimensionality.

LR also gives accurate results, but it takes longer than NB.

LR gives 93% accuracy on the AlBokhary corpus, compared to 89% for NB on the same corpus.

Main Subjects

Information Technology and Computer Science

No. of Pages

75

Table of Contents

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One : Introduction.

Chapter Two : Theoretical and technical foundation.

Chapter Three : Related works.

Chapter Four : Parallel classification of Arabic text using NB and LR.

Chapter Five : Experimental results and approach evaluation.

Chapter Six : Conclusion and future work.

References.

American Psychological Association (APA)

al-Qarut, Bushra Umar Ali. (2017). Parallel text classification applied to large scale Arabic text. (Master's thesis). Islamic University, Palestine (Gaza Strip)
https://search.emarefa.net/detail/BIM-905421

Modern Language Association (MLA)

al-Qarut, Bushra Umar Ali. Parallel text classification applied to large scale Arabic text. (Master's thesis). Islamic University. (2017).
https://search.emarefa.net/detail/BIM-905421

American Medical Association (AMA)

al-Qarut, Bushra Umar Ali. (2017). Parallel text classification applied to large scale Arabic text. (Master's thesis). Islamic University, Palestine (Gaza Strip)
https://search.emarefa.net/detail/BIM-905421

Language

English

Data Type

Arab Theses

Record ID

BIM-905421