The impact of text preprocessing and term weighting on Arabic text classification

Other Title(s)

أثر معالجة النصوص و توزين الكلمات على تصنيف النصوص العربية

Dissertant

Sad, Mutazz Khalid

Thesis advisor

Ashur, Wisam

Committee Members

Abu Haybah, Ibrahim S. I.
al-Halis, Ala Mustafa

University

Islamic University

Faculty

Faculty of Engineering

Department

Department of Computer Engineering

University Country

Palestine (Gaza Strip)

Degree

Master

Degree Date

2010

English Abstract

This research investigates and compares the impact of text preprocessing, which has not been addressed before, on Arabic text classification using popular text classification algorithms: Decision Tree, K-Nearest Neighbors, Support Vector Machines, and Naïve Bayes and its variations.

Text preprocessing includes applying different term weighting schemes and Arabic morphological analysis (stemming and light stemming).

We implemented and integrated Arabic morphological analysis tools within the leading open-source machine learning tools: Weka and RapidMiner.

Text classification algorithms are applied to seven Arabic corpora (three collected in-house and four existing corpora).

Experimental results show: (1) Light stemming with term pruning is the best feature reduction technique.

(2) Support Vector Machines and Naïve Bayes variations outperform other algorithms.

(3) Weighting schemes affect the performance of distance-based classifiers.

Main Subjects

Information Technology and Computer Science

No. of Pages

100

Table of Contents

Table of contents.

Abstract.

Chapter 1 : Introduction.

Chapter 2 : Related work.

Chapter 3 : Text classifiers.

Chapter 4 : Text preprocessing.

Chapter 5 : Corpora.

Chapter 6 : Experimental results and analysis.

Chapter 7 : Conclusion and future work.

American Psychological Association (APA)

Sad, Mutazz Khalid. (2010). The impact of text preprocessing and term weighting on Arabic text classification (Master's thesis). Islamic University, Palestine (Gaza Strip).
https://search.emarefa.net/detail/BIM-300841

Modern Language Association (MLA)

Sad, Mutazz Khalid. The impact of text preprocessing and term weighting on Arabic text classification (Master's thesis). Islamic University. (2010).
https://search.emarefa.net/detail/BIM-300841

American Medical Association (AMA)

Sad, Mutazz Khalid. (2010). The impact of text preprocessing and term weighting on Arabic text classification (Master's thesis). Islamic University, Palestine (Gaza Strip).
https://search.emarefa.net/detail/BIM-300841

Language

English

Data Type

Arab Theses

Record ID

BIM-300841