Arabic text classification using dynamic N-Gram

Other Title(s)

تصنيف النصوص العربية باستخدام الانغرام المتغير

Dissertant

al-Amush, Safa Qasim

Thesis advisor

Samawi, Venus W.

University

Al albayt University

Faculty

Prince Hussein Bin Abdullah Faculty for Information Technology

Department

Department of Computer Science

University Country

Jordan

Degree

Master

Degree Date

2013

English Abstract

An N-gram is defined as a subsequence of N items from a given sequence.

When the text is noisy, N-grams are an ideal solution.

Therefore, we are interested in using N-grams to represent text documents.

In the literature, N-gram sometimes refers to sequences that are neither ordered nor consecutive.

In this thesis, an N-gram will refer to a chain of N consecutive characters.

A few studies have used a static value of N for Arabic text classification and information retrieval.

In static N-gram, the text is segmented into N-grams of the same length (the value of N), such as 3, 4, or 5.

The problem with this type of text representation is that any word or stem with fewer than N characters is neglected and treated as a useless word.

For example, if N = 4, all words with fewer than 4 letters are neglected.
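
The short-word problem described above can be illustrated with a minimal sketch. It assumes that N-grams are taken per whitespace-separated word; the function name `static_char_ngrams` is hypothetical and not taken from the thesis.

```python
def static_char_ngrams(text, n):
    """Return the character n-grams of every word in `text`.

    Assumption: grams are extracted word by word. A word with fewer
    than n characters produces no grams, so it is silently dropped.
    """
    grams = []
    for word in text.split():
        # range() is empty when len(word) < n, so short words add nothing
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


# With N = 4 the two-letter word "to" disappears from the representation.
print(static_char_ngrams("to school", 4))  # ['scho', 'choo', 'hool']
```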

Our work is concerned with developing an automated system for classifying Arabic text documents using N-grams as the text representation.

We propose dynamic N-gram, where N is determined dynamically (based on word length) to reduce the number of common grams that may belong to entirely different words.
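
A rough sketch of the idea follows. The abstract does not state the rule that maps word length to N, so the rule in `hypothetical_n_for` (including its thresholds) is an invented placeholder for illustration only, and both function names are hypothetical.

```python
def hypothetical_n_for(word):
    """Pick N from the word length.

    Assumed rule, NOT the rule used in the thesis: short words are
    kept whole, longer words get a larger N.
    """
    length = len(word)
    if length <= 3:
        return length  # keep short words as a single gram
    if length <= 5:
        return 3
    return 4


def dynamic_char_ngrams(text, n_for=hypothetical_n_for):
    """Return character n-grams where N is chosen separately for each word."""
    grams = []
    for word in text.split():
        n = n_for(word)
        grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


# Unlike a static N = 4, the short word "to" is kept.
print(dynamic_char_ngrams("to school"))  # ['to', 'scho', 'choo', 'hool']
```

Passing the rule in as `n_for` keeps the segmentation loop separate from the choice of N, so a different length-to-N mapping can be substituted without touching the rest.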

To study the performance of dynamic N-gram (whether or not it improves classification accuracy), both a traditional static N-gram system and the suggested dynamic N-gram system have been built.

The results of the two systems are compared in terms of accuracy, recall, precision, and F-measure.

F-measure is a standard statistical measure used to evaluate the performance of a classifier.

The F-measure is a combined score based on precision and recall (in its common balanced form, their harmonic mean).

Our proposed system consists of a number of phases: document preprocessing, document feature extraction, construction of the classifier, and document classification.
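
For reference, precision, recall, and the F-measure discussed above can be computed for a single class as in the sketch below. It assumes the balanced F1 form (harmonic mean of precision and recall); the abstract does not say which variant or averaging scheme the thesis uses, and the function name and sample labels are illustrative.

```python
def precision_recall_f1(true_labels, predicted_labels, positive):
    """Compute precision, recall and F1 for one class (`positive`).

    Assumption: balanced F1 = 2PR / (P + R). Undefined ratios are
    reported as 0.0.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Made-up labels for illustration: one hit, one false alarm, one miss.
truth = ["sport", "sport", "news", "news"]
guess = ["sport", "news", "sport", "news"]
print(precision_recall_f1(truth, guess, "sport"))  # (0.5, 0.5, 0.5)
```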

We have constructed two classifiers: a Naïve Bayes (NB) classifier and a Dice-measure distance classifier.
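
Of the two, the Dice-measure classifier is simple enough to sketch briefly. The version below assumes that each class is represented by a set of N-grams and that a document is assigned to the class whose profile has the highest Dice coefficient, 2|A ∩ B| / (|A| + |B|). How the thesis builds its class profiles is not stated in the abstract, so the function names and the toy profiles are hypothetical.

```python
def dice_similarity(grams_a, grams_b):
    """Dice coefficient between two collections of n-grams, treated as sets."""
    a, b = set(grams_a), set(grams_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty inputs
    return 2 * len(a & b) / (len(a) + len(b))


def classify_by_dice(doc_grams, class_profiles):
    """Return the label whose n-gram profile is most similar to the document.

    `class_profiles` maps a label to a set of n-grams (assumed layout).
    """
    return max(
        class_profiles,
        key=lambda label: dice_similarity(doc_grams, class_profiles[label]),
    )


# Toy profiles, invented for illustration only.
profiles = {
    "sport": {"goa", "oal", "tea"},
    "economy": {"ban", "ank", "mar"},
}
print(classify_by_dice({"goa", "oal"}, profiles))  # sport
```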

Finally, in the classification phase, we evaluated the performance of our proposed system using the Diab dataset and calculated the standard evaluation measures mentioned above.

The classification results were promising (F-measure = 98.87% with the Dice-measure classifier).

We also found that the Dice-measure classifier performs better when dynamic N-gram is used.

Main Subjects

Information Technology and Computer Science

No. of Pages

52

Table of Contents

Table of contents.

Abstract.

Chapter One : Introduction.

Chapter Two : Literature survey.

Chapter Three : Theoretical background.

Chapter Four : The proposed system phases and experimental results.

Chapter Five : Conclusions and future works.

References.

American Psychological Association (APA)

al-Amush, Safa Qasim. (2013). Arabic text classification using dynamic N-Gram. (Master's thesis). Al albayt University, Jordan
https://search.emarefa.net/detail/BIM-415991

Modern Language Association (MLA)

al-Amush, Safa Qasim. Arabic text classification using dynamic N-Gram. (Master's thesis). Al albayt University. (2013).
https://search.emarefa.net/detail/BIM-415991

American Medical Association (AMA)

al-Amush, Safa Qasim. (2013). Arabic text classification using dynamic N-Gram. (Master's thesis). Al albayt University, Jordan
https://search.emarefa.net/detail/BIM-415991

Language

English

Data Type

Arab Theses

Record ID

BIM-415991