Arabic word root extraction and automatic categorization of Arabic web documents

Dissertant

al-Kurdi, Muhammad

Thesis advisor

Rashidi, T.
Bensaid, A.

University

Al Akhawayn University

Faculty

School of Science and Engineering

Department

Computer Science

University Country

Morocco

Degree

Master

Degree Date

2003

English Abstract

The aim of this thesis is twofold.

Firstly, it proposes a concatenative approach for Arabic root extraction.

This novel approach uses a subset of Arabic word derivations and a subset of Arabic roots.

This approach has been tested on random Arabic web documents, and preliminary results show superior performance to traditional Arabic rule-based root extraction algorithms, and also to existing thesaurus-based stemming algorithms.

The data set consists of 28 (twenty-eight) Arabic documents representing distinct topics from different sources.

The results show that this technique extracts the correct root in 97.15% of cases and a wrong root in 3.85% of cases.

Secondly, this work evaluates a statistical machine learning algorithm (Naive Bayes) for the automatic categorization of Arabic web documents into one of five pre-defined categories, after the documents have been stemmed using the concatenative approach for Arabic root extraction.

The data set used during these experiments consists of 300 web documents per category.

The cross-validation results show that categorization accuracy varies from one category to another, with an average accuracy over all categories of 68.78%.

Furthermore, the best per-category performance in the cross-validation experiments reaches 92.8%.

Further testing is carried out on a manually collected evaluation set, which consists of 10 documents from each of the 5 categories; the overall classification accuracy achieved over all categories is 62% and the best result by category goes up to 90%.
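
As an illustration of the categorization step described above, the following is a minimal sketch of a multinomial Naive Bayes classifier over root tokens. It is not the thesis implementation: the Laplace smoothing, the function names, the transliterated roots, and the two category labels are assumptions chosen for the example, and the abstract does not name the five categories actually used.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """Collect counts from (tokens, label) pairs; tokens are assumed to be extracted roots."""
    class_counts = Counter()             # number of documents per category
    token_counts = defaultdict(Counter)  # root frequencies per category
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        token_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def classify(model, tokens):
    """Return the category with the highest log posterior (Laplace-smoothed)."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        denom = sum(token_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # roots never seen in training are ignored
                score += math.log((token_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: transliterated roots and made-up category names.
training = [
    (["ktb", "drs", "Elm"], "education"),
    (["drs", "Elm", "ktb"], "education"),
    (["lEb", "frq", "fwz"], "sport"),
    (["fwz", "lEb", "hdf"], "sport"),
]
model = train_naive_bayes(training)
predicted = classify(model, ["drs", "ktb"])
```

Working in log space avoids numeric underflow when a document contains many roots, and the add-one smoothing keeps a root that is absent from one category from driving that category's probability to zero.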

Main Subjects

Engineering & Technology Sciences (Multidisciplinary)
Languages & Comparative Literature
Arabic language and Literature

No. of Pages

65

Table of Contents

Table of contents.

Abstract.

Abstract in Arabic.

Introduction.

Chapter One : Stemming and Arabic root extraction.

Chapter Two : Categorization of Arabic web documents.

Chapter Three : Integration of the stemmer and the categorizer onto.

Chapter Four : Conclusions and future works.

References.

American Psychological Association (APA)

al-Kurdi, Muhammad. (2003). Arabic word root extraction and automatic categorization of Arabic web documents. (Master's thesis). Al Akhawayn University, Morocco
https://search.emarefa.net/detail/BIM-644980

Modern Language Association (MLA)

al-Kurdi, Muhammad. Arabic word root extraction and automatic categorization of Arabic web documents. (Master's thesis). Al Akhawayn University. (2003).
https://search.emarefa.net/detail/BIM-644980

American Medical Association (AMA)

al-Kurdi, Muhammad. (2003). Arabic word root extraction and automatic categorization of Arabic web documents. (Master's thesis). Al Akhawayn University, Morocco
https://search.emarefa.net/detail/BIM-644980

Language

English

Data Type

Arab Theses

Record ID

BIM-644980