Arabic word root extraction and automatic categorization of Arabic web documents

Thesis author

al-Kurdi, Muhammad

Thesis advisor

Rashidi, T.
Bensaid, A.

University

Al Akhawayn University

Faculty

Faculty of Engineering and Science

Academic department

Computer Science

University country

Morocco

Degree

Master's

Degree date

2003

English abstract

The aim of this thesis is two-fold.

Firstly, it proposes a concatenative approach for Arabic root extraction.

This novel approach uses a subset of Arabic word derivations and a subset of Arabic roots.

This approach has been tested on random Arabic web documents, and preliminary results show superior performance to traditional Arabic rule-based root extraction algorithms, and also to existing thesaurus-based stemming algorithms.

The data set consists of 28 (twenty-eight) Arabic documents representing distinct topics from different sources.

The results show that this technique extracts the correct root in 97.15% of cases and a wrong root in 3.85% of cases.
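The abstract does not spell out how the subset of derivations and the subset of roots are combined. One plausible reading, sketched here purely as an assumption, is a generate-and-look-up scheme: each derivation pattern (written with the conventional placeholder letters ف ع ل) is filled with the consonants of each known root, and an input word is matched against the generated forms after an optional prefix is removed. The pattern list, root list, prefix list, and the function names `apply_pattern`, `build_table`, and `extract_root` are all hypothetical and are not taken from the thesis.

```python
# Hypothetical, tiny subsets chosen only for illustration.
PATTERNS = ["فعل", "فاعل", "مفعول"]   # derivation templates over the placeholders ف ع ل
ROOTS = ["كتب", "درس"]                # three-letter roots
PREFIXES = ["", "ال", "و"]            # "" means "no prefix"


def apply_pattern(pattern, root):
    """Fill the placeholder letters ف ع ل of a pattern with the root's consonants."""
    slots = dict(zip("فعل", root))
    # Letters that are not placeholders (e.g. م, ا, و) are copied unchanged.
    return "".join(slots.get(ch, ch) for ch in pattern)


def build_table(roots, patterns):
    """Map every generated word form back to the root it was built from."""
    return {apply_pattern(p, r): r for r in roots for p in patterns}


def extract_root(word, table):
    """Return the root of `word`, or None if no generated form matches."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and word[len(prefix):] in table:
            return table[word[len(prefix):]]
    return None


table = build_table(ROOTS, PATTERNS)
print(extract_root("مكتوب", table))    # form generated from the root كتب
print(extract_root("الكاتب", table))   # same root, after removing the prefix ال
```

A lookup table keeps extraction to a few dictionary probes per word, at the cost of covering only the roots and patterns that were enumerated. Patterns whose fixed letters include ف, ع, or ل would need a different placeholder notation than the one assumed here.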

Secondly, this work evaluates the application of a statistical machine learning algorithm (Naive Bayes) on the automatic categorization of Arabic web documents (after being stemmed using our concatenative approach for Arabic root extraction) to one of five pre-defined categories.

The data set used during these experiments consists of 300 web documents per category.

The results of cross-validation show that the categorization accuracy varies from one category to another, with an average accuracy over all categories of 68.78%.

Furthermore, the best categorization performance by category during cross validation experiments goes up to 92.8%.

Further testing is carried out on a manually collected evaluation set, which consists of 10 documents from each of the 5 categories; the overall classification accuracy achieved over all categories is 62% and the best result by category goes up to 90%.
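The categorization step can be illustrated with a minimal multinomial Naive Bayes classifier operating on already-extracted roots. This is a sketch under stated assumptions, not the thesis's implementation: the category names, the toy training documents, the add-one (Laplace) smoothing, and the function names `train_naive_bayes` and `classify` are all illustrative, since the record does not state the actual categories, features, or smoothing used.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs):
    """docs: list of (tokens, category) pairs, where tokens are extracted roots."""
    class_counts = Counter()             # number of documents per category
    token_counts = defaultdict(Counter)  # token frequencies per category
    vocab = set()
    for tokens, category in docs:
        class_counts[category] += 1
        token_counts[category].update(tokens)
        vocab.update(tokens)
    return class_counts, token_counts, vocab


def classify(tokens, model):
    """Return the category with the highest log posterior for the given tokens."""
    class_counts, token_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_category, best_score = None, float("-inf")
    for category, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        total_tokens = sum(token_counts[category].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # tokens never seen in training carry no evidence
            # Add-one smoothing avoids zero probabilities (assumed, not from the thesis).
            likelihood = (token_counts[category][tok] + 1) / (total_tokens + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy training set with hypothetical categories and roots.
TRAINING = [
    (["لعب", "فوز", "كرة"], "sports"),
    (["سوق", "مال", "بيع"], "economy"),
]
model = train_naive_bayes(TRAINING)
print(classify(["لعب", "فوز"], model))
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents contain hundreds of roots.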

Main subjects

Engineering and Technology Sciences (Multidisciplinary)
Languages and Comparative Literature
Arabic Language and Literature

Topics

Number of pages

65

Table of contents

Table of contents.

Abstract.

Abstract in Arabic.

Introduction.

Chapter One: Stemming and Arabic root extraction.

Chapter Two: Categorization of Arabic web documents.

Chapter Three: Integration of the stemmer and the categorizer onto.

Chapter Four: Conclusions and future works.

References.

American Psychological Association (APA) citation style

al-Kurdi, Muhammad. (2003). Arabic word root extraction and automatic categorization of Arabic web documents. (Master's thesis). Al Akhawayn University, Morocco
https://search.emarefa.net/detail/BIM-644980

Modern Language Association (MLA) citation style

al-Kurdi, Muhammad. Arabic word root extraction and automatic categorization of Arabic web documents. (Master's thesis). Al Akhawayn University. (2003).
https://search.emarefa.net/detail/BIM-644980

American Medical Association (AMA) citation style

al-Kurdi, Muhammad. (2003). Arabic word root extraction and automatic categorization of Arabic web documents. (Master's thesis). Al Akhawayn University, Morocco
https://search.emarefa.net/detail/BIM-644980

Text language

English

Data type

Theses

Record number

BIM-644980