Novel approach to vocalized Arabic words root extraction : application to automatic Arabic text summarization

Dissertant

Muhammadi, Munir

Thesis advisor

Shukayri, Abd Allah
Rashidi, Taj al-Din

University

Al Akhawayn University

Faculty

School of Science and Engineering

Department

Computer Science

University Country

Morocco

Degree

Master

Degree Date

2006

English Abstract

This thesis has a twofold objective: (i) to propose, implement, and evaluate a novel approach for extracting roots of vocalized Arabic words and (ii) to develop and assess a graph-based technique that uses the developed root extraction algorithm for automatically generating Arabic document summaries.

The proposed Vocalized Arabic Word Root Extraction (VAWRE) approach continues previous research conducted at the Arabic Computing research laboratory at Al Akhawayn University on the development of an Arabic stemmer [1] that has been integrated into the Barq search engine [2].

The approach is a result of a linguistic analysis of the non-concatenative morphology and the complex orthography of the Arabic language.

The VAWRE algorithm uses a manually constructed thesaurus of 8,950 Arabic roots and a maintained list of 45 morphological template sets [3].

The constructed root thesaurus, together with the list of morphological template sets, covers the most frequent words that appear in modern Arabic text.

The algorithm extracts the most precise root (or the set of all possible roots in case of ambiguity), rather than stems.

The approach makes use of diacritic marks, which in Arabic serve mainly as short vowels, to reduce root ambiguity and hence improve root extraction precision.
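The role of diacritics in narrowing down candidate roots can be illustrated with a minimal sketch. It is not the VAWRE algorithm itself: the Latin transliteration, the three templates, the three-entry root set, and all function names are assumptions made for illustration only. Digits in a template mark root-letter slots, `a`/`i`/`u` stand for short vowels, `o` for sukun, and `~` for shadda. A word matches a template when the consonant skeletons agree and every diacritic the word actually carries is also present in the template at the same position, so unvocalized and partially vocalized words are handled by the same routine.

```python
DIACRITICS = set("aiuo~")  # toy marks: short vowels a/i/u, sukun "o", shadda "~"

# Hypothetical, tiny stand-ins for the root thesaurus and the template sets.
ROOTS = {"ktb", "tmr", "mrr"}
TEMPLATES = [
    ("1a2o3", "123"),    # fa'l noun pattern, e.g. tamr
    ("1a2a3a", "123"),   # fa'ala verb pattern, e.g. kataba
    ("ta1u2~u", "122"),  # imperfect of a doubled root, e.g. tamurru
]


def parse(text):
    """Split text into [letter, set_of_marks] slots, one slot per letter."""
    slots = []
    for ch in text:
        if ch in DIACRITICS:
            if not slots:
                raise ValueError("diacritic without a preceding letter")
            slots[-1][1].add(ch)
        else:
            slots.append([ch, set()])
    return slots


def match(word, pattern, root_spec):
    """Return the root implied by the pattern, or None if the word does not fit."""
    w, p = parse(word), parse(pattern)
    if len(w) != len(p):
        return None
    captured = {}
    for (w_letter, w_marks), (p_letter, p_marks) in zip(w, p):
        # Marks present on the word must agree with the template;
        # missing marks place no constraint (partial or no vocalization).
        if not w_marks <= p_marks:
            return None
        if p_letter.isdigit():
            if captured.setdefault(p_letter, w_letter) != w_letter:
                return None
        elif p_letter != w_letter:
            return None
    return "".join(captured[i] for i in root_spec)


def extract_roots(word):
    """Return every lexicon root compatible with the word's letters and marks."""
    candidates = {match(word, pat, spec) for pat, spec in TEMPLATES}
    return {root for root in candidates if root in ROOTS}
```

In this toy setting, the unvocalized input `tmr` is compatible with two roots (`tmr` and `mrr`), while the vocalized forms `tamor` and `tamur~u` each leave a single candidate, and the partially vocalized `tmr~` is already enough to rule out `tmr`.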

Moreover, it provides enough flexibility to handle fully vocalized, partially vocalized and non-vocalized words, so as to cope with the recognizable lack of a standardized punctuation model in modern Arabic texts.

The implemented approach has been tested on evaluation corpora, which consist of 258 Arabic text documents collected from the Web.

The results show that the VAWRE algorithm achieves an overall performance of 85% and an average root extraction correctness of 77%.

Moreover, the results show that using vocalization in root extraction reduces root ambiguity by 33% on average.

The second part of this thesis aims to develop a technique for automatically generating Arabic document summaries.

The motivation behind developing this automatic summarization system is to construct a simple graph-based platform that will be further enhanced for the purpose of measuring the impact of integrating the developed root extraction algorithm on the retrieval effectiveness.

From our perspective, and in the context of text summarization, retrieval effectiveness refers to the capability of extracting the most important (central) sentences in a text.

The developed Automatic Arabic Text Summarizer (AATS) is a graph-based technique that uses the roots extracted by the VAWRE algorithm to enhance the content overlap measures among sentences.

The central idea of our summarization technique is to construct a fully-connected and weighted graph that represents the document to be summarized.

The graph vertices represent the extracted sentences, and the weighted edges reflect how strongly the document's sentences are interconnected in terms of semantic similarity.

The accuracy of the inter-sentence similarity measure is improved by considering shared keywords and common morphological roots obtained by processing the document's words with the VAWRE algorithm.
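A content-overlap measure of this kind might look like the sketch below. The exact formula used in the thesis is not given in this record, so the weighted sum, the default weights, and the logarithmic length normalization are all assumptions, and the function name is hypothetical. Each sentence is represented as a list of (word, root) pairs, where the root would come from a root extractor.

```python
import math


def similarity(sent_a, sent_b, word_weight=1.0, root_weight=2.0):
    """Weighted overlap of shared words and shared roots between two sentences.

    sent_a, sent_b: lists of (word, root) pairs; root may be None if unknown.
    The normalization by log sentence lengths is an assumption, not the
    thesis's formula.
    """
    words_a = {word for word, _ in sent_a}
    words_b = {word for word, _ in sent_b}
    roots_a = {root for _, root in sent_a if root}
    roots_b = {root for _, root in sent_b if root}

    overlap = (word_weight * len(words_a & words_b)
               + root_weight * len(roots_a & roots_b))
    norm = math.log(len(sent_a) + 1) + math.log(len(sent_b) + 1)
    return overlap / norm if norm > 0 else 0.0
```

With roots included, two sentences that use different surface forms of the same root (for example transliterated `kataba` and `kitAb`, both from `ktb`) still contribute to the edge weight, which a purely word-based overlap would miss.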

An adapted version of Google's PageRank algorithm [4] is applied to the constructed graph to retrieve the most important (central) sentences, which then make up the document summary.
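The ranking step can be sketched as a weighted PageRank computed by power iteration over the sentence graph. This is a generic formulation, not necessarily the adaptation used in the thesis; the damping factor of 0.85, the fixed iteration count, and the function names are assumptions. The input is a square matrix in which `weights[i][j]` holds the similarity between sentences `i` and `j`.

```python
def rank_sentences(weights, damping=0.85, iterations=50):
    """Score the vertices of a weighted sentence graph by power iteration."""
    n = len(weights)
    if n == 0:
        return []
    # Total outgoing edge weight per vertex, ignoring self-loops.
    out_sum = [sum(w for j, w in enumerate(row) if j != i)
               for i, row in enumerate(weights)]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            # Each neighbour j passes on a share of its score proportional
            # to the weight of the edge j -> i.
            incoming = sum(weights[j][i] / out_sum[j] * scores[j]
                           for j in range(n)
                           if j != i and out_sum[j] > 0)
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores


def summarize(weights, k):
    """Return the indices of the k top-ranked sentences in document order."""
    scores = rank_sentences(weights)
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)
```

Returning the selected indices in document order keeps the summary readable, since the chosen sentences appear in the same sequence as in the source text.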

The developed AATS has been assessed in terms of precision and recall against the summaries produced by the Sakhr Text Summarizer [5] for 10 Arabic text documents.

The obtained results have shown that the developed AATS achieves a high precision of 90% and a recall of 65%.
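For reference, precision and recall against a reference summary can be computed as in the sketch below. It assumes the comparison is made at the sentence level, with each summary given as a collection of sentence indices; the record does not state the unit of comparison, and the function name is hypothetical.

```python
def precision_recall(system, reference):
    """Compare a system summary with a reference summary, sentence by sentence.

    system, reference: iterables of sentence indices (assumed unit of comparison).
    """
    system, reference = set(system), set(reference)
    hits = len(system & reference)
    precision = hits / len(system) if system else 0.0
    recall = hits / len(reference) if reference else 0.0
    return precision, recall
```

A high precision combined with a lower recall, as reported above, would correspond to a system summary whose sentences are mostly found in the reference summary while leaving out a share of the reference sentences.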

In addition, we conclude that the AATS algorithm achieves higher precision and recall when shared roots are given more weight than shared words.

Main Subjects

Languages & Comparative Literature
Information Technology and Computer Science
Arabic language and Literature

No. of Pages

55

Table of Contents

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One : Introduction.

Chapter Two : Novel approach to vocalized Arabic word root extraction.

Chapter Three : Graph-based technique for automatic Arabic text summarization.

Chapter Four : Conclusions and future works.

References.

American Psychological Association (APA)

Muhammadi, Munir. (2006). Novel approach to vocalized Arabic words root extraction : application to automatic Arabic text summarization. (Master's thesis). Al Akhawayn University, Morocco.
https://search.emarefa.net/detail/BIM-646788

Modern Language Association (MLA)

Muhammadi, Munir. Novel approach to vocalized Arabic words root extraction : application to automatic Arabic text summarization. (Master's thesis). Al Akhawayn University. (2006).
https://search.emarefa.net/detail/BIM-646788

American Medical Association (AMA)

Muhammadi, Munir. (2006). Novel approach to vocalized Arabic words root extraction : application to automatic Arabic text summarization. (Master's thesis). Al Akhawayn University, Morocco.
https://search.emarefa.net/detail/BIM-646788

Language

English

Data Type

Arab Theses

Record ID

BIM-646788