An automatic system for extracting diacritic rules for Arabic text based on statistical analysis

Dissertant

al-Qassas, Wail Wahid A.

Thesis advisor

Kanan, Ghassan

Committee Members

al-Shalabi, Riyad
al-Shaykh, Asim A. R.

University

Arab Academy for Financial and Banking Sciences

Faculty

The Faculty of Information Systems and Technology

Department

Computer Information Systems

University Country

Jordan

Degree

Ph.D.

Degree Date

2013

English Abstract

Diacritics are the marks for the short vowels in Arabic, and they are usually omitted from Modern Standard Arabic (MSA) text.

The lack of diacritic marks is one of the difficulties facing Arabic NLP research, since it affects both the meaning of words and the way they are pronounced.

This research generates a set of diacritic rules using statistical methods.

A group of statistical methods was applied to extract the relation between the diacritic of a letter and the pattern of its adjoining letters.

Previous research has worked on n-gram models at the word level in order to build Hidden Markov Models (HMMs).

In this research we work on an n-gram model at the letter level.
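The letter-level idea can be illustrated with a minimal sketch. It is not the thesis's algorithm: the window shape (the target letter plus the letters before it), the "#" padding symbol, the `min_confidence` threshold, and all function names are assumptions made for illustration. It only handles the eight common marks in the Unicode range U+064B to U+0652.

```python
from collections import Counter, defaultdict

# Common Arabic diacritic marks, fathatan (U+064B) through sukun (U+0652).
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}

def split_letters(word):
    """Return (letter, diacritics) pairs for one diacritized word."""
    pairs = []
    for ch in word:
        if ch in DIACRITICS and pairs:
            # Attach the mark to the letter it follows.
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
        elif ch not in DIACRITICS:
            pairs.append((ch, ""))
    return pairs

def count_contexts(words, n=4):
    """Count the diacritics observed for each n-letter window.

    Assumption: the window ends at the target letter and is padded
    with '#' at the start of the word.
    """
    counts = defaultdict(Counter)
    for word in words:
        pairs = split_letters(word)
        letters = "#" * (n - 1) + "".join(letter for letter, _ in pairs)
        for i, (_, marks) in enumerate(pairs):
            counts[letters[i:i + n]][marks] += 1
    return counts

def extract_rules(counts, min_confidence=0.9):
    """Keep a window as a rule when one diacritic clearly dominates it."""
    rules = {}
    for context, seen in counts.items():
        marks, freq = seen.most_common(1)[0]
        if freq / sum(seen.values()) >= min_confidence:
            rules[context] = marks
    return rules

if __name__ == "__main__":
    # "kataba", fully diacritized, written with escapes for portability.
    sample = ["\u0643\u064E\u062A\u064E\u0628\u064E"]
    print(extract_rules(count_contexts(sample, n=4)))
```

The resulting dictionary plays the role of a rules array: each key is a letter pattern and each value is the diacritic assigned to the last letter of that pattern.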

Al Quran Al Kareem was used as the pilot data for this research since it provides full diacritization.

A specially tailored light stemmer was used as a preprocessing stage to support and enhance the rule generation algorithms.

In addition, a low-level implementation of the main syntax (Arabic grammar) rules was applied as a post-processing stage to improve the coverage ratio.

Embedded systems need simple rules that generate diacritics for Arabic script with an acceptable error rate.

Diacritics in Arabic can be divided into two groups: the diacritic at the end of each word, which mainly depends on the syntax of the sentence, and the diacritic of each letter within the word, which is the main focus of this research.

The metrics used for evaluation are the memory allocation (the size of the rules array) and the accuracy achieved.

The 4-gram model achieved an accuracy rate of 98.4% with a coverage ratio of 83%. It was also found that applying Arabic syntax rules can raise the coverage ratio.
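The abstract does not define how accuracy and coverage are computed. One plausible reading, sketched below as an assumption, is that coverage is the share of letters for which some rule fires, and accuracy is the share of those covered letters that receive the correct diacritic. The `evaluate` function and its input format are hypothetical, and the rule count stands in for the size of the rules array.

```python
def evaluate(rules, samples):
    """Score a rule table against labeled letters.

    rules:   dict mapping a letter pattern to a predicted diacritic.
    samples: iterable of (pattern, true_diacritic) pairs.
    Returns (accuracy, coverage, rule_count) under the assumed definitions:
    coverage = covered / total, accuracy = correct / covered.
    """
    total = covered = correct = 0
    for context, truth in samples:
        total += 1
        if context in rules:
            covered += 1
            if rules[context] == truth:
                correct += 1
    coverage = covered / total if total else 0.0
    accuracy = correct / covered if covered else 0.0
    return accuracy, coverage, len(rules)
```

Under this reading, letters that no rule covers lower the coverage ratio without affecting accuracy, which is consistent with the abstract reporting the two figures separately.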

Main Subjects

Information Technology and Computer Science
Arabic language and Literature

No. of Pages

152

Table of Contents

Table of contents.

Abstract.

Chapter One: Diacritics concepts.

Chapter Two: Arabic Language.

Chapter Three: Literature review.

Chapter Four: Statistical Software model.

Chapter Five: Methods of rule extraction and evaluation.

Chapter Six: Final model.

References.

American Psychological Association (APA)

al-Qassas, Wail Wahid A. (2013). An automatic system for extracting diacritic rules for Arabic text based on statistical analysis (Doctoral dissertation). Arab Academy for Financial and Banking Sciences, Jordan.
https://search.emarefa.net/detail/BIM-404942

Modern Language Association (MLA)

al-Qassas, Wail Wahid A. An automatic system for extracting diacritic rules for Arabic text based on statistical analysis (Doctoral dissertation). Arab Academy for Financial and Banking Sciences. (2013).
https://search.emarefa.net/detail/BIM-404942

American Medical Association (AMA)

al-Qassas, Wail Wahid A. (2013). An automatic system for extracting diacritic rules for Arabic text based on statistical analysis (Doctoral dissertation). Arab Academy for Financial and Banking Sciences, Jordan.
https://search.emarefa.net/detail/BIM-404942

Language

English

Data Type

Arab Theses

Record ID

BIM-404942