An automatic system for extracting diacritic rules for Arabic text based on statistical analysis
Dissertant
Thesis advisor
Committee Members
al-Shalabi, Riyad
al-Shaykh, Asim A. R.
University
Arab Academy for Financial and Banking Sciences
Faculty
The Faculty of Information Systems and Technology
Department
Computer information systems
University Country
Jordan
Degree
Ph.D.
Degree Date
2013
English Abstract
Diacritics mark the short vowels in Arabic and usually do not appear in Modern Standard Arabic (MSA) text.
The lack of diacritic marks is one of the difficulties facing Arabic NLP research, since it affects both the meaning of words and the way they are pronounced.
This research generated a set of diacritic rules using statistical methods.
A group of statistical methods was applied to extract the relation between the diacritic of a letter and the pattern of adjoining letters.
Previous research has worked with n-gram models at the word level in order to build Hidden Markov Models (HMM).
In this research we work with an n-gram model at the letter level.
Al Quran Al Kareem was used as the pilot data for this research since it provides full diacritization.
A specially tailored light stemmer was used as a preprocessing stage to support and enhance the rule generation algorithms.
In addition, a low-level implementation of the main syntax (Arabic grammar) rules was applied as a post-processing stage to enhance the coverage ratio.
Embedded systems need simple rules that generate diacritics for Arabic script with an acceptable error rate.
Diacritics in Arabic can be divided into two groups: diacritics at the end of each word, which mainly depend on the syntax of the sentence, and the diacritics of each letter within the word, which are the main focus of this research.
The metrics used for evaluation are memory allocation (the size of the rules array) and the accuracy achieved.
The 4-gram model achieved an accuracy rate of 98.4% with a coverage ratio of 83%. It was also found that applying Arabic syntax rules can raise the coverage ratio.
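The letter-level approach described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis implementation. The window layout (one letter before the target and two after, for a 4-gram), the `min_conf` threshold, the function names, and the choice to measure accuracy only over letters covered by a rule are all assumptions made for illustration. The stemming and syntax stages are omitted, and only the marks U+064B to U+0652 are treated as diacritics.

```python
from collections import Counter, defaultdict

# Assumption: treat only U+064B..U+0652 (tanween, short vowels, shadda, sukun)
# as diacritics; Quranic text uses further marks that this sketch ignores.
DIACRITICS = {chr(c) for c in range(0x064B, 0x0653)}


def split_word(word):
    """Split a diacritized word into letters and the marks attached to each letter."""
    letters, marks = [], []
    for ch in word:
        if ch in DIACRITICS:
            if letters:              # attach the mark to the preceding letter
                marks[-1] += ch
        else:
            letters.append(ch)
            marks.append("")
    return letters, marks


def contexts(letters, n=4):
    """Return one n-letter window per letter (hypothetical layout, n >= 2).

    The target letter sits at index 1 of its window; '#' pads word boundaries.
    """
    padded = ["#"] + letters + ["#"] * (n - 2)
    return [tuple(padded[i:i + n]) for i in range(len(letters))]


def extract_rules(words, n=4, min_conf=0.9):
    """Keep a rule (window -> mark) when one mark dominates that window."""
    counts = defaultdict(Counter)
    for word in words:
        letters, marks = split_word(word)
        for ctx, mark in zip(contexts(letters, n), marks):
            counts[ctx][mark] += 1
    rules = {}
    for ctx, seen in counts.items():
        mark, freq = seen.most_common(1)[0]
        if freq / sum(seen.values()) >= min_conf:
            rules[ctx] = mark
    return rules


def evaluate(rules, words, n=4):
    """Return (accuracy over covered letters, coverage over all letters)."""
    total = covered = correct = 0
    for word in words:
        letters, marks = split_word(word)
        for ctx, mark in zip(contexts(letters, n), marks):
            total += 1
            if ctx in rules:
                covered += 1
                correct += rules[ctx] == mark
    coverage = covered / total if total else 0.0
    accuracy = correct / covered if covered else 0.0
    return accuracy, coverage
```

In this sketch, `len(rules)` stands in for the size of the rules array, and `evaluate(rules, held_out_words)` returns the two ratios that correspond to the accuracy and coverage figures reported in the abstract.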
Main Subjects
Information Technology and Computer Science
Arabic language and Literature
No. of Pages
152
Table of Contents
Table of contents.
Abstract.
Chapter One: Diacritics concepts.
Chapter Two: Arabic Language.
Chapter Three: Literature review.
Chapter Four: Statistical Software model.
Chapter Five: Methods of rule extraction and evaluation.
Chapter Six: Final model.
References.
American Psychological Association (APA)
al-Qassas, Wail Wahid A. (2013). An automatic system for extracting diacritic rules for Arabic text based on statistical analysis (Doctoral dissertation). Arab Academy for Financial and Banking Sciences, Jordan.
https://search.emarefa.net/detail/BIM-404942
Modern Language Association (MLA)
al-Qassas, Wail Wahid A. An automatic system for extracting diacritic rules for Arabic text based on statistical analysis. Doctoral dissertation. Arab Academy for Financial and Banking Sciences, 2013.
https://search.emarefa.net/detail/BIM-404942
American Medical Association (AMA)
al-Qassas, Wail Wahid A. (2013). An automatic system for extracting diacritic rules for Arabic text based on statistical analysis (Doctoral dissertation). Arab Academy for Financial and Banking Sciences, Jordan.
https://search.emarefa.net/detail/BIM-404942
Language
English
Data Type
Arab Theses
Record ID
BIM-404942