An automatic system for extracting diacritic rules for Arabic text based on statistical analysis

Dissertant

al-Qassas, Wail Wahid A.

Thesis advisor

Kanan, Ghassan

Committee members

al-Shalabi, Riyad
al-Shaykh, Asim A. R.

University

Arab Academy for Financial and Banking Sciences

Faculty

Faculty of Information Systems and Technology

Academic department

Department of Computer Information Systems

University country

Jordan

Degree

Ph.D.

Degree date

2013

English abstract

Diacritics mark the short vowels of Arabic and are usually omitted from Modern Standard Arabic (MSA) text.

The lack of diacritic marks is one of the difficulties facing Arabic NLP research, since it affects both the meaning of words and the way they are pronounced.

This research generated a set of diacritic rules using statistical methods.

A group of statistical methods was applied to extract the relation between the diacritic of a letter and the pattern of adjoining letters.

Previous research has worked with n-gram models at the word level in order to build Hidden Markov Models (HMMs).

In this research, we work with an n-gram model at the letter level.
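The abstract does not spell out how the letter-level rules are built, so the following is only a minimal sketch of one plausible reading, not the thesis's actual algorithm. It assumes that a rule maps a window of `n` undiacritized letters (starting at the target letter and padded with a hypothetical `#` boundary symbol) to the diacritic most often seen on that letter, and that a rule is kept only when that diacritic accounts for at least `min_share` of the observations. The names `split_letters`, `extract_rules`, `PAD`, and `min_share` are illustrative.

```python
import unicodedata
from collections import Counter, defaultdict

PAD = "#"  # hypothetical word-boundary symbol, not taken from the thesis


def split_letters(word):
    """Split a diacritized word into (letter, marks) pairs.

    Combining characters (the Arabic diacritics have a nonzero
    combining class) are attached to the preceding base letter.
    """
    pairs = []
    for ch in word:
        if unicodedata.combining(ch) and pairs:
            letter, marks = pairs[-1]
            pairs[-1] = (letter, marks + ch)
        else:
            pairs.append((ch, ""))
    return pairs


def extract_rules(words, n=4, min_share=0.9):
    """Count diacritics per n-letter context and keep the dominant ones.

    Assumption: the context of a letter is the n-letter window that
    starts at that letter, padded with PAD past the end of the word.
    """
    counts = defaultdict(Counter)
    for word in words:
        pairs = split_letters(word)
        letters = "".join(letter for letter, _ in pairs) + PAD * (n - 1)
        for i, (_, marks) in enumerate(pairs):
            counts[letters[i:i + n]][marks] += 1

    rules = {}
    for context, tally in counts.items():
        mark, hits = tally.most_common(1)[0]
        if hits / sum(tally.values()) >= min_share:
            rules[context] = mark
    return rules
```

In this sketch, raising `min_share` keeps fewer but more reliable rules, which trades coverage for accuracy. The size of the returned dictionary corresponds loosely to the "size of the rules array" that the abstract names as an evaluation metric.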

Al Quran Al Kareem was used as the pilot data for this research, since it is fully diacritized.

A specially tailored light stemmer was used as a preprocessing stage to support and enhance the rule generation algorithms.

In addition, a low-level implementation of the main syntax (Arabic grammar) rules was applied as a post-processing stage to improve the coverage ratio.

Embedded systems need simple rules that generate diacritics for Arabic script with an acceptable error rate.

Diacritics in Arabic can be divided into two groups: the diacritic at the end of each word, which mainly depends on the syntax of the sentence, and the diacritics of the letters within the word, which are the main focus of this research.

The metrics used for evaluation are memory allocation (the size of the rules array) and the accuracy achieved.

The 4-gram model achieved an accuracy rate of 98.4% with a coverage ratio of 83%. It was also found that applying Arabic syntax rules can raise the coverage ratio.
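The abstract reports accuracy and coverage without defining them. A small sketch of how the two figures could be computed, assuming that coverage is the share of letters for which some rule fires and that accuracy is measured only over those covered letters. Both definitions, and the function name `evaluate`, are assumptions made for illustration.

```python
def evaluate(predicted, gold):
    """Return (accuracy, coverage) for per-letter diacritic predictions.

    predicted: one entry per letter; None means no rule applied.
    gold:      the reference diacritic for each letter.
    Assumed definitions: coverage = covered letters / all letters,
    accuracy = correct predictions / covered letters.
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    covered = [(p, g) for p, g in zip(predicted, gold) if p is not None]
    coverage = len(covered) / len(gold) if gold else 0.0
    accuracy = (
        sum(p == g for p, g in covered) / len(covered) if covered else 0.0
    )
    return accuracy, coverage
```

Under these assumed definitions, an accuracy of 98.4% at 83% coverage would mean that roughly 0.984 × 0.83 ≈ 81.7% of all letters receive a correct diacritic from the rules, with the remaining uncovered letters left for other stages such as the syntax rules.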

Main subjects

Information Technology and Computer Science
Arabic Language and Literature

Topics

Number of pages

152

Table of contents

Table of contents.

Abstract.

Chapter One : Diacritics concepts.

Chapter Two : Arabic Language.

Chapter Three : Literature review.

Chapter Four : Statistical Software model.

Chapter Five : Methods of rule extraction and evaluation.

Chapter Six : Final model.

References.

American Psychological Association (APA) citation style

al-Qassas, Wail Wahid A. (2013). An automatic system for extracting diacritic rules for Arabic text based on statistical analysis (Doctoral dissertation). Arab Academy for Financial and Banking Sciences, Jordan.
https://search.emarefa.net/detail/BIM-404942

Modern Language Association (MLA) citation style

al-Qassas, Wail Wahid A. An automatic system for extracting diacritic rules for Arabic text based on statistical analysis. Doctoral dissertation. Arab Academy for Financial and Banking Sciences, 2013.
https://search.emarefa.net/detail/BIM-404942

American Medical Association (AMA) citation style

al-Qassas, Wail Wahid A. (2013). An automatic system for extracting diacritic rules for Arabic text based on statistical analysis (Doctoral dissertation). Arab Academy for Financial and Banking Sciences, Jordan.
https://search.emarefa.net/detail/BIM-404942

Text language

English

Data type

Theses

Record number

BIM-404942