Enhanced Arabic root-based lemmatizer

Other Title(s)

Enhanced lemmatizer for Arabic roots

Dissertant

Ata, Halah

Thesis advisor

al-Hammuz, Ahmad

University

Middle East University

Faculty

Faculty of Information Technology

Department

Computer Science Department

University Country

Jordan

Degree

Master

Degree Date

2020

Arabic Abstract

Any natural language processing application needs to reformulate words and convert them into meaningful data. Distinguishing between original words and their derivatives in Arabic is necessary but inherently complex, so the stemmer and the lemmatizer are among the most needed components in Arabic language processing applications.

The basic function of a stemmer and a lemmatizer is to reduce words and their derivatives to their source form, stripped of linguistic additions, and to establish a common base.

In this thesis, we propose a new lemmatizer based on the rules of Arabic morphology, with the aim of improving natural language processing applications for Arabic. It uses a new mechanism that applies well-defined rules to find the correct root of an Arabic word without a dictionary, relying on priorities defined within the proposed model.

Our proposed model, called the T’assel lemmatizer, is the first lemmatizer to exploit the most frequent extra letters in a word to determine which rules to apply first, according to the groups of most frequent extra letters.

The dataset used is a collection of proverbs in Standard Arabic containing 480 Arabic proverbs and 2,493 words, of which 1,637 are unique. Evaluation of the proposed model showed that the accuracy of the T’assel lemmatizer is 74.11%.

English Abstract

Generating meaningful information is a major task for any natural language processing application. Differentiating between original words and affixes in Arabic is important but inherently complex, so the stemmer and the lemmatizer are among the most needed components in Arabic language processing applications.

The fundamental function of stemming and lemmatizing is to reduce the morphological variants of a word to a common root or base.

In this thesis, we propose a new rule-based lemmatizer, which aims to enhance natural language processing applications for the Arabic language by implementing well-defined rules that find the word lemma without using a dictionary.

Our proposed model, called the “T’assel lemmatizer”, is the first lemmatizer that exploits the most frequent extra letters in the word, applying its rules according to priorities established for the groups of extra letters.
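The idea of applying dictionary-free rules in a priority order can be illustrated with a minimal Python sketch. The rule list, its ordering, the minimum base length, and the function name `strip_extra_letters` are all assumptions made for illustration; the abstract does not list the actual rules or letter groups used by the T’assel lemmatizer, and this sketch handles only a few prefixes and suffixes.

```python
from typing import List, Tuple

# Hypothetical rules in priority order: (kind, affix).
# These are illustrative only; they are NOT the thesis's actual rule set.
RULES: List[Tuple[str, str]] = [
    ("prefix", "ال"),  # definite article
    ("suffix", "ون"),  # masculine plural ending
    ("suffix", "ات"),  # feminine plural ending
    ("prefix", "م"),   # common derivational prefix
    ("suffix", "ة"),   # ta marbuta
]

# Assumption: never strip a word below three letters.
MIN_BASE_LENGTH = 3


def strip_extra_letters(word: str) -> str:
    """Apply each rule once, in priority order, without a dictionary."""
    base = word
    for kind, affix in RULES:
        # Skip a rule if applying it would leave too few letters.
        if len(base) - len(affix) < MIN_BASE_LENGTH:
            continue
        if kind == "prefix" and base.startswith(affix):
            base = base[len(affix):]
        elif kind == "suffix" and base.endswith(affix):
            base = base[:-len(affix)]
    return base


print(strip_extra_letters("المعلمون"))  # علم
print(strip_extra_letters("مدرسة"))     # درس
```

Because the rules fire in a fixed order, changing the priorities changes the result, which is why a principled ordering matters in a scheme like this. Extra letters inside a word (for example the alef in كاتب) are not handled by this sketch.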

The dataset is a set of 480 proverbs in Standard Arabic, consisting of 2,493 words, of which 1,637 are unique. The accuracy of the T’assel lemmatizer was 74.11%.
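The kind of bookkeeping behind such figures (total words, unique words, and accuracy as a percentage of correct outputs) can be sketched as follows. The helper names `count_words` and `accuracy`, the whitespace tokenization, and the toy data are assumptions for illustration; the abstract does not say how words were tokenized or whether accuracy was measured over all words or only unique ones.

```python
from typing import Callable, Iterable, List, Tuple


def count_words(proverbs: Iterable[str]) -> Tuple[int, int]:
    """Return (total words, unique words), assuming whitespace tokenization."""
    tokens: List[str] = [tok for proverb in proverbs for tok in proverb.split()]
    return len(tokens), len(set(tokens))


def accuracy(lemmatize: Callable[[str], str],
             gold_pairs: Iterable[Tuple[str, str]]) -> float:
    """Percentage of words whose predicted form matches the gold form."""
    pairs = list(gold_pairs)
    if not pairs:
        return 0.0
    correct = sum(1 for word, gold in pairs if lemmatize(word) == gold)
    return 100.0 * correct / len(pairs)


# Toy data, not taken from the thesis dataset.
toy_proverbs = ["العلم نور", "العلم في الصغر"]
print(count_words(toy_proverbs))  # (5, 4)

# A deliberately trivial stand-in lemmatizer that returns the word unchanged.
toy_gold = [("نور", "نور"), ("في", "في"), ("العلم", "علم"), ("صغر", "صغر")]
print(accuracy(lambda w: w, toy_gold))  # 75.0
```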

Main Subjects

Information Technology and Computer Science

Topics

Arabic language

No. of Pages

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One: Introduction.

Chapter Two: Background and literature review.

Chapter Three: Methodology and the proposed model.

Chapter Four: Experimental results and discussion.

Chapter Five: Conclusion and future work.

References.

American Psychological Association (APA)

Ata, Halah. (2020). Enhanced Arabic root-based lemmatizer (Master's thesis). Middle East University, Jordan.
https://search.emarefa.net/detail/BIM-970872

Modern Language Association (MLA)

Ata, Halah. Enhanced Arabic root-based lemmatizer. Master's thesis. Middle East University, 2020.
https://search.emarefa.net/detail/BIM-970872

American Medical Association (AMA)