Semantic word clustering from large Arabic text

Other Title(s)

العنقدة الدلالية لكلمات النص العربي الكبير

Dissertant

Abu Fayyaz, Tariq Isa Jibriyil

Thesis advisor

Barakah, Ribhi Sulayman

University

Islamic University

Faculty

Faculty of Information Technology

Department

Information Technology

University Country

Palestine (Gaza Strip)

Degree

Master

Degree Date

2018

Arabic Abstract

مع الزيادة السريعة في حجم النص على الويب حيث أصبحت البيانات النصية عالية الأبعاد ( الآلاف من آلاف الكلمات في كل مجال ( و تحمل معلومات دلالية.

هذة الزيادة تطلبت إلى تقنيات تجميع الكلمات التي يمكنها أن تجمع الكلمات إلي مجموعات ذات معني وعلي أساس تشابهها، و التي يمكن أستخدامها في العديد من مهام أسترجاع المعلومات في محركات البحث و خورازميات التصنيف و توسيع أستعلام البحث في هذة الرسالة نقترح أستخدام أداة أو نموذج "word2vec" لبناء المتجة التمثيلي لكلمات النص العربي الكبير و التي سوف تعطي معاني و مميزات دلالية في بناء مجموعات دلالية من كلمات النص العربي الكبير.

و هذا يتضمن المعالجة المسبقة للنص، بناء المتجة التمثيلي بأستخدام نموذج "word2vec "، بناء نموذج التصنيف و المجموعات الدلالية بإستخدام طريقة "Pipeline" و خورازمية التصنيف" Extra tree classifier" تم قمنا بأخذ النص الذي تم معالجتة و مصفوفة تردد المصطلحات لبناء مصنف المتجهات بإستخدام خورازمية التصنيف "Extra tree classifier" و أستخدامة في تصنيف و تنبؤ الكلمات إلي الفئات المحددة مسبقا.

قمنا بتطبيق نموذج التصنيف و إجراء تجارب عديدة بإستخدام النموذج، حيث أن النتائج أظهرت إلى فعالية النموذج لإنشاء مجموعات دلالية من النص العربي الكبير.

و تظهر نتائج التصنيف إلي أن السمات المستخرجة من كلمات المتجهات قد مكنت نموذج التصنيف من تحقيق دقة عالية و صحة بأكثر من %85.

كما أن النتائج تشير إلي أن نموذج التصنيف لا يخضع إلي حالة "under fitting" ( أى أن النموذج لايؤدي أداء ضعيفا علي بيانات التدريب )، و أيضا لا يخضع إلي حالة "over fitting" ( أي أن النموذج يؤدي أداء جيدا علي كل من بيانات التدريب و الإختبار ).

English Abstract

With the rapid increase of text volume on the web, textual data has become high-dimensional (thousands upon thousands of words in each domain) and carries semantic information.

This raises the need for word clustering techniques that group words into meaningful clusters based on their similarity. Such clusters can be used in various information retrieval tasks, such as question answering systems, search engines, classification algorithms and search query expansion.

In this thesis, we use the word2vec model to build vector representations of Arabic words, which bring extra semantic features that help in building clusters of semantically related words.

This involves text pre-processing, creating word vectors using the word2vec model, generating the classification model, and creating word clusters using the Pipeline method and the Extra Tree classifier.

We take the transformed text and the term frequency matrix, train an Extra Tree classifier on the resulting vectors, and then use it to classify words into the pre-defined categories.
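The steps described above can be sketched roughly as follows. This is a minimal illustration, not the thesis's implementation: the word vectors are synthetic stand-ins for word2vec output, the three categories are hypothetical, and scikit-learn's `Pipeline` and `ExtraTreesClassifier` (the ensemble variant) are assumed to be what "Pipeline method" and "Extra Tree classifier" refer to.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Assumption: synthetic 50-dimensional vectors stand in for word2vec
# embeddings. Each of the 3 hypothetical categories gets 60 "words"
# scattered around its own center.
n_per_class, dim = 60, 50
centers = rng.normal(scale=3.0, size=(3, dim))
X = np.vstack([c + rng.normal(size=(n_per_class, dim)) for c in centers])
y = np.repeat(np.arange(3), n_per_class)

# Hold out a quarter of the words for testing, keeping class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Pipeline: scale the vectors, then fit the tree ensemble on them.
model = Pipeline([
    ("scale", StandardScaler()),
    ("trees", ExtraTreesClassifier(n_estimators=100, random_state=0)),
])
model.fit(X_train, y_train)

# Mean accuracy on each split.
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

With real data, the rows of `X` would come from a trained word2vec model and `y` from the pre-defined categories. The synthetic clusters here are well separated on purpose, so the scores say nothing about the thesis's results.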

We have implemented the model and performed a set of experiments.

The experimental results show that the model is effective at creating word clusters from a large plain Arabic text.

The classification results show that the features extracted from the word vectors strengthened the classification model, which achieved accuracy, precision, recall and F-measure above 85%.

The results also indicate that the classification model is neither underfitting (i.e., it does not perform poorly on the training data) nor overfitting (i.e., it performs well on both the training data and the testing data).
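The four reported metrics and the underfitting/overfitting check can be computed as in the sketch below. It assumes macro-averaging across categories, which the abstract does not specify. The helper names and the thresholds in `fit_diagnosis` are hypothetical, chosen for illustration only.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F-measure."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f_measure": sum(f1s) / n,
    }


def fit_diagnosis(train_acc, test_acc, min_acc=0.85, max_gap=0.05):
    """Rough check; min_acc and max_gap are illustrative assumptions."""
    if train_acc < min_acc:
        return "underfitting"  # poor even on the training data
    if train_acc - test_acc > max_gap:
        return "overfitting"   # good on training, much worse on testing
    return "ok"


scores = macro_scores(["a", "a", "b", "b"], ["a", "b", "b", "b"])
print(scores)
print(fit_diagnosis(train_acc=0.99, test_acc=0.97))
```

A model passes the check only when training accuracy is high and testing accuracy stays close to it, which is the condition the abstract describes.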

Main Subjects

Information Technology and Computer Science

Topics

Wikipedia

No. of Pages

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One : Introduction.

Chapter Two : Theoretical and technical foundation.

Chapter Three : Related works.

Chapter Four : Clustering words semantically.

Chapter Five : Experimental results and evaluation.

Chapter Six : Conclusion and future work.

References.

American Psychological Association (APA)

Abu Fayyaz, Tariq Isa Jibriyil. (2018). Semantic word clustering from large Arabic text (Master's thesis). Islamic University, Palestine (Gaza Strip).
https://search.emarefa.net/detail/BIM-905917

Modern Language Association (MLA)

Abu Fayyaz, Tariq Isa Jibriyil. Semantic word clustering from large Arabic text. Master's thesis. Islamic University, 2018.
https://search.emarefa.net/detail/BIM-905917

American Medical Association (AMA)

Language

English

Data Type

Arab Theses

Record ID

BIM-905917
