Evaluating the effect of preprocessing in Arabic documents clustering
Other titles
تقييم تأثير المعالجة المسبقة في عنقدة المستندات العربية
Thesis author
Ghanim, Usamah Abd al-Fattah
Thesis supervisor
al-Hanjuri, Muhammad Ahmad Muhammad
Committee members
Abu Haybah, Ibrahim Sulayman
Zaqqut, Ihab Salah al-Din
University
Islamic University
Faculty
Faculty of Engineering
Academic department
Department of Computer Engineering
University country
Palestine (Gaza Strip)
Degree
Master's
Degree date
2014
English abstract
Clustering text documents is an important technique for document retrieval.
It aims to organize documents into meaningful groups or clusters.
Text preprocessing plays a major role in improving the clustering of Arabic documents.
This research examines and compares text preprocessing techniques in Arabic document clustering.
It studies the effectiveness of the following techniques: term pruning, term weighting (TF-IDF), morphological analysis (root-based stemming, light stemming, and raw text), and normalization.
The experiments also examined the effect of the clustering algorithm by comparing the most widely used partitional algorithm, K-means, with another partitional algorithm, Expectation Maximization (EM).
The Euclidean and Manhattan distance functions were also compared to determine which produces the best document clustering results.
Results were evaluated by measuring clustering quality under many combinations of preprocessing techniques.
Precision and recall, the most common basic measures in text mining evaluation, were used, together with the F-measure, which combines the two.
Experimental results show that document clustering quality can be improved by applying term weighting (TF-IDF) and term pruning with a small minimum term frequency.
For morphological analysis, light stemming was found to be more appropriate than root-based stemming or raw text.
Normalization also improved the clustering of Arabic documents.
Finally, K-means was found to be more efficient than the EM algorithm for document clustering, and the Euclidean distance function gave better results than the Manhattan distance function.
Keywords: Arabic Text Mining, Arabic document clustering, Arabic text preprocessing, Term weighting, Arabic morphological analysis (Arabic stemming / light stemming), Vector Space Model (VSM), TF-IDF, K-means, EM
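As a rough illustration of the steps named in the abstract, the following Python sketch shows one possible form of normalization, term pruning with TF-IDF weighting, the two distance functions, and the F-measure. It is not the thesis's implementation: the normalization rules, the `min_tf` pruning rule (dropping terms whose total corpus frequency falls below a threshold), the unsmoothed `log(N / df)` weighting, and all function names are assumptions chosen for illustration. The clustering step itself (K-means or EM) is left out.

```python
import math
import re
from collections import Counter

# Assumed normalization rules (common in Arabic text processing, not taken
# from the thesis): strip diacritics and tatweel, unify alef variants,
# map alef maqsura to ya and ta marbuta to ha.
DIACRITICS_AND_TATWEEL = re.compile("[\u064B-\u0652\u0640]")
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize(text):
    text = DIACRITICS_AND_TATWEEL.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)   # alef variants -> bare alef
    text = text.replace("\u0649", "\u064A")    # alef maqsura -> ya
    text = text.replace("\u0629", "\u0647")    # ta marbuta -> ha
    return text


def tfidf_vectors(docs, min_tf=1):
    """Build TF-IDF vectors from a list of token lists.

    Term pruning (assumed rule): keep only terms whose total frequency
    across the corpus is at least min_tf.
    """
    total_tf = Counter(tok for doc in docs for tok in doc)
    vocab = sorted(t for t, c in total_tf.items() if c >= min_tf)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    n_docs = len(docs)
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, vectors


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the balanced F-measure)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In this sketch, raising `min_tf` shrinks the vocabulary, which is the knob the abstract refers to as the minimum term frequency; the distance functions take any two vectors of equal length returned by `tfidf_vectors`.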
Number of pages
93
List of contents
Table of contents.
Abstract.
Abstract in Arabic.
Chapter One: Introduction.
Chapter Two: Related work.
Chapter Three: Background of document clustering.
Chapter Four: Methodology.
Chapter Five: Experimental results and analysis.
Chapter Six: Conclusion and future work.
References.
American Psychological Association (APA) citation style
Ghanim, Usamah Abd al-Fattah. (2014). Evaluating the effect of preprocessing in Arabic documents clustering. (Master's thesis). Islamic University, Palestine (Gaza Strip)
https://search.emarefa.net/detail/BIM-530232
Modern Language Association (MLA) citation style
Ghanim, Usamah Abd al-Fattah. Evaluating the effect of preprocessing in Arabic documents clustering. (Master's thesis). Islamic University. (2014).
https://search.emarefa.net/detail/BIM-530232
American Medical Association (AMA) citation style
Ghanim, Usamah Abd al-Fattah. (2014). Evaluating the effect of preprocessing in Arabic documents clustering. (Master's thesis). Islamic University, Palestine (Gaza Strip)
https://search.emarefa.net/detail/BIM-530232
Text language
English
Data type
Theses and dissertations
Record number
BIM-530232