Automatic clustering of Arabic documents

Other titles

التجميع التلقائي للوثائق العربية

University

Princess Sumaya University for Technology

Faculty

King Hussein Faculty of Computing Sciences

Academic department

Department of Computer Science

University country

Jordan

Degree

Master's

Degree date

2012

English abstract

Clustering Arabic documents is an important task for obtaining good results in traditional information retrieval systems, especially given the rapid growth in the number of online documents in Arabic.

Document clustering aims to automatically group similar documents into clusters, so that each document in a cluster is similar to the other documents in that cluster and dissimilar to documents in other clusters.
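As a rough illustration of the similarity notion behind this definition, the sketch below compares bag-of-words vectors by cosine similarity. The function names and the toy English documents are assumptions made for illustration; the abstract does not state which similarity measure the thesis uses.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count whitespace-separated, lower-cased tokens (hypothetical helper)."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Toy documents (illustrative only, not from the thesis data set).
sports_1 = bag_of_words("the team won the football match")
sports_2 = bag_of_words("the football team lost the match")
finance = bag_of_words("the bank raised interest rates")

same_topic = cosine_similarity(sports_1, sports_2)
other_topic = cosine_similarity(sports_1, finance)
```

Under this measure, a good clustering places `sports_1` and `sports_2` together, because `same_topic` is larger than `other_topic`.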

This study first discusses natural language processing and its applications, then examines current clustering algorithms and the problems of clustering Arabic documents: stemming versus rooting; synonyms and homonyms; antonyms when stop words are removed; word order and weighting; strange words and names; and sentence-level meaning.

The study then proposes solutions and new techniques to improve clustering quality and address these problems: representing each document by a combination of all its stems and all its roots at every level of analysis; using a dictionary to find all synonyms and possible meanings of every word in the document set, and to find a word's antonym when the word is preceded by a negation; using n-grams (unigrams, bigrams, and trigrams) instead of tokenization; and weighting words by term frequency-inverse document frequency (TF-IDF) rather than by raw frequency counts alone. The improvements are applied to a non-hierarchical approach.
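Two of the techniques named above, word n-grams and TF-IDF weighting, can be sketched as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the function names are hypothetical, tokens are split on whitespace, the IDF variant is the plain log(N / df), and the stemming, rooting, and dictionary steps are not covered.

```python
import math
from collections import Counter


def word_ngrams(tokens, n_max=3):
    """Return all word n-grams for n = 1..n_max, joined by spaces."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def tf_idf(docs, n_max=3):
    """Weight each n-gram by term frequency times inverse document frequency.

    Assumes the unsmoothed variant idf = log(N / df); other variants exist.
    """
    grams_per_doc = [Counter(word_ngrams(d.split(), n_max)) for d in docs]

    # Document frequency: in how many documents each n-gram occurs.
    doc_freq = Counter()
    for counts in grams_per_doc:
        doc_freq.update(counts.keys())

    n_docs = len(docs)
    weights = []
    for counts in grams_per_doc:
        total = sum(counts.values())
        weights.append({
            gram: (count / total) * math.log(n_docs / doc_freq[gram])
            for gram, count in counts.items()
        })
    return weights


# Toy documents (illustrative only).
weights = tf_idf(["arabic text clustering", "arabic text retrieval"])
```

In this toy input, n-grams shared by both documents (such as `"arabic text"`) receive weight zero, while n-grams unique to one document (such as `"text clustering"`) receive a positive weight, which is the effect raw frequency counts cannot provide.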

The study concludes that these problems can be solved fully or partially with the suggested solutions, and that clustering quality can be improved by up to 55% over traditional clustering (which relies on five-stage preprocessing only) when the suggested improvements and techniques are used.

Main subjects

Information Technology and Computer Science

Number of pages

129

Table of contents

Table of contents.

Abstract.

Chapter One: Introduction.

Chapter Two: Natural languages processing.

Chapter Three: Documents clustering.

Chapter Four: Arabic documents clustering.

Chapter Five: Analysis and results.

References.

American Psychological Association (APA) citation style

al-Saadi, Sami Abd al-Karim. (2012). Automatic clustering of Arabic documents. (Master's thesis). Princess Sumaya University for Technology, Jordan
https://search.emarefa.net/detail/BIM-305146

Modern Language Association (MLA) citation style

al-Saadi, Sami Abd al-Karim. Automatic clustering of Arabic documents. (Master's thesis). Princess Sumaya University for Technology. (2012).
https://search.emarefa.net/detail/BIM-305146

Text language

English

Record type

Theses and dissertations

Record number

BIM-305146
