An attempt for clustering Arabic text

Publication date

2007-01-31

Country of publication

Egypt

Number of pages

Main subjects

Information Technology and Computer Science

Topics

Abstract (EN)

Despite the importance and widespread use of the Arabic language, little work on Arabic has been done in the area of text mining.

This work aims to cluster similar Arabic Microsoft documents based on word occurrence.

The proposed system consists of four stages: keyword extraction, feature extraction, clustering, and finally performance measurement.

A stemmer is used to extract the most frequent terms along with their frequencies.

In the feature extraction stage, three features are used: TF, IDF, and TF_IDF.
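
The abstract does not give the exact formulas behind the three features. As a minimal sketch, assuming the common definitions (TF as a term's count divided by document length, IDF as log(N / document frequency), and TF_IDF as their product), they could be computed as follows. The function names and the transliterated tokens in `corpus` are hypothetical and stand in for stemmed Arabic terms.

```python
import math
from collections import Counter

def tf(doc_terms):
    """Term frequency: count of each term divided by the document length."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    return {term: count / total for term, count in counts.items()}

def idf(corpus):
    """Inverse document frequency: log(N / df), df = documents containing the term."""
    n_docs = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def tf_idf(doc_terms, idf_weights):
    """Product of TF and IDF for every term in one document."""
    return {term: weight * idf_weights[term] for term, weight in tf(doc_terms).items()}

# Hypothetical stemmed tokens (transliterated), not data from the paper.
corpus = [
    ["kitab", "ilm", "kitab"],
    ["ilm", "hasub"],
    ["hasub", "shabaka", "hasub"],
]

weights = idf(corpus)
print(tf(corpus[0]))
print(tf_idf(corpus[0], weights))
```

A term that appears in every document gets an IDF of zero under this definition, so its TF_IDF weight vanishes; other variants add smoothing to avoid that.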

In the clustering stage, three clustering algorithms are tried: Simple Similarity-based Clustering, Similarity-based Clustering, and Isodata.
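
The abstract does not describe how these algorithms work internally. Purely as an illustration of the general idea of distance-threshold clustering, and not as the paper's method, the sketch below assigns each document vector to the first cluster whose centroid lies within a threshold and otherwise starts a new cluster. The function name `threshold_cluster`, the `threshold` parameter, and the sample vectors are all assumptions.

```python
import math

def threshold_cluster(vectors, threshold, distance=math.dist):
    """Group vectors by distance to running cluster centroids.

    Each vector joins the first cluster whose centroid is within
    `threshold`; if none qualifies, it starts a new cluster.
    Returns a list of clusters, each a list of vector indices.
    """
    clusters = []   # indices of the members of each cluster
    centroids = []  # running mean vector of each cluster
    for i, vec in enumerate(vectors):
        for c, centroid in enumerate(centroids):
            if distance(vec, centroid) <= threshold:
                clusters[c].append(i)
                members = [vectors[j] for j in clusters[c]]
                # Recompute the centroid as the column-wise mean of the members.
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
                break
        else:
            # No existing cluster was close enough.
            clusters.append([i])
            centroids.append(list(vec))
    return clusters

# Hypothetical two-dimensional feature vectors.
vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
print(threshold_cluster(vectors, threshold=1.0))  # [[0, 1], [2, 3]]
```

The result of this kind of procedure depends on the order in which documents are visited and on the threshold, which is one reason iterative methods such as Isodata refine clusters over several passes.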

Each clustering algorithm implements two distances: the Euclidean distance and the city-block distance.
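
For reference, the two distances can be written in a few lines. This is a minimal sketch using the standard definitions; the function names and the sample vectors `doc_a` and `doc_b` are hypothetical and represent feature vectors of two documents.

```python
import math

def euclidean(a, b):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """City-block (Manhattan) distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

doc_a = [0.0, 3.0, 1.0]
doc_b = [4.0, 0.0, 1.0]

print(euclidean(doc_a, doc_b))   # 5.0
print(city_block(doc_a, doc_b))  # 7.0
```

The city-block distance is never smaller than the Euclidean distance for the same pair of vectors, so a threshold tuned for one does not carry over directly to the other.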

The three features are tested using the three models.

The system was built using thirty Arabic documents and was tested using twenty other documents.

The results of the eighteen experiments show that the Isodata algorithm gave the best performance almost all the time.
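
The abstract does not spell out how the eighteen experiments are composed. The count is consistent with crossing the three features, the three clustering algorithms, and the two distances, which is an inference rather than a statement from the paper. A short sketch of that grid:

```python
from itertools import product

features = ["TF", "IDF", "TF_IDF"]
algorithms = [
    "Simple Similarity-based Clustering",
    "Similarity-based Clustering",
    "Isodata",
]
distances = ["Euclidean", "City-block"]

# Every combination of feature, algorithm, and distance (assumed design).
experiments = list(product(features, algorithms, distances))

print(len(experiments))  # 18
for feature, algorithm, distance in experiments[:3]:
    print(feature, "|", algorithm, "|", distance)
```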

Isodata algorithms implementing ECD have the best performance.

Also, Isodata has the best generalization.

Finally, the best choice of features depends on the distance used.

American Psychological Association (APA) citation style

Hana, M. A. 2007. An attempt for clustering Arabic text. International Journal of Intelligent Computing and Information Sciences, Vol. 7, no. 1.
https://search.emarefa.net/detail/BIM-284988

Modern Language Association (MLA) citation style

Hana, M. A. An attempt for clustering Arabic text. International Journal of Intelligent Computing and Information Sciences Vol. 7, no. 1 (Jan. 2007).
https://search.emarefa.net/detail/BIM-284988

American Medical Association (AMA) citation style

Hana, M. A. An attempt for clustering Arabic text. International Journal of Intelligent Computing and Information Sciences. 2007. Vol. 7, no. 1.
https://search.emarefa.net/detail/BIM-284988

Data type

Articles

Text language

English

Notes

Includes bibliographical references.

Record number

BIM-284988
