An attempt for clustering Arabic text

Author

Hana, M. A.

Source

International Journal of Intelligent Computing and Information Sciences

Issue

Vol. 7, no. 1 (31 January 2007), 13 p.

Publisher

Ain Shams University, Faculty of Computer and Information Sciences

Publication date

2007-01-31

Country of publication

Egypt

Number of pages

13

Main subjects

Information Technology and Computer Science

Topics

Abstract (EN)

Despite the importance and wide use of the Arabic language, little work on Arabic has been done in the area of text mining.

This work aims to cluster similar Arabic Microsoft documents based on the words they contain.

The proposed system consists of four stages: keyword extraction, feature extraction, clustering, and performance measurement.

A stemmer is used to extract the most frequent terms along with their frequencies.

In the feature extraction stage, three features are used: term frequency (TF), inverse document frequency (IDF), and their product (TF_IDF).
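
As an illustration of the three features, the sketch below computes TF, IDF, and TF_IDF for a toy corpus. It assumes textbook definitions (raw term counts for TF and log(N / document frequency) for IDF); the abstract does not state which variants the paper uses, and the function name `tf_idf_features` and the sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf_features(docs):
    """Compute TF, IDF and TF_IDF for a list of tokenized (stemmed) documents.

    Assumed definitions (not confirmed by the abstract):
      TF     = raw count of a term in a document
      IDF    = log(N / number of documents containing the term)
      TF_IDF = TF * IDF
    """
    n_docs = len(docs)
    # One Counter per document: term -> raw count
    tf = [Counter(doc) for doc in docs]
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    tfidf = [
        {term: count * idf[term] for term, count in doc_tf.items()}
        for doc_tf in tf
    ]
    return tf, idf, tfidf


# Hypothetical stemmed tokens for three tiny documents
docs = [
    ["kitab", "qalam", "kitab"],
    ["kitab", "madrasa"],
    ["qalam", "madrasa", "madrasa"],
]
tf, idf, tfidf = tf_idf_features(docs)
```

Each document is then represented by one of the three weightings over a shared vocabulary, which is the vector the clustering stage compares.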

In the clustering stage, three clustering algorithms are tried: Simple Similarity-based Clustering, Similarity-based Clustering, and Isodata.
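
The abstract does not describe how the similarity-based algorithms work internally. To give a rough feel for the general idea of grouping documents by a similarity threshold, here is a generic single-pass "leader" clustering sketch. It is a stand-in chosen for illustration, not the paper's algorithm; the names `leader_clustering` and `manhattan`, the threshold, and the sample vectors are all assumptions.

```python
def leader_clustering(vectors, distance, threshold):
    """Single-pass threshold clustering (illustrative stand-in).

    Each vector joins the first cluster whose leader (its first member)
    lies within `threshold`; otherwise it starts a new cluster.
    Returns one cluster label per input vector.
    """
    leaders = []  # index of the first member of each cluster
    labels = []
    for i, vec in enumerate(vectors):
        for cluster_id, leader in enumerate(leaders):
            if distance(vec, vectors[leader]) <= threshold:
                labels.append(cluster_id)
                break
        else:
            # No existing leader is close enough: open a new cluster
            leaders.append(i)
            labels.append(len(leaders) - 1)
    return labels


def manhattan(a, b):
    """City-block distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical 2-D feature vectors
vectors = [[0, 0], [0, 1], [5, 5], [6, 5], [0, 2]]
labels = leader_clustering(vectors, manhattan, threshold=2)
```

With these sample vectors the first, second, and fifth points share one cluster and the third and fourth share another.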

Each clustering algorithm implements two distances: the Euclidean distance and the City-block distance.
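
For reference, the two distances can be written in a few lines. This is a minimal sketch using their standard definitions; the function names and the sample vectors are illustrative only.

```python
import math


def euclidean_distance(a, b):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block_distance(a, b):
    """Sum of absolute coordinate differences (also called Manhattan or L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Two hypothetical feature vectors
u = [1.0, 0.0, 2.0]
v = [4.0, 4.0, 2.0]

d_euclid = euclidean_distance(u, v)   # sqrt(9 + 16 + 0) = 5.0
d_city = city_block_distance(u, v)    # 3 + 4 + 0 = 7.0
```

The City-block distance is never smaller than the Euclidean distance for the same pair, so a threshold tuned for one does not carry over to the other unchanged.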

The three features are tested using the three models.
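
The eighteen experiments reported in the abstract are consistent with crossing three features, three algorithms, and two distances. The sketch below enumerates that grid; the list contents come from the abstract, while the assumption that the experiments form a full cross product is an inference.

```python
from itertools import product

features = ["TF", "IDF", "TF_IDF"]
algorithms = [
    "Simple Similarity-based Clustering",
    "Similarity-based Clustering",
    "Isodata",
]
distances = ["Euclidean", "City-block"]

# Assumed full cross product: 3 features x 3 algorithms x 2 distances
experiments = list(product(features, algorithms, distances))

for feature, algorithm, distance in experiments:
    print(f"{feature:7s} | {algorithm:35s} | {distance}")
```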

The system was built using thirty Arabic documents and was tested using twenty other documents.

The results of the eighteen experiments (three features, three algorithms, two distances) show that the Isodata algorithm performed best in almost all cases.

The Isodata variants implementing ECD have the best performance.

Also, Isodata has the best generalization.

Finally, the best choice of feature depends on the distance used.

American Psychological Association (APA) citation style

Hana, M. A. 2007. An attempt for clustering Arabic text. International Journal of Intelligent Computing and Information Sciences, Vol. 7, no. 1.
https://search.emarefa.net/detail/BIM-284988

Modern Language Association (MLA) citation style

Hana, M. A. An attempt for clustering Arabic text. International Journal of Intelligent Computing and Information Sciences Vol. 7, no. 1 (Jan. 2007).
https://search.emarefa.net/detail/BIM-284988

American Medical Association (AMA) citation style

Hana, M. A. An attempt for clustering Arabic text. International Journal of Intelligent Computing and Information Sciences. 2007. Vol. 7, no. 1.
https://search.emarefa.net/detail/BIM-284988

Data type

Articles

Text language

English

Notes

Includes bibliographical references.

Record number

BIM-284988