An attempt for clustering arabic text
Author
Hana, M. A.
Source
International Journal of Intelligent Computing and Information Sciences
Issue
Vol. 7, Issue 1 (31 Jan. 2007), 13 p.
Publisher
Ain Shams University Faculty of Computer and Information Sciences
Publication Date
2007-01-31
Country of Publication
Egypt
No. of Pages
13
Main Subjects
Information Technology and Computer Science
Topics
Abstract EN
Despite the importance and widespread use of the Arabic language, little work in the area of text mining has addressed it.
This work aims to cluster similar Arabic Microsoft documents based on the words they contain.
The proposed system consists of four stages: keyword extraction, feature extraction, clustering, and performance measurement.
A stemmer is used to extract the most frequent terms along with their frequencies.
In the feature extraction stage, three features are used: TF, IDF, and TF_IDF.
In the clustering stage, three clustering algorithms are tried: Simple Similarity-based Clustering, Similarity-based Clustering, and Isodata.
Each clustering algorithm implements two distances: the Euclidean distance and the city-block distance.
The three features are tested using the three models.
The system was built using thirty Arabic documents and was tested using twenty other documents.
The results of the eighteen experiments show that the Isodata algorithm gave the best performance almost all the time.
Isodata algorithms implementing ECD have the best performance.
Also, Isodata has the best generalization.
Finally, the best choice of features depends on the distance used.
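The abstract names the three features (TF, IDF, TF_IDF) and the two distances but does not give their formulas. The following is a minimal sketch under stated assumptions, not the author's implementation: TF is taken as the relative term frequency within a document, IDF as log(N / df), and documents are assumed to be non-empty lists of already stemmed tokens. The function names and the transliterated sample tokens are hypothetical.

```python
import math
from collections import Counter


def tf_idf_features(docs):
    """Return (vocab, tf, idf, tfidf) for a list of tokenized documents.

    Assumed definitions (the abstract does not specify them):
      tf    = term count / document length
      idf   = log(number of documents / documents containing the term)
      tfidf = tf * idf
    """
    vocab = sorted({term for doc in docs for term in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    idf = [math.log(n_docs / df[term]) for term in vocab]
    tf = []
    for doc in docs:
        counts = Counter(doc)
        tf.append([counts[term] / len(doc) for term in vocab])
    tfidf = [[f * w for f, w in zip(row, idf)] for row in tf]
    return vocab, tf, idf, tfidf


def euclidean(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    """City-block (L1, Manhattan) distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical stemmed tokens, transliterated for readability.
docs = [
    ["kitab", "qalam", "kitab"],
    ["qalam", "madrasa"],
    ["kitab", "madrasa", "madrasa"],
]
vocab, tf, idf, tfidf = tf_idf_features(docs)
d_euclid = euclidean(tfidf[0], tfidf[1])
d_city = city_block(tfidf[0], tfidf[1])
```

Any of the three feature matrices (`tf`, a per-term `idf` weighting, or `tfidf`) can be paired with either distance function, which is one way to read the abstract's remark that the best feature choice depends on the distance.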
American Psychological Association (APA)
Hana, M. A. 2007. An attempt for clustering arabic text. International Journal of Intelligent Computing and Information Sciences, Vol. 7, no. 1.
https://search.emarefa.net/detail/BIM-284988
Modern Language Association (MLA)
Hana, M. A. An attempt for clustering arabic text. International Journal of Intelligent Computing and Information Sciences Vol. 7, no. 1 (Jan. 2007).
https://search.emarefa.net/detail/BIM-284988
American Medical Association (AMA)
Hana, M. A. An attempt for clustering arabic text. International Journal of Intelligent Computing and Information Sciences. 2007. Vol. 7, no. 1.
https://search.emarefa.net/detail/BIM-284988
Data Type
Journal Articles
Language
English
Notes
Includes bibliographical references.
Record ID
BIM-284988