An attempt for clustering arabic text

Publication Date

2007-01-31

Country of Publication

Egypt

No. of Pages

Main Subjects

Information Technology and Computer Science

Topics

Abstract EN

Despite the importance and wide spread of the Arabic language, little work on Arabic has been done in the area of text mining.

This work aims to cluster similar Arabic Microsoft documents based on the words they contain.

The proposed system consists of four stages: keyword extraction, feature extraction, clustering, and performance measurement.

A stemmer is used to extract the most frequent terms along with their frequencies.

In the feature extraction stage, three features are used: TF, IDF, and TF_IDF.
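
As an illustration of the three features, the sketch below computes TF, IDF, and TF_IDF for a toy corpus. The abstract does not give the exact formulas, so raw counts for TF and log(N / df) for IDF are assumptions, and the function name `tf_idf_features` and the sample tokens are hypothetical.

```python
import math
from collections import Counter

def tf_idf_features(docs):
    """docs: list of token lists (e.g. stemmed terms). Returns (tf, idf, tfidf)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Assumed IDF variant: log(N / df); the paper may use a different one.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    # Assumed TF variant: raw term counts per document.
    tf = [Counter(doc) for doc in docs]
    # TF_IDF is the product of the two for every term in a document.
    tfidf = [{t: c * idf[t] for t, c in counts.items()} for counts in tf]
    return tf, idf, tfidf

# Hypothetical, already-stemmed toy documents.
docs = [["kitab", "ilm", "kitab"], ["ilm", "madrasa"], ["kitab", "madrasa", "madrasa"]]
tf, idf, tfidf = tf_idf_features(docs)
```

Each document then becomes a vector over the extracted terms, using whichever of the three weightings is being tested.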

In the clustering stage, three clustering algorithms are tried: Simple Similarity-based Clustering, Similarity-based Clustering, and Isodata.

Each clustering algorithm is implemented with two distances: the Euclidean distance and the City-block distance.
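
For reference, the two distances can be sketched as follows, using their standard definitions on equal-length feature vectors. The function names and sample vectors are illustrative and do not come from the paper.

```python
import math

def euclidean_distance(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block_distance(a, b):
    # Sum of absolute coordinate differences (also called Manhattan distance).
    return sum(abs(x - y) for x, y in zip(a, b))

u = [1.0, 0.0, 2.0]
v = [4.0, 4.0, 2.0]
print(euclidean_distance(u, v))   # 5.0
print(city_block_distance(u, v))  # 7.0
```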

The three features are tested using the three models.

The system was built using thirty Arabic documents and was tested using twenty other documents.

The results of the eighteen experiments show that the Isodata algorithm gave the best performance almost all the time.
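
The abstract does not break down the eighteen experiments. One reading consistent with that count is every combination of the three features, three algorithms, and two distances, which can be enumerated as in this sketch (the list names are illustrative):

```python
from itertools import product

features = ["TF", "IDF", "TF_IDF"]
algorithms = [
    "Simple Similarity-based Clustering",
    "Similarity-based Clustering",
    "Isodata",
]
distances = ["Euclidean", "City-block"]

# Assumed experiment grid: one run per (feature, algorithm, distance) triple.
experiments = list(product(features, algorithms, distances))
print(len(experiments))  # 18
```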

Isodata algorithms implementing ECD have the best performance.

Also, Isodata has the best generalization.

Finally, the best choice of features depends on the distance used.

American Psychological Association (APA)

Hana, M. A. 2007. An attempt for clustering arabic text. International Journal of Intelligent Computing and Information Sciences, Vol. 7, no. 1.
https://search.emarefa.net/detail/BIM-284988

Modern Language Association (MLA)

Hana, M. A. An attempt for clustering arabic text. International Journal of Intelligent Computing and Information Sciences Vol. 7, no. 1 (Jan. 2007).
https://search.emarefa.net/detail/BIM-284988

American Medical Association (AMA)

Hana, M. A. An attempt for clustering arabic text. International Journal of Intelligent Computing and Information Sciences. 2007. Vol. 7, no. 1.
https://search.emarefa.net/detail/BIM-284988

Data Type

Journal Articles

Language

English

Notes

Includes bibliographical references.

Record ID

BIM-284988
