Automatic clustering of Arabic documents

Other Title(s)

التنجميع التلقائي للوثائق العربية

University

Princess Sumaya University for Technology

Faculty

King Hussein Faculty for Computing Sciences

Department

Department of Computer Sciences

University Country

Jordan

Degree

Master

Degree Date

2012

English Abstract

Clustering Arabic documents is an important task for obtaining good results in traditional information retrieval systems, especially given the rapid growth in the number of online documents available in Arabic.

Document clustering aims to automatically group similar documents into clusters so that each document in a cluster is similar to the other documents in that cluster and dissimilar to documents in other clusters.
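As an illustration of what "similar" can mean here, the following minimal sketch computes cosine similarity between term-count vectors. The abstract does not state which similarity measure the thesis uses, so cosine similarity, the function name `cosine_similarity`, and the English sample sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical sample documents (assumed, not taken from the thesis).
doc_a = Counter("bank raises interest rates".split())
doc_b = Counter("bank cuts interest rates".split())
doc_c = Counter("team wins football match".split())

print(cosine_similarity(doc_a, doc_b))
print(cosine_similarity(doc_a, doc_c))
```

Under this measure the two banking sentences score higher against each other than against the sports sentence, so a clustering algorithm would tend to place them in the same cluster.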

This study discusses natural language processing and its applications, then examines current clustering algorithms and the problems of clustering Arabic documents, namely: stemming vs. rooting; synonyms and homonyms; antonyms when removing stop words; word order and weighting; strange words and names; and sentence-level meaning.

The study then proposes solutions and new techniques to improve clustering quality and address these problems. The proposed solutions are: using a combination of all stems and all roots to represent the document at all levels of analysis; implementing a dictionary to find all the synonyms and possible meanings of every word in the document set, and to find the antonym of a word when it is preceded by a negation; using N-grams (unigrams, bigrams and trigrams) instead of tokenization; and weighting words by term frequency-inverse document frequency (TF-IDF) instead of counting word frequency alone. The improvements are applied to a non-hierarchical approach.
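Two of these proposals, N-gram features and TF-IDF weighting, can be sketched as follows. This is a minimal illustration and not the thesis's implementation: the helper names `ngrams` and `tfidf`, the plain `tf * log(N / df)` weighting variant, and the English sample documents are assumptions, and the stemming, rooting and dictionary steps are omitted.

```python
import math
from collections import Counter

def ngrams(tokens, n_max=3):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    grams = []
    for n in range(1, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

def tfidf(docs, n_max=3):
    """Map each document (a list of tokens) to a {term: weight} dict."""
    term_counts = [Counter(ngrams(doc, n_max)) for doc in docs]
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter()
    for counts in term_counts:
        doc_freq.update(counts.keys())
    n_docs = len(docs)
    vectors = []
    for counts in term_counts:
        total = sum(counts.values())
        vectors.append({
            # Relative term frequency times inverse document frequency.
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors

# Hypothetical sample documents (assumed, not taken from the thesis).
docs = [
    "the economy grew this year".split(),
    "the economy shrank this year".split(),
    "the team won the match".split(),
]
vectors = tfidf(docs)
```

In this sketch a term that occurs in every document, such as "the", receives weight zero, while a trigram unique to one document, such as "the team won", is weighted more heavily than a bigram shared by two documents, which is the effect that raw frequency counting cannot provide.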

The study concludes that these problems can be solved fully or partially by the suggested solutions, and that clustering quality can be improved by up to 55% over traditional clustering (which relies on five-stage preprocessing only) when the suggested improvements and techniques are used.

Main Subjects

Information Technology and Computer Science

No. of Pages

129

Table of contents.

Abstract.

Chapter One : introduction.

Chapter Two : natural languages processing.

Chapter Three : documents clustering.

Chapter Four : Arabic documents clustering.

Chapter Five : analysis and results.

References.

American Psychological Association (APA)

al-Saadi, Sami Abd al-Karim. (2012). Automatic clustering of Arabic documents. (Master's thesis). Princess Sumaya University for Technology, Jordan
https://search.emarefa.net/detail/BIM-305146

Modern Language Association (MLA)

al-Saadi, Sami Abd al-Karim. Automatic clustering of Arabic documents. (Master's thesis). Princess Sumaya University for Technology. (2012).
https://search.emarefa.net/detail/BIM-305146

Language

English

Data Type

Arab Theses

Record ID

BIM-305146
