Automatic domain-relevant collocation extraction from Arabic corpus

Other Title(s)

استخراج المصطلحات العربية المرتبطة بمجال معين من مجموعة نصوص عربية آليا (Automatic extraction of domain-related Arabic terms from a collection of Arabic texts)

Publication Date

2014-12-31

Country of Publication

Palestine (Gaza Strip)

No. of Pages

15

Main Subjects

Information Technology and Computer Science

Topics

Data processing

Abstract AR

An automatic method is proposed for extracting domain-related Arabic terms from a collection of Arabic texts.

The method uses linguistic and statistical techniques to extract terms relevant to a specific domain and assign them to that domain.

To implement the proposed method, we used an Arabic corpus divided into ten domains.

The proposed method applies light linguistic processing (light stemming) to these documents and then extracts the candidate terms.

Each candidate term is then evaluated according to how widespread it is inside and outside the given domain and how strongly it is associated with that domain.

The candidate term is then assigned to the domain with the highest weight, which yields the domains term matrix.

To test the effectiveness of the method, this matrix was used to classify a number of documents or texts and determine their domains, which were known in advance. A classifier based on the domain term matrix was designed, and the results were excellent in most domains, with accuracy exceeding 90%.

Abstract EN

An approach for automatic domain-relevant collocation extraction from an Arabic text corpus is proposed.

It uses naïve linguistic and statistical methods to extract collocations and relates them to specific domains through a collocation ranking mechanism based on prevalence and tendency.

To realize the proposed approach, we use a corpus divided into ten domains.

The proposed approach starts by preprocessing this corpus and then extracts candidate collocations.

After that, it ranks the candidate collocations according to their distributional behavior within the domain and across the rest of the corpus.

Then we distribute the candidate collocations over the domains according to their rank values to obtain the domains' term matrix.
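The ranking and assignment steps can be illustrated with a minimal Python sketch. The abstract does not give the formulas, so everything below is an assumption made for illustration: candidate collocations are taken to be adjacent word pairs, "prevalence" is taken as the fraction of a domain's documents containing the pair, "tendency" as the share of the pair's total occurrences that fall in that domain, and the rank as their product. The function names and the toy English tokens are hypothetical; the actual work uses a light-stemmed Arabic corpus.

```python
from collections import Counter, defaultdict

def bigrams(tokens):
    """Adjacent word pairs, used here as stand-in candidate collocations."""
    return list(zip(tokens, tokens[1:]))

def rank_collocations(corpus):
    """corpus: dict mapping domain -> list of tokenized documents.

    Returns a dict mapping domain -> {collocation: score}, where each
    collocation appears only under its top-scoring domain.
    """
    doc_freq = defaultdict(Counter)  # domain -> pair -> documents containing it
    occ = defaultdict(Counter)       # domain -> pair -> total occurrences
    for domain, docs in corpus.items():
        for tokens in docs:
            pairs = bigrams(tokens)
            occ[domain].update(pairs)
            doc_freq[domain].update(set(pairs))

    total_occ = Counter()            # pair -> occurrences over the whole corpus
    for counts in occ.values():
        total_occ.update(counts)

    best = {}                        # pair -> (score, domain)
    for domain, docs in corpus.items():
        for pair, n in occ[domain].items():
            prevalence = doc_freq[domain][pair] / len(docs)  # assumed definition
            tendency = n / total_occ[pair]                   # assumed definition
            score = prevalence * tendency                    # assumed combination
            if pair not in best or score > best[pair][0]:
                best[pair] = (score, domain)

    matrix = {domain: {} for domain in corpus}
    for pair, (score, domain) in best.items():
        matrix[domain][pair] = score
    return matrix

# Toy corpus with two domains (illustrative data only).
corpus = {
    "sports": [["football", "match", "ended"], ["football", "match", "today"]],
    "economy": [["stock", "market", "fell"], ["football", "match", "revenue"]],
}
matrix = rank_collocations(corpus)
```

In this toy corpus the pair ("football", "match") occurs in both domains but is assigned to "sports", where it is both more prevalent and more concentrated.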

Finally, we evaluate the resulting collocation matrix by using it to classify a number of documents by domain.

The results are encouraging in most domains, with accuracy exceeding 90%.
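The evaluation step can be sketched in Python as follows. The abstract does not describe the classifier, so this is an assumed design: a document is assigned to the domain whose collocations, found among the document's adjacent word pairs, have the largest summed weight. The functions `classify` and `accuracy`, the weights, and the toy documents are all hypothetical.

```python
def classify(tokens, matrix):
    """Return the domain with the largest summed collocation weight, or None.

    matrix: dict mapping domain -> {(word1, word2): weight}.
    """
    pairs = list(zip(tokens, tokens[1:]))
    scores = {
        domain: sum(weights.get(pair, 0.0) for pair in pairs)
        for domain, weights in matrix.items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    # A document matching no known collocation is left unclassified.
    return best if scores[best] > 0 else None

def accuracy(labelled_docs, matrix):
    """Fraction of (tokens, label) pairs whose predicted domain equals the label."""
    correct = sum(classify(tokens, matrix) == label for tokens, label in labelled_docs)
    return correct / len(labelled_docs)

# Toy domain-term matrix and labelled documents (illustrative values only).
toy_matrix = {
    "sports": {("football", "match"): 0.67},
    "economy": {("stock", "market"): 0.5},
}
labelled_docs = [
    (["the", "football", "match", "ended"], "sports"),
    (["stock", "market", "fell"], "economy"),
    (["weather", "is", "nice"], "sports"),
]
toy_accuracy = accuracy(labelled_docs, toy_matrix)
```

The third toy document contains no known collocation, so it stays unclassified and counts as an error, giving two correct predictions out of three.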

American Psychological Association (APA)

Barakah, Ribhi S. & Fayyd, Manar S. 2014. Automatic domain-relevant collocation extraction from Arabic corpus. IUG Journal of Natural Studies, Vol. 22, no. 2, pp. 30-44.
https://search.emarefa.net/detail/BIM-382902

Modern Language Association (MLA)

Barakah, Ribhi S. & Fayyd, Manar S. Automatic domain-relevant collocation extraction from Arabic corpus. IUG Journal of Natural Studies Vol. 22, no. 2 (2014), pp. 30-44.
https://search.emarefa.net/detail/BIM-382902

American Medical Association (AMA)

Barakah, Ribhi S. & Fayyd, Manar S. Automatic domain-relevant collocation extraction from Arabic corpus. IUG Journal of Natural Studies. 2014. Vol. 22, no. 2, pp. 30-44.
https://search.emarefa.net/detail/BIM-382902

Data Type

Journal Articles

Language

English

Notes

Includes bibliographical references: p. 42-44

Record ID

BIM-382902
