Utilizing corpus statistics for Hindi word sense disambiguation

Co-authors

Singh, Satyendr
Siddiqui, Tanveer

Source

The International Arab Journal of Information Technology

Issue

Vol. 12, no. 6A(s) (31 December 2015), 9 p.

Publisher

Zarqa University

Publication Date

2015-12-31

Country of Publication

Jordan

Number of Pages

9

Main Subjects

Languages and Comparative Literature
Information Technology and Computer Science

Topics

Abstract EN

Word Sense Disambiguation (WSD) is the task of computationally assigning the correct sense to a polysemous word in a given context.

This paper compares three WSD algorithms for Hindi WSD based on corpus statistics.

The first algorithm, called corpus-based Lesk, uses sense definitions and a sense-tagged training corpus to learn the weights of Content Words (CWs).

These weights are used in the disambiguation process to assign a score to each sense.

We experimented with four metrics for computing the weight of matching words: Term Frequency (TF), Inverse Document Frequency (IDF), Term Frequency-Inverse Document Frequency (TF-IDF), and CW in a fixed window size.
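The TF-IDF variant of the weighting idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the pooling of each sense's definition and training contexts into one token list, and the toy English tokens are all assumptions made for illustration, and the paper's exact formula may differ.

```python
import math
from collections import Counter


def learn_tfidf_weights(sense_docs):
    """Learn a TF-IDF weight for every word seen with each sense.

    sense_docs (hypothetical layout): dict mapping a sense label to one
    pooled token list (sense definition plus sense-tagged contexts).
    """
    n_senses = len(sense_docs)
    doc_freq = Counter()
    for tokens in sense_docs.values():
        doc_freq.update(set(tokens))  # count each word once per sense
    weights = {}
    for sense, tokens in sense_docs.items():
        term_freq = Counter(tokens)
        weights[sense] = {
            word: term_freq[word] * math.log(n_senses / doc_freq[word])
            for word in term_freq
        }
    return weights


def disambiguate_weighted(context, weights):
    """Score each sense by summing the weights of matching context words."""
    context_words = set(context)
    scores = {
        sense: sum(word_weights.get(word, 0.0) for word in context_words)
        for sense, word_weights in weights.items()
    }
    best_sense = max(scores, key=scores.get)
    return best_sense, scores


# Toy data, assumed for illustration only.
toy_sense_docs = {
    "bank_river": ["river", "water", "shore", "water"],
    "bank_money": ["money", "loan", "account", "water"],
}
toy_weights = learn_tfidf_weights(toy_sense_docs)
best, scores = disambiguate_weighted(["river", "water", "boat"], toy_weights)
print(best, scores)
```

In this sketch a word that occurs with every sense (here "water") gets an IDF of zero, so it cannot influence the decision, while sense-specific words dominate the score.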

The second algorithm disambiguates using the conditional probability of words and phrases co-occurring with each sense of an ambiguous word.
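One possible reading of this co-occurrence approach is sketched below. The abstract does not give the estimator, so the choice of P(sense | word) estimated from sense-tagged contexts, the summation over context words, and all function names and data are assumptions; phrases are left out for brevity.

```python
from collections import Counter, defaultdict


def count_cooccurrences(tagged_contexts):
    """Count, for every word, how often it co-occurs with each sense.

    tagged_contexts (hypothetical layout): list of (sense, tokens) pairs.
    """
    pair_counts = defaultdict(Counter)  # word -> Counter of sense labels
    for sense, tokens in tagged_contexts:
        for word in set(tokens):
            pair_counts[word][sense] += 1
    return pair_counts


def score_by_conditional_probability(context, pair_counts, senses):
    """Sum the estimated P(sense | word) over the known context words."""
    scores = {sense: 0.0 for sense in senses}
    for word in set(context):
        if word not in pair_counts:
            continue  # unseen words carry no evidence
        total = sum(pair_counts[word].values())
        for sense in senses:
            scores[sense] += pair_counts[word][sense] / total
    return scores


# Toy training data, assumed for illustration only.
toy_training = [
    ("bank_river", ["water", "shore"]),
    ("bank_money", ["loan", "water"]),
    ("bank_river", ["water", "boat"]),
]
toy_counts = count_cooccurrences(toy_training)
cp_scores = score_by_conditional_probability(
    ["water", "boat"], toy_counts, ["bank_river", "bank_money"]
)
cp_best = max(cp_scores, key=cp_scores.get)
print(cp_best, cp_scores)
```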

The third algorithm is based on the classification information model.

The first method yields an overall maximum precision of 85.87% using the TF-IDF weighting scheme.

The WSD algorithm using word co-occurrence statistics results in an average precision of 68.73%.

The WSD algorithm using classification information model results in an average precision of 76.34%.

All three algorithms perform significantly better than the direct overlap method, which achieves an average precision of 47.87%.
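For comparison, a direct overlap baseline is commonly understood as counting the words shared between the context and each sense definition, with no learned weights. The sketch below follows that common reading; the function name and toy definitions are assumptions, and the paper's baseline may differ in preprocessing details.

```python
def direct_overlap(context, sense_definitions):
    """Pick the sense whose definition shares the most words with the context.

    sense_definitions (hypothetical layout): dict mapping a sense label to
    the token list of its dictionary definition.
    """
    context_words = set(context)
    overlap = {
        sense: len(context_words & set(definition))
        for sense, definition in sense_definitions.items()
    }
    return max(overlap, key=overlap.get), overlap


# Toy definitions, assumed for illustration only.
toy_definitions = {
    "bank_river": ["land", "beside", "river"],
    "bank_money": ["institution", "that", "lends", "money"],
}
overlap_best, overlap_scores = direct_overlap(
    ["boat", "on", "the", "river"], toy_definitions
)
print(overlap_best, overlap_scores)
```

Because every matching word counts equally, a single uninformative match can decide the outcome, which is one plausible reason a weighted scheme would outperform this baseline.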

American Psychological Association (APA) citation style

Singh, Satyendr & Siddiqui, Tanveer. 2015. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology, Vol. 12, no. 6A(s).
https://search.emarefa.net/detail/BIM-655060

Modern Language Association (MLA) citation style

Singh, Satyendr & Siddiqui, Tanveer. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology Vol. 12, no. 6A (Dec. 2015).
https://search.emarefa.net/detail/BIM-655060

American Medical Association (AMA) citation style

Singh, Satyendr & Siddiqui, Tanveer. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology. 2015. Vol. 12, no. 6A(s).
https://search.emarefa.net/detail/BIM-655060

Data Type

Articles

Text Language

English

Notes

Includes appendix.

Record ID

BIM-655060