Utilizing corpus statistics for Hindi word sense disambiguation

Publication date

2015-12-31

Country of publication

Jordan

Number of pages

Main subjects

Languages and Comparative Literature
Information Technology and Computer Science

Topics

Abstract (EN)

Word Sense Disambiguation (WSD) is the task of computationally assigning the correct sense to a polysemous word in a given context.

This paper compares three algorithms for Hindi WSD that are based on corpus statistics.

The first algorithm, called corpus-based Lesk, uses sense definitions and a sense-tagged training corpus to learn the weights of Content Words (CWs).

These weights are used in the disambiguation process to assign a score to each sense.

We experimented with four metrics for computing the weight of matching words: Term Frequency (TF), Inverse Document Frequency (IDF), Term Frequency-Inverse Document Frequency (TF-IDF), and CWs in a fixed window size.
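
As a rough illustration of the first algorithm, the Python sketch below learns TF-IDF weights for content words from a sense-tagged corpus and scores each sense by summing the weights of the context words it matches. It is not the authors' implementation: the function names, the English toy data, the choice to treat all training contexts of one sense as a single document, and the +1 smoothing of the IDF term are assumptions made for this example.

```python
import math
from collections import Counter


def learn_tfidf_weights(tagged_contexts):
    """Learn a TF-IDF weight for each (sense, content word) pair.

    tagged_contexts: list of (sense, [content words]) pairs.
    Assumption: all contexts tagged with one sense form one "document".
    """
    term_counts = {}
    for sense, words in tagged_contexts:
        term_counts.setdefault(sense, Counter()).update(words)

    # Document frequency: in how many senses does each word occur?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    n_senses = len(term_counts)
    weights = {}
    for sense, counts in term_counts.items():
        total = sum(counts.values())
        weights[sense] = {
            word: (count / total) * (math.log(n_senses / doc_freq[word]) + 1.0)
            for word, count in counts.items()
        }
    return weights


def lesk_disambiguate(context_words, weights):
    """Score each sense by the summed weight of matching context words."""
    scores = {
        sense: sum(word_weights.get(word, 0.0) for word in context_words)
        for sense, word_weights in weights.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy corpus (English, for readability only).
toy_training = [
    ("bank/finance", ["money", "deposit", "account", "loan"]),
    ("bank/finance", ["account", "interest", "money"]),
    ("bank/river", ["river", "water", "boat", "shore"]),
    ("bank/river", ["river", "fishing", "water"]),
]
toy_weights = learn_tfidf_weights(toy_training)
best_sense, sense_scores = lesk_disambiguate(["money", "deposit", "today"], toy_weights)
print(best_sense, sense_scores)
```

Swapping the weight expression for the raw term frequency or the IDF term alone would give sketches of the TF and IDF variants named in the abstract.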

The second algorithm disambiguates using the conditional probability of words and phrases co-occurring with each sense of an ambiguous word.
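
One possible reading of this idea is sketched below in Python: estimate P(sense | word) from a sense-tagged corpus and pick the sense with the highest summed probability over the context words. The abstract does not give the exact formula, so the estimator, the summation, the restriction to single words (no phrases), and all names and toy data here are assumptions rather than the paper's method.

```python
from collections import Counter, defaultdict


def count_cooccurrences(tagged_contexts):
    """Count, for every word, how many contexts of each sense contain it."""
    pair_counts = defaultdict(Counter)  # word -> Counter of senses
    for sense, words in tagged_contexts:
        for word in set(words):
            pair_counts[word][sense] += 1
    return pair_counts


def sense_given_word(pair_counts, word, sense):
    """Estimate P(sense | word); unseen words contribute 0."""
    counts = pair_counts.get(word)
    if not counts:
        return 0.0
    return counts[sense] / sum(counts.values())


def cooccurrence_disambiguate(context_words, senses, pair_counts):
    """Pick the sense with the largest summed conditional probability."""
    scores = {
        sense: sum(sense_given_word(pair_counts, w, sense) for w in context_words)
        for sense in senses
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy corpus (English, for readability only).
cooc_training = [
    ("bank/finance", ["money", "account"]),
    ("bank/finance", ["money", "water", "bill"]),
    ("bank/river", ["river", "water"]),
    ("bank/river", ["water", "boat"]),
]
cooc_counts = count_cooccurrences(cooc_training)
cooc_best, cooc_scores = cooccurrence_disambiguate(
    ["water", "boat"], ["bank/finance", "bank/river"], cooc_counts
)
print(cooc_best, cooc_scores)
```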

The third algorithm is based on the classification information model.

The first method yields an overall maximum precision of 85.87% using the TF-IDF weighting scheme.

The WSD algorithm using word co-occurrence statistics results in an average precision of 68.73%.

The WSD algorithm using the classification information model results in an average precision of 76.34%.

All three algorithms perform significantly better than the direct overlap method, which achieves an average precision of 47.87%.

American Psychological Association (APA) citation style

Singh, Satyendr & Siddiqui, Tanveer. 2015. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology, Vol. 12, no. 6A(s).
https://search.emarefa.net/detail/BIM-655060

Modern Language Association (MLA) citation style

Singh, Satyendr & Siddiqui, Tanveer. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology Vol. 12, no. 6A (Dec. 2015).
https://search.emarefa.net/detail/BIM-655060

American Medical Association (AMA) citation style

Singh, Satyendr & Siddiqui, Tanveer. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology. 2015. Vol. 12, no. 6A(s).
https://search.emarefa.net/detail/BIM-655060

Data type

Articles

Text language

English

Notes

Includes appendix.

Record number

BIM-655060
