Utilizing corpus statistics for Hindi word sense disambiguation
Joint Authors
Singh, Satyendr
Siddiqui, Tanveer
Source
The International Arab Journal of Information Technology
Issue
Vol. 12, Issue 6A(s) (31 Dec. 2015), 9 p.
Publisher
Publication Date
2015-12-31
Country of Publication
Jordan
No. of Pages
9
Main Subjects
Languages & Comparative Literature
Information Technology and Computer Science
Topics
Abstract EN
Word Sense Disambiguation (WSD) is the task of computationally assigning the correct sense to a polysemous word in a given context.
This paper compares three algorithms for Hindi WSD that are based on corpus statistics.
The first algorithm, called corpus-based Lesk, uses sense definitions and a sense-tagged training corpus to learn the weights of Content Words (CWs).
These weights are used in the disambiguation process to assign a score to each sense.
We experimented with four metrics for computing the weight of matching words: Term Frequency (TF), Inverse Document Frequency (IDF), Term Frequency-Inverse Document Frequency (TF-IDF), and CWs in a fixed-size window.
The second algorithm uses the conditional probability of words and phrases co-occurring with each sense of an ambiguous word to disambiguate it.
The third algorithm is based on the classification information model.
The first method yields an overall maximum precision of 85.87% using the TF-IDF weighting scheme.
The WSD algorithm using word co-occurrence statistics results in an average precision of 68.73%.
The WSD algorithm using the classification information model results in an average precision of 76.34%.
All three algorithms perform significantly better than the direct overlap method, for which we obtain an average precision of 47.87%.
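As an illustration of the weighted-overlap idea behind the first algorithm, the following is a minimal Python sketch, not the authors' implementation. It assumes that each sense is represented by one pooled bag of content words (definition plus sense-tagged training contexts), that TF is the raw count within that bag, and that IDF is smoothed as log(1 + N/df) over the N senses. The function names, the smoothing, and the English toy data are all hypothetical and chosen only for readability.

```python
import math
from collections import Counter


def learn_weights(sense_docs):
    """Learn a TF-IDF weight for every content word of every sense.

    sense_docs maps a sense label to a list of content-word tokens
    (assumption: definition and tagged training contexts pooled together).
    """
    n_senses = len(sense_docs)
    # Term frequency: raw count of each word within a sense's pooled bag.
    tf = {sense: Counter(tokens) for sense, tokens in sense_docs.items()}
    # Document frequency: number of senses whose bag contains the word.
    df = Counter()
    for counts in tf.values():
        df.update(counts.keys())
    # Smoothed IDF (assumption) keeps weights positive for shared words.
    return {
        sense: {w: c * math.log(1 + n_senses / df[w]) for w, c in counts.items()}
        for sense, counts in tf.items()
    }


def disambiguate(context, weights):
    """Score each sense by summing the weights of matching context words."""
    scores = {
        sense: sum(w_sense.get(tok, 0.0) for tok in set(context))
        for sense, w_sense in weights.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical toy data for two senses of one ambiguous word.
TOY_SENSE_DOCS = {
    "bank/finance": ["money", "deposit", "account", "loan", "money"],
    "bank/river": ["river", "water", "shore", "boat", "water"],
}

weights = learn_weights(TOY_SENSE_DOCS)
best_sense, sense_scores = disambiguate(["deposit", "money", "account"], weights)
print(best_sense, sense_scores)
```

Under these assumptions, replacing every learned weight with 1 reduces the score to a plain count of matching words, which is one way to picture a direct overlap baseline.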
American Psychological Association (APA)
Singh, Satyendr & Siddiqui, Tanveer. 2015. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology, Vol. 12, no. 6A(s).
https://search.emarefa.net/detail/BIM-655060
Modern Language Association (MLA)
Singh, Satyendr & Siddiqui, Tanveer. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology Vol. 12, no. 6A (Dec. 2015).
https://search.emarefa.net/detail/BIM-655060
American Medical Association (AMA)
Singh, Satyendr & Siddiqui, Tanveer. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology. 2015. Vol. 12, no. 6A(s).
https://search.emarefa.net/detail/BIM-655060
Data Type
Journal Articles
Language
English
Notes
Includes appendix.
Record ID
BIM-655060