Utilizing corpus statistics for Hindi word sense disambiguation

Publication Date

2015-12-31

Country of Publication

Jordan

Main Subjects

Languages & Comparative Literature
Information Technology and Computer Science

Abstract EN

Word Sense Disambiguation (WSD) is the task of computationally assigning the correct sense to a polysemous word in a given context.

This paper compares three corpus-statistics-based algorithms for Hindi WSD.

The first algorithm, called corpus-based Lesk, uses sense definitions and a sense-tagged training corpus to learn the weights of Content Words (CWs).

These weights are used in the disambiguation process to assign a score to each sense.

We experimented with four metrics for computing the weight of matching words: Term Frequency (TF), Inverse Document Frequency (IDF), Term Frequency-Inverse Document Frequency (TF-IDF), and CWs in a fixed window size.
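The weighting idea can be illustrated with a minimal sketch, assuming that each sense is treated as one "document" built from its definition and its sense-tagged training contexts. The function names, the toy English data, and the exact TF-IDF formula below are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter

def learn_tfidf_weights(sense_docs):
    """Learn a TF-IDF weight per (sense, content word).

    sense_docs maps a sense id to the content words seen with that sense
    (hypothetical input format; each sense is treated as one document).
    """
    n_senses = len(sense_docs)
    doc_freq = Counter()
    for words in sense_docs.values():
        doc_freq.update(set(words))  # count each word once per sense
    weights = {}
    for sense, words in sense_docs.items():
        tf = Counter(words)
        weights[sense] = {
            w: count * math.log(n_senses / doc_freq[w])
            for w, count in tf.items()
        }
    return weights

def score_senses(context_words, weights):
    """Score each sense by summing the weights of matching context words."""
    context = set(context_words)
    return {
        sense: sum(w_map.get(w, 0.0) for w in context)
        for sense, w_map in weights.items()
    }

def disambiguate(context_words, weights):
    """Return the sense with the highest score."""
    scores = score_senses(context_words, weights)
    return max(scores, key=scores.get)

# Toy data, invented for illustration only.
sense_docs = {
    "bank_finance": ["money", "loan", "account", "money", "deposit"],
    "bank_river": ["river", "water", "shore", "water", "boat"],
}
weights = learn_tfidf_weights(sense_docs)
chosen = disambiguate(["deposit", "money", "today"], weights)
```

Swapping the weight expression for the raw count or for the logarithm alone would give TF-only and IDF-only variants of the same sketch.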

The second algorithm disambiguates using the conditional probability of words and phrases co-occurring with each sense of an ambiguous word.
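One way to picture co-occurrence-based scoring is sketched below. It covers single words only (phrases are omitted), and it assumes add-one smoothing and a sum of log probabilities; both are illustrative choices, since the abstract does not give the exact formula. All names and data are hypothetical.

```python
import math
from collections import Counter

def train_cooccurrence(tagged_examples):
    """Count words co-occurring with each sense.

    tagged_examples is an iterable of (sense, context_words) pairs
    (hypothetical input format).
    """
    counts = {}
    vocab = set()
    for sense, words in tagged_examples:
        counts.setdefault(sense, Counter()).update(words)
        vocab.update(words)
    return counts, vocab

def log_prob(word, sense, counts, vocab):
    """Estimate log P(word | sense) with add-one smoothing (an assumption)."""
    total = sum(counts[sense].values())
    return math.log((counts[sense][word] + 1) / (total + len(vocab)))

def pick_sense(context_words, counts, vocab):
    """Choose the sense whose co-occurring words best explain the context."""
    scores = {
        sense: sum(
            log_prob(w, sense, counts, vocab)
            for w in context_words
            if w in vocab  # ignore words never seen in training
        )
        for sense in counts
    }
    return max(scores, key=scores.get)

# Toy sense-tagged examples, invented for illustration only.
examples = [
    ("bank_finance", ["money", "loan"]),
    ("bank_finance", ["deposit", "money"]),
    ("bank_river", ["river", "water"]),
    ("bank_river", ["boat", "water"]),
]
counts, vocab = train_cooccurrence(examples)
chosen = pick_sense(["money", "deposit"], counts, vocab)
```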

The third algorithm is based on the classification information model.

The first method yields an overall maximum precision of 85.87% using the TF-IDF weighting scheme.

The WSD algorithm using word co-occurrence statistics achieves an average precision of 68.73%.

The WSD algorithm using the classification information model achieves an average precision of 76.34%.

All three algorithms perform significantly better than the direct overlap method, which achieves an average precision of 47.87%.
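For comparison, a direct overlap baseline and a precision measure might be sketched as follows. This assumes an unweighted count of words shared between the context and each sense definition, and precision computed as the percentage of attempted instances that were tagged correctly; the names and toy data are hypothetical and the paper's exact setup may differ.

```python
def direct_overlap(context_words, sense_definitions):
    """Pick the sense whose definition shares the most words with the context.

    sense_definitions maps a sense id to the content words of its
    definition (hypothetical input format). Every match counts as 1.
    """
    context = set(context_words)
    overlaps = {
        sense: len(context & set(definition))
        for sense, definition in sense_definitions.items()
    }
    return max(overlaps, key=overlaps.get)

def precision(predicted, gold):
    """Percentage of attempted instances whose predicted sense is correct."""
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(predicted)

# Toy definitions and labels, invented for illustration only.
definitions = {
    "bank_finance": ["institution", "money", "deposit", "loan"],
    "bank_river": ["land", "edge", "river", "water"],
}
chosen = direct_overlap(["walked", "along", "river", "water"], definitions)
score = precision(
    ["bank_river", "bank_finance", "bank_river", "bank_river"],
    ["bank_river", "bank_finance", "bank_finance", "bank_river"],
)
```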

American Psychological Association (APA)

Singh, Satyendr & Siddiqui, Tanveer. 2015. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology, Vol. 12, no. 6A(s).
https://search.emarefa.net/detail/BIM-655060

Modern Language Association (MLA)

Singh, Satyendr & Siddiqui, Tanveer. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology, Vol. 12, no. 6A (Dec. 2015).
https://search.emarefa.net/detail/BIM-655060

American Medical Association (AMA)

Singh, Satyendr & Siddiqui, Tanveer. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology. 2015. Vol. 12, no. 6A(s).
https://search.emarefa.net/detail/BIM-655060

Data Type

Journal Articles

Language

English

Notes

Includes appendix.

Record ID

BIM-655060
