Utilizing corpus statistics for Hindi word sense disambiguation

Joint Authors

Singh, Satyendr
Siddiqui, Tanveer

Source

The International Arab Journal of Information Technology

Issue

Vol. 12, Issue 6A(s) (31 Dec. 2015), 9 p.

Publisher

Zarqa University

Publication Date

2015-12-31

Country of Publication

Jordan

No. of Pages

9

Main Subjects

Languages & Comparative Literature
Information Technology and Computer Science

Abstract EN

Word Sense Disambiguation (WSD) is the task of computationally assigning the correct sense to a polysemous word in a given context.

This paper compares three WSD algorithms for Hindi WSD based on corpus statistics.

The first algorithm, called corpus-based Lesk, uses sense definitions and a sense-tagged training corpus to learn the weights of Content Words (CWs).

These weights are used in the disambiguation process to assign a score to each sense.

We experimented with four metrics for computing the weight of matching words: Term Frequency (TF), Inverse Document Frequency (IDF), Term Frequency-Inverse Document Frequency (TF-IDF), and CW in a fixed window size.
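The TF-IDF variant of this weighting can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the smoothed IDF formula `log(1 + N/df)`, and the toy English tokens are all assumptions made for illustration. Each sense is treated as one "document" holding its definition words plus the content words of its sense-tagged training contexts.

```python
import math
from collections import Counter

# Hypothetical toy data: content words gathered per sense (illustrative only).
TOY_SENSE_DOCS = {
    "gold": ["metal", "yellow", "jewellery", "price", "metal"],
    "sleep": ["night", "bed", "rest", "night"],
}


def learn_tfidf_weights(sense_docs):
    """Return {sense: {word: weight}} from {sense: [content words]}."""
    n_senses = len(sense_docs)
    doc_freq = Counter()
    for words in sense_docs.values():
        doc_freq.update(set(words))  # count each word once per sense

    weights = {}
    for sense, words in sense_docs.items():
        term_freq = Counter(words)
        # Assumed smoothing: log(1 + N/df) keeps every weight positive.
        weights[sense] = {
            w: tf * math.log(1 + n_senses / doc_freq[w])
            for w, tf in term_freq.items()
        }
    return weights


def score_senses(weights, context_words):
    """Sum the learned weights of context words that match each sense."""
    return {
        sense: sum(w_map.get(w, 0.0) for w in context_words)
        for sense, w_map in weights.items()
    }


def best_sense(weights, context_words):
    """Pick the sense with the highest score."""
    scores = score_senses(weights, context_words)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    w = learn_tfidf_weights(TOY_SENSE_DOCS)
    print(best_sense(w, ["price", "of", "metal"]))
```

Swapping the weight expression for `tf` alone or the IDF term alone would give the TF and IDF variants under the same assumptions.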

The second algorithm disambiguates using the conditional probability of words and phrases co-occurring with each sense of an ambiguous word.
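One possible reading of this co-occurrence approach is sketched below. The abstract does not give the exact formula, so everything here is an assumption: P(sense | feature) is estimated as a relative frequency from sense-tagged contexts, adjacent-word bigrams stand in for "phrases", and the per-feature probabilities are simply summed. All names and data are hypothetical.

```python
from collections import Counter, defaultdict


def features(tokens):
    """Unigrams plus adjacent-word bigrams (an assumed stand-in for phrases)."""
    return list(tokens) + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def train_cooccurrence(tagged_contexts):
    """Count, for each feature, how often it co-occurs with each sense."""
    counts = defaultdict(Counter)
    for sense, tokens in tagged_contexts:
        for feat in set(features(tokens)):
            counts[feat][sense] += 1
    return dict(counts)


def pick_sense(counts, senses, tokens):
    """Score each sense by summing P(sense | feature) over context features."""
    scores = dict.fromkeys(senses, 0.0)
    for feat in features(tokens):
        by_sense = counts.get(feat)
        if not by_sense:
            continue  # feature never seen in training
        total = sum(by_sense.values())
        for sense in senses:
            scores[sense] += by_sense[sense] / total
    return max(scores, key=scores.get)


# Hypothetical sense-tagged training contexts (illustrative only).
TOY_TAGGED = [
    ("gold", ["gold", "price", "rose"]),
    ("gold", ["buy", "gold", "jewellery"]),
    ("sleep", ["go", "to", "bed"]),
    ("sleep", ["bed", "at", "night"]),
]

if __name__ == "__main__":
    model = train_cooccurrence(TOY_TAGGED)
    print(pick_sense(model, ["gold", "sleep"], ["bed", "at", "night", "now"]))
```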

The third algorithm is based on the classification information model.

The first method yields an overall maximum precision of 85.87% using the TF-IDF weighting scheme.

The WSD algorithm using word co-occurrence statistics results in an average precision of 68.73%.

The WSD algorithm using the classification information model results in an average precision of 76.34%.

All three algorithms perform significantly better than the direct overlap method, which achieves an average precision of 47.87%.
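For comparison, a direct overlap baseline in the style of simplified Lesk can be sketched as follows. This assumes the baseline counts unweighted words shared between the context and each sense definition; the function names and toy definitions are hypothetical, and the paper's exact procedure may differ.

```python
def overlap_score(definition_words, context_words):
    """Number of distinct words shared by a sense definition and the context."""
    return len(set(definition_words) & set(context_words))


def direct_overlap_sense(sense_definitions, context_words):
    """Choose the sense whose definition overlaps most with the context."""
    scores = {
        sense: overlap_score(words, context_words)
        for sense, words in sense_definitions.items()
    }
    return max(scores, key=scores.get)


# Hypothetical sense definitions (illustrative only).
TOY_DEFINITIONS = {
    "gold": ["yellow", "precious", "metal"],
    "sleep": ["rest", "at", "night"],
}

if __name__ == "__main__":
    print(direct_overlap_sense(TOY_DEFINITIONS, ["metal", "price"]))
```

Because every matching word counts equally, a frequent but uninformative match weighs as much as a discriminative one, which is the gap the corpus-learned weights are meant to close.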

American Psychological Association (APA)

Singh, Satyendr & Siddiqui, Tanveer. 2015. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology, Vol. 12, no. 6A(s).
https://search.emarefa.net/detail/BIM-655060

Modern Language Association (MLA)

Singh, Satyendr & Siddiqui, Tanveer. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology Vol. 12, no. 6A (Dec. 2015).
https://search.emarefa.net/detail/BIM-655060

American Medical Association (AMA)

Singh, Satyendr & Siddiqui, Tanveer. Utilizing corpus statistics for Hindi word sense disambiguation. The International Arab Journal of Information Technology. 2015. Vol. 12, no. 6A(s).
https://search.emarefa.net/detail/BIM-655060

Data Type

Journal Articles

Language

English

Notes

Includes appendix.

Record ID

BIM-655060