Comprehensive stemmer for morphologically rich Urdu language

Co-authors

Salim, Muhammad
Ali, Mubashir
Khalid, Shehzad

Source

The International Arab Journal of Information Technology

Issue

Vol. 16, no. 1 (31 January 2019), pp. 138-147, 10 p.

Publisher

Zarqa University

Publication Date

2019-01-31

Country of Publication

Jordan

Number of Pages

10

Main Subjects

Information Technology and Computer Science

Abstract EN

The Urdu language is used by approximately 200 million people for spoken and written communication.

A large volume of unstructured Urdu text is available worldwide.

We can employ data mining techniques to extract useful information from such a large potential information base.

Many text processing systems are available.

However, these systems are mostly language-specific, and a large proportion of them are applicable to English text.

This is primarily due to language-dependent pre-processing, mainly the stemming requirement.

Stemming is a vital pre-processing step in the text mining process, and its core aim is to reduce the many grammatical forms of a word (e.g., parts of speech, gender, tense, etc.) to their root form.

In this work, we have developed a comprehensive rule-based stemming method for Urdu text.

The proposed Urdu stemmer can generate the stem of Urdu words as well as loan words (words borrowed from other languages, i.e., Arabic, Persian, Turkish, etc.) by removing prefixes, infixes, and suffixes.

The proposed stemming technique introduces six novel Urdu infix word classes and a minimum word length rule.

To cope with the challenge of Urdu infix stemming, we have developed infix stripping rules for the introduced infix word classes and generic rules for prefix and suffix stemming.

The experimental results show the superiority of our proposed stemming approach over existing techniques.
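
The generic prefix and suffix stripping described in the abstract, guarded by a minimum word length rule, might look roughly like the following Python sketch. The affix lists, the threshold of 3, and the function name `strip_affixes` are illustrative assumptions and are not taken from the paper. Infix stripping is omitted because the six infix word classes are not specified in this record.

```python
# Minimal sketch of rule-based affix stripping with a minimum-length guard.
# The affixes below are a few common Urdu examples chosen for illustration;
# they are NOT the rule set published in the paper.
PREFIXES = ("بے", "نا", "بد")
SUFFIXES = ("وں", "یں", "ات")
MIN_STEM_LENGTH = 3  # assumed threshold, counted in Unicode code points


def strip_affixes(word, prefixes=PREFIXES, suffixes=SUFFIXES,
                  min_len=MIN_STEM_LENGTH):
    """Remove at most one prefix and one suffix from `word`.

    An affix is stripped only if the remaining string still has at
    least `min_len` characters (the minimum word length rule).
    """
    stem = word
    # Try longer affixes first so the longest match wins.
    for prefix in sorted(prefixes, key=len, reverse=True):
        if stem.startswith(prefix) and len(stem) - len(prefix) >= min_len:
            stem = stem[len(prefix):]
            break
    for suffix in sorted(suffixes, key=len, reverse=True):
        if stem.endswith(suffix) and len(stem) - len(suffix) >= min_len:
            stem = stem[:-len(suffix)]
            break
    return stem
```

For example, `strip_affixes("کتابوں")` removes the plural suffix and leaves `"کتاب"`, while `strip_affixes("نام")` is returned unchanged: it begins with the letters of the prefix `"نا"`, but stripping them would leave a single character, which the length guard rejects.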

American Psychological Association (APA) Citation Style

Ali, Mubashir & Khalid, Shehzad & Salim, Muhammad. 2019. Comprehensive stemmer for morphologically rich Urdu language. The International Arab Journal of Information Technology, Vol. 16, no. 1, pp. 138-147.
https://search.emarefa.net/detail/BIM-883575

Modern Language Association (MLA) Citation Style

Ali, Mubashir…[et al.]. Comprehensive stemmer for morphologically rich Urdu language. The International Arab Journal of Information Technology Vol. 16, no. 1 (Jan. 2019), pp. 138-147.
https://search.emarefa.net/detail/BIM-883575

American Medical Association (AMA) Citation Style

Ali, Mubashir & Khalid, Shehzad & Salim, Muhammad. Comprehensive stemmer for morphologically rich Urdu language. The International Arab Journal of Information Technology. 2019. Vol. 16, no. 1, pp. 138-147.
https://search.emarefa.net/detail/BIM-883575

Data Type

Articles

Text Language

English

Notes

Includes appendix: p. 147

Record ID

BIM-883575