Issues of dialectal Saudi Twitter corpus

Author

al-Ruwayli, Mushrif

Source

The International Arab Journal of Information Technology

Issue

Vol. 17, Issue 3 (31 May 2020), pp.367-374, 8 p.

Publisher

Zarqa University Deanship of Scientific Research

Publication Date

2020-05-31

Country of Publication

Jordan

No. of Pages

8

Main Subjects

Information Technology and Computer Science

Abstract EN

Text mining research relies heavily on the availability of a suitable corpus.

This paper presents a dialectal Saudi corpus containing 207,452 tweets generated by Saudi Twitter users.

In addition, the Saudi tweets dataset was compared with an Egyptian Twitter corpus and an Arabic top news raw corpus (representing Modern Standard Arabic (MSA)) in various aspects, such as the differences between formal and colloquial texts.

Moreover, issues and phenomena found in this type of dataset, such as shortening, concatenation, colloquial language, compounding, foreign language, spelling errors and neologisms, were investigated.

American Psychological Association (APA)

al-Ruwayli, Mushrif. 2020. Issues of dialectal Saudi Twitter corpus. The International Arab Journal of Information Technology, Vol. 17, no. 3, pp.367-374.
https://search.emarefa.net/detail/BIM-962349

Modern Language Association (MLA)

al-Ruwayli, Mushrif. Issues of dialectal Saudi Twitter corpus. The International Arab Journal of Information Technology Vol. 17, no. 3 (May 2020), pp.367-374.
https://search.emarefa.net/detail/BIM-962349

American Medical Association (AMA)

al-Ruwayli, Mushrif. Issues of dialectal Saudi Twitter corpus. The International Arab Journal of Information Technology. 2020. Vol. 17, no. 3, pp.367-374.
https://search.emarefa.net/detail/BIM-962349

Data Type

Journal Articles

Language

English

Notes

Includes bibliographical references : p. 373-374

Record ID

BIM-962349