Experimenting N-grams in text categorization

Joint Authors

Rahmun, Abd al-Latif
al-Berrichi, Zakariyya

Source

The International Arab Journal of Information Technology

Issue

Vol. 4, Issue 4 (31 Oct. 2007), pp.377-385, 9 p.

Publisher

Zarqa University

Publication Date

2007-10-31

Country of Publication

Jordan

No. of Pages

9

Main Subjects

Information Technology and Computer Science

Abstract EN

This paper deals with automatic supervised classification of documents.

The approach suggested is based on a vector representation of the documents centered not on the words but on the n-grams of characters for varying n.
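As an illustration of the character n-gram representation described above, the following is a minimal sketch and not the authors' implementation. The function name `char_ngrams`, the lowercasing and whitespace normalization, and the default values of n are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(text, n_values=(2, 3, 4)):
    """Count overlapping character n-grams of `text` for each n in `n_values`.

    Assumption: the text is lowercased and runs of whitespace are collapsed
    to a single space before n-grams are extracted.
    """
    normalized = " ".join(text.lower().split())
    counts = Counter()
    for n in n_values:
        # Slide a window of width n over the normalized string.
        for i in range(len(normalized) - n + 1):
            counts[normalized[i:i + n]] += 1
    return counts


# Hypothetical usage: a trigram profile of a short string.
profile = char_ngrams("Text categorization", n_values=(3,))
```

Each document then becomes a sparse vector indexed by n-gram rather than by word, which is the contrast the abstract draws with word-based representations.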

The effects of this method are examined in several experiments that use the multivariate chi-square to reduce dimensionality, the cosine and Kullback-Leibler distances to compare documents, and two benchmark corpora for evaluation: the Reuters-21578 newswire articles and the 20 Newsgroups data.
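The two distance measures named in the abstract can be sketched as follows for sparse n-gram count vectors stored as dictionaries. This is an illustrative sketch only: the function names, the additive smoothing constant `epsilon`, and the choice to renormalize over the union of keys are assumptions, since the abstract does not say how zero counts are handled.

```python
import math


def cosine_distance(p, q):
    """Return 1 - cosine similarity of two sparse count vectors (dicts)."""
    dot = sum(value * q.get(key, 0.0) for key, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 1.0  # Assumption: an empty vector is maximally distant.
    return 1.0 - dot / (norm_p * norm_q)


def kl_divergence(p, q, epsilon=1e-6):
    """Kullback-Leibler divergence D(P || Q) of two sparse count vectors.

    Assumption: counts are smoothed by adding `epsilon` to every key in the
    union of both vectors, then normalized to probabilities.
    """
    keys = set(p) | set(q)
    p_total = sum(p.get(k, 0.0) + epsilon for k in keys)
    q_total = sum(q.get(k, 0.0) + epsilon for k in keys)
    divergence = 0.0
    for k in keys:
        pk = (p.get(k, 0.0) + epsilon) / p_total
        qk = (q.get(k, 0.0) + epsilon) / q_total
        divergence += pk * math.log(pk / qk)
    return divergence
```

Note that the Kullback-Leibler divergence is not symmetric, so the order of the document and category profiles matters when it is used as a distance.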

The evaluation used the macro-averaged F1 measure.
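For reference, macro-averaged F1 computes F1 separately for each category and then takes the unweighted mean, so small categories count as much as large ones. The sketch below assumes single-label classification and uses a hypothetical function name `macro_f1`; it is not taken from the paper.

```python
def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-category F1 scores (single-label assumption)."""
    categories = sorted(set(true_labels) | set(predicted_labels))
    pairs = list(zip(true_labels, predicted_labels))
    scores = []
    for category in categories:
        tp = sum(1 for t, p in pairs if t == category and p == category)
        fp = sum(1 for t, p in pairs if t != category and p == category)
        fn = sum(1 for t, p in pairs if t == category and p != category)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        scores.append(2 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores) if scores else 0.0
```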

The results show the effectiveness of this approach compared to the bag-of-words and stem representations.

American Psychological Association (APA)

Rahmun, Abd al-Latif & al-Berrichi, Zakariyya. 2007. Experimenting N-grams in text categorization. The International Arab Journal of Information Technology, Vol. 4, no. 4, pp. 377-385.
https://search.emarefa.net/detail/BIM-11745

Modern Language Association (MLA)

Rahmun, Abd al-Latif & al-Berrichi, Zakariyya. Experimenting N-grams in text categorization. The International Arab Journal of Information Technology Vol. 4, no. 4 (Oct. 2007), pp. 377-385.
https://search.emarefa.net/detail/BIM-11745

American Medical Association (AMA)

Rahmun, Abd al-Latif & al-Berrichi, Zakariyya. Experimenting N-grams in text categorization. The International Arab Journal of Information Technology. 2007. Vol. 4, no. 4, pp. 377-385.
https://search.emarefa.net/detail/BIM-11745

Data Type

Journal Articles

Language

English

Notes

Includes bibliographical references: p. 384

Record ID

BIM-11745