n-Gram-Based Text Compression

Joint Authors

Snášel, Václav
Nguyen, Vu H.
Nguyen, Hien T.
Duong, Hieu N.

Source

Computational Intelligence and Neuroscience

Issue

Vol. 2016, no. 2016 (31 December 2015), pp. 1-11, 11 p.

Publisher

Hindawi Publishing Corporation

Publication Date

2016-11-14

Country of Publication

Egypt

Number of Pages

11

Main Subjects

Biology

Abstract (EN)

We propose an efficient method for compressing Vietnamese text using n-gram dictionaries.

On the same dataset, it achieves a markedly higher compression ratio than state-of-the-art methods.

Given a text, the proposed method first splits it into n-grams and then encodes them using n-gram dictionaries.

In the encoding phase, we use a sliding window whose size ranges from bigrams to five-grams to obtain the best encoding stream.

Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary.
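
The sliding-window encoding described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a greedy longest-match strategy (the abstract does not say how the "best" stream is chosen), it falls back to unigrams when no longer n-gram matches, and it emits (n, index) pairs instead of the packed two-to-four-byte codes. The names `encode_greedy` and `TOY_DICTS`, and the dictionary contents, are hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical toy dictionaries: n -> {n-gram tuple: index in that dictionary}.
Dictionaries = Dict[int, Dict[Tuple[str, ...], int]]

TOY_DICTS: Dictionaries = {
    1: {("ha",): 0, ("noi",): 1, ("viet",): 2, ("nam",): 3},
    2: {("ha", "noi"): 0, ("viet", "nam"): 1},
}


def encode_greedy(tokens: List[str], dicts: Dictionaries,
                  max_n: int = 5) -> List[Tuple[int, int]]:
    """Encode tokens as (n, index) pairs, trying the longest window first."""
    codes: List[Tuple[int, int]] = []
    i = 0
    while i < len(tokens):
        # Shrink the window from max_n (or what is left) down to one token.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            gram = tuple(tokens[i:i + n])
            index = dicts.get(n, {}).get(gram)
            if index is not None:
                codes.append((n, index))
                i += n  # slide past the matched n-gram
                break
        else:
            # Assumption: a real encoder needs an escape for unknown tokens;
            # the abstract does not describe one, so this sketch just fails.
            raise KeyError(f"token not in any dictionary: {tokens[i]!r}")
    return codes
```

With these toy dictionaries, `encode_greedy(["ha", "noi", "viet", "nam"], TOY_DICTS)` yields two bigram codes rather than four unigram codes, which is the effect the larger windows are meant to achieve.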

We collected a 2.5 GB text corpus from several Vietnamese news agencies to build n-gram dictionaries from unigrams to five-grams, obtaining dictionaries with a total size of 12 GB.

To evaluate our method, we collected a test set of 10 text files of different sizes.

The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
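
To make the reported figure concrete, the sketch below computes a compression ratio under the assumption that the term means the percentage of space saved, (1 - compressed size / original size) × 100. The abstract does not state the formula, so this definition and the function name `compression_ratio` are assumptions.

```python
def compression_ratio(original_size: int, compressed_size: int) -> float:
    """Percentage of space saved, assuming ratio = (1 - compressed/original) * 100."""
    if original_size <= 0:
        raise ValueError("original_size must be positive")
    return (1 - compressed_size / original_size) * 100
```

Under this definition, a hypothetical 1,000-byte file reduced to 100 bytes would have a compression ratio of about 90%.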

American Psychological Association (APA) Citation Style

Nguyen, Vu H. & Nguyen, Hien T. & Duong, Hieu N. & Snášel, Václav. 2016. n-Gram-Based Text Compression. Computational Intelligence and Neuroscience, Vol. 2016, no. 2016, pp. 1-11.
https://search.emarefa.net/detail/BIM-1099822

Modern Language Association (MLA) Citation Style

Nguyen, Vu H.…[et al.]. n-Gram-Based Text Compression. Computational Intelligence and Neuroscience Vol. 2016, no. 2016 (2015), pp. 1-11.
https://search.emarefa.net/detail/BIM-1099822

American Medical Association (AMA) Citation Style

Nguyen, Vu H. & Nguyen, Hien T. & Duong, Hieu N. & Snášel, Václav. n-Gram-Based Text Compression. Computational Intelligence and Neuroscience. 2016. Vol. 2016, no. 2016, pp. 1-11.
https://search.emarefa.net/detail/BIM-1099822

Data Type

Articles

Text Language

English

Notes

Includes bibliographical references.

Record ID

BIM-1099822