Unification of multiple treebanks and testing them with statistical parser with support of large corpus as a lexical resource

Author

Ulaywi, Ahmad Husayn

Source

Engineering and Technology Journal

Issue

Vol. 34, Issue 5B (31 May. 2016), pp.711-720, 10 p.

Publisher

University of Technology

Publication Date

2016-05-31

Country of Publication

Iraq

No. of Pages

10

Main Subjects

Information Technology and Computer Science

Abstract EN

Many Treebanks (texts annotated with parse trees) are available to researchers in the field of Natural Language Processing (NLP).

All of these Treebanks are limited in size, because constructing them is time-consuming and requires experts in linguistics, and each one uses its own Context-Free Grammar (CFG) production rules (its own formalism).

These Treebanks can be used for statistical parsing, for machine translation tests, and in other NLP applications.

In this paper, we propose to build a large Treebank from multiple Treebanks for the same language.

Also, we propose to use an annotated corpus as a lexical resource.

Three English Treebanks are used in our study: the Penn Treebank (PTB), the GENIA Treebank (GTB), and the British National Corpus (BNC).

The Brown corpus, which contains approximately one million tokens, each annotated with a part-of-speech (POS) tag, is used as the lexical resource.

Our work starts with the unification of the POS tagsets of the three Treebanks; the Brown corpus tagset is then mapped to the unified tagset.

This is done manually, based on our experience in this field.

All the non-terminals in the CFG productions are also unified. The three Treebanks and the Brown corpus are then rebuilt according to these modifications.
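As a rough illustration of the tagset unification and mapping described above, the following Python sketch rewrites tagged tokens from different sources into one shared tagset. The tables are hypothetical and cover only a handful of tags: the source tags are real PTB, BNC (C5), and Brown tags, but the target labels are placeholder PTB-style tags standing in for the unified tagset, which this record does not list. The names `TAG_MAPS` and `unify_tags` are assumptions made for the example, and GTB is omitted for brevity.

```python
# Illustrative only: these tables are NOT the paper's actual mapping.
# Target labels are placeholder PTB-style tags standing in for the unified tagset.
TAG_MAPS = {
    "ptb":   {"DT": "DT", "NN": "NN", "NNS": "NNS", "NNP": "NNP", "JJ": "JJ"},
    "bnc":   {"AT0": "DT", "NN1": "NN", "NN2": "NNS", "NP0": "NNP", "AJ0": "JJ"},
    "brown": {"AT": "DT", "NN": "NN", "NNS": "NNS", "NP": "NNP", "JJ": "JJ"},
}


def unify_tags(tagged_tokens, source):
    """Rewrite (word, tag) pairs from `source` into the shared tagset."""
    table = TAG_MAPS[source]
    unified = []
    for word, tag in tagged_tokens:
        if tag not in table:
            # A manual mapping must cover every source tag; fail loudly if not.
            raise KeyError(f"no unified tag for {source} tag {tag!r}")
        unified.append((word, table[tag]))
    return unified


# The same phrase tagged under two different source tagsets.
brown_sentence = [("the", "AT"), ("old", "JJ"), ("dogs", "NNS")]
bnc_sentence = [("the", "AT0"), ("old", "AJ0"), ("dogs", "NN2")]

unified_brown = unify_tags(brown_sentence, "brown")
unified_bnc = unify_tags(bnc_sentence, "bnc")
```

After mapping, both sentences carry identical tags, which is what allows counts from different resources to be pooled. The same table-driven approach would apply to the non-terminal labels in the CFG productions.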

We test the proposed unification in four ways: (i) statistical parsing of each Treebank alone without modification, (ii) statistical parsing of each Treebank alone after modification, (iii) statistical parsing of the three combined Treebanks after modification, without the support of the lexical resource, and (iv) statistical parsing of the three combined Treebanks after modification, with the support of the lexical resource.

Unknown words are handled using a very simple proposed method.

Our work shows that (a) multiple Treebanks can be unified, and that doing so increases parsing accuracy.

(b) A large annotated corpus such as the Brown corpus can be used to (i) reduce the number of unknown words and (ii) estimate probabilities that are closer to reality.

(c) The mapping between the unified tagset and the lexical tagset (used in the Brown corpus) can be done in a straightforward way.
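Point (b) can be illustrated with a small Python sketch, assuming that lexical probabilities are relative-frequency estimates of P(word | tag) and that the lexical corpus has already been mapped to the unified tagset. The abstract does not state the estimator used, so this is an assumption, and the toy data and the function names `lexical_probs` and `unknown_words` are hypothetical.

```python
from collections import Counter


def lexical_probs(tagged_tokens):
    """Relative-frequency estimate of P(word | tag) from (word, tag) pairs."""
    pair_counts = Counter(tagged_tokens)
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    return {(w, t): c / tag_counts[t] for (w, t), c in pair_counts.items()}


def unknown_words(test_words, tagged_tokens):
    """Return the test words that never occur in the tagged training tokens."""
    vocab = {w for w, _ in tagged_tokens}
    return [w for w in test_words if w not in vocab]


# Toy data (hypothetical): a tiny treebank and a larger tagged lexical corpus.
treebank = [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")]
lexicon = [("the", "DT"), ("cat", "NN"), ("dog", "NN"), ("sleeps", "VBZ")]
test_words = ["the", "cat", "sleeps", "purrs"]

before = unknown_words(test_words, treebank)
after = unknown_words(test_words, treebank + lexicon)
probs = lexical_probs(treebank + lexicon)
```

In this toy case the list of unknown words shrinks once the lexical corpus is added, and the P(word | tag) estimates are computed from more observations per tag.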

American Psychological Association (APA)

Ulaywi, Ahmad Husayn. 2016. Unification of multiple treebanks and testing them with statistical parser with support of large corpus as a lexical resource. Engineering and Technology Journal, Vol. 34, no. 5B, pp.711-720.
https://search.emarefa.net/detail/BIM-783777

Modern Language Association (MLA)

Ulaywi, Ahmad Husayn. Unification of multiple treebanks and testing them with statistical parser with support of large corpus as a lexical resource. Engineering and Technology Journal Vol. 34, no. 5B (2016), pp.711-720.
https://search.emarefa.net/detail/BIM-783777

American Medical Association (AMA)

Ulaywi, Ahmad Husayn. Unification of multiple treebanks and testing them with statistical parser with support of large corpus as a lexical resource. Engineering and Technology Journal. 2016. Vol. 34, no. 5B, pp.711-720.
https://search.emarefa.net/detail/BIM-783777

Data Type

Journal Articles

Language

English

Notes

Includes bibliographical references: p. 720

Record ID

BIM-783777