An automated system for Arabic named entity recognition
Dissertant
Thesis advisor
University
Princess Sumaya University for Technology
Faculty
King Hussein Faculty for Computing Sciences
Department
Department of Computer Sciences
University Country
Jordan
Degree
Master
Degree Date
2018
English Abstract
The rapid growth of Arabic text on the internet over the past decade has created a need for new, adequate techniques to classify, analyze, and process online information.
Recent research has shown that much of the valuable information in a text is tied to Named Entities (NEs).
Named Entity Recognition (NER) is the process of identifying proper names as well as temporal and numeric expressions in open-domain text.
NER plays a key role in Natural Language Processing (NLP) tasks such as Information Retrieval, Machine Translation, Information Extraction and Question Answering.
As a result, NER has become an active research area that has drawn considerable attention from researchers.
The extraction of NEs in Arabic is challenging due to the scarcity of lexical resources for Arabic NEs.
In addition, the characteristics and peculiarities of Arabic make research in this area particularly challenging.
Regrettably, most reliable NER systems for Arabic have been developed commercially, and neither their approaches nor their accuracy figures are publicly available for research purposes.
Furthermore, the lack of adequate Arabic lexical resources available for research makes the task of Arabic Named Entity Recognition (ANER) difficult.
In this thesis, we propose an ANER system named PSUT-ANERsys.
PSUT-ANERsys has two components. The first is ANERtagger, an approach for ANER.
The second is Auto-ANERgazets, a new approach for building Arabic NE gazetteers automatically.
ANERtagger is proposed specifically for Arabic text.
It is based on processing Arabic Wikipedia, including its infoboxes, together with a pipeline of NLP techniques such as the Stanford Part-of-Speech (POS) Tagger, N-grams, Levenshtein similarity, and Term Frequency-Inverse Document Frequency (TF-IDF).
Auto-ANERgazets gathers a large number of NEs by crawling a large number of web documents and then applying the Stanford NER Tagger together with the Google Translate API to address the limitations of Arabic lexical resources.
The performance of PSUT-ANERsys is evaluated against a gold-standard evaluation corpus, Benajiba’s ANERCorp.
The preliminary results show that the proposed approaches handle the problems of Arabic NER with high accuracy.
For example, Auto-ANERgazets so far contains more than 25,420 NE tokens, and this number can be increased automatically.
Moreover, evaluated against ANERCorp, Auto-ANERgazets achieves 90.15% precision, 90.56% recall, and a 90.35% F-measure.
In addition, ANERtagger achieves 91.78% precision, 91.83% recall, and a 91.80% F-measure.
Furthermore, the performance of PSUT-ANERsys is benchmarked against other ANER systems.
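The precision, recall, and F-measure figures reported in the abstract can be cross-checked with a short calculation. The sketch below assumes the F-measure is the balanced F1 score, that is, the harmonic mean of precision and recall. The abstract does not state which F-measure variant was used, so this is an assumption, and the function name `f_measure` is illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is F1 (beta = 1).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision, recall, and F-measure (in percent) as reported in the abstract.
reported = {
    "Auto-ANERgazets": (90.15, 90.56, 90.35),
    "ANERtagger": (91.78, 91.83, 91.80),
}

for name, (p, r, f_reported) in reported.items():
    f = f_measure(p, r)
    print(f"{name}: computed F = {f:.2f}%, reported F = {f_reported:.2f}%")
```

Under this assumption, the computed values agree with the reported F-measures to two decimal places for both components.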
Main Subjects
Information Technology and Computer Science
Topics
No. of Pages
147
Table of Contents
Table of contents.
Abstract.
Abstract in Arabic.
Chapter One : Introduction.
Chapter Two : Research background and literature review.
Chapter Three : Research methodology.
Chapter Four : Experimental setup and discussion.
Chapter Five : Research Conclusion and future work.
References.
American Psychological Association (APA)
Bani Bakr, Safiyah. (2018). An automated system for Arabic named entity recognition. (Master's thesis). Princess Sumaya University for Technology, Jordan
https://search.emarefa.net/detail/BIM-833200
Modern Language Association (MLA)
Bani Bakr, Safiyah. An automated system for Arabic named entity recognition. (Master's thesis). Princess Sumaya University for Technology. (2018).
https://search.emarefa.net/detail/BIM-833200
American Medical Association (AMA)
Bani Bakr, Safiyah. (2018). An automated system for Arabic named entity recognition. (Master's thesis). Princess Sumaya University for Technology, Jordan
https://search.emarefa.net/detail/BIM-833200
Language
English
Data Type
Arab Theses
Record ID
BIM-833200