An automated system for Arabic named entity recognition

Thesis author

Bani Bakr, Safiyah

Thesis supervisor

al-Kuz, Akram

University

Princess Sumaya University for Technology

Faculty

King Hussein School of Computing Sciences

Academic department

Department of Computer Science

University country

Jordan

Degree

Master's

Degree date

2018

English abstract

The rapid growth of Arabic text on the internet over the past decade has created a need for new, effective techniques to classify, analyze, and process online information.

Recent research has shown that much of the valuable information in a text is tied to Named Entities (NEs).

Named Entity Recognition (NER) is the process of identifying proper names as well as temporal and numeric expressions in open-domain text.

NER plays a key role in Natural Language Processing (NLP) tasks such as Information Retrieval, Machine Translation, Information Extraction and Question Answering.

NER has therefore become an active research area that attracts considerable attention.

Extracting NEs from Arabic text is challenging because lexical resources for Arabic NEs are scarce.

In addition, the characteristics and peculiarities of the Arabic language complicate research in this area.

Regrettably, most reliable NER systems for Arabic have been developed commercially, and neither their approaches nor their accuracy figures are publicly available for research purposes.

Furthermore, the lack of adequate Arabic lexical resources available for research makes Arabic Named Entity Recognition (ANER) difficult.

In this thesis, we propose an ANER system named PSUT-ANERsys.

PSUT-ANERsys has two components. The first, ANERtagger, is an approach for ANER.

The second, Auto-ANERgazets, is a new approach for building Arabic NE gazetteers automatically.

ANERtagger is designed specifically for Arabic text.

It processes Arabic Wikipedia, including its infoboxes, through a pipeline of NLP techniques such as the Stanford Part-of-Speech (POS) Tagger, N-grams, Levenshtein similarity, and Term Frequency-Inverse Document Frequency (TF-IDF).
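Three of the techniques named above (N-grams, Levenshtein similarity, and TF-IDF) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the normalization of edit distance by the longer string's length, and the unsmoothed TF-IDF weighting are all assumptions made for illustration only.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance: fewest insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def word_ngrams(tokens, n):
    """All contiguous runs of n tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs):
    """docs is a list of token lists; returns one {term: weight} dict per doc.

    Assumed weighting: (count / doc length) * ln(N / document frequency).
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                        for t, c in tf.items()})
    return weights


if __name__ == "__main__":
    # Two spellings that differ by one inserted letter.
    print(levenshtein_similarity("محمد", "محمود"))
    print(word_ngrams(["جامعة", "الأميرة", "سمية"], 2))
```

A similarity close to 1 flags near-identical spellings of a candidate name, while the TF-IDF weight of a term drops to zero when it occurs in every document.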

Auto-ANERgazets collects a large number of NEs by crawling a large number of web documents and then applying the Stanford NER Tagger together with the Google Translate API to compensate for the limited Arabic lexical resources.

PSUT-ANERsys is evaluated against a gold-standard evaluation corpus, Benajiba's ANERCorp.

Preliminary results show that the proposed approaches address NER for Arabic with high accuracy.

For example, the NEs collected in Auto-ANERgazets so far exceed 25,420 tokens, and the gazetteers can be extended automatically.

Evaluated against ANERCorp, Auto-ANERgazets achieves a precision of 90.15%, a recall of 90.56%, and an F-measure of 90.35%.

ANERtagger achieves a precision of 91.78%, a recall of 91.83%, and an F-measure of 91.80%.
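As a quick consistency check on the figures above, the short Python sketch below recomputes the F-measure from the reported precision and recall. It assumes the F-measure is the balanced F1 score (harmonic mean, beta = 1), which the abstract does not state explicitly; the function name `f_measure` is illustrative.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# (precision, recall, reported F-measure) in percent, taken from the abstract.
reported = {
    "Auto-ANERgazets": (90.15, 90.56, 90.35),
    "ANERtagger": (91.78, 91.83, 91.80),
}

for name, (p, r, f) in reported.items():
    print(f"{name}: computed F = {f_measure(p, r):.2f}, reported F = {f:.2f}")
```

Under that assumption, the recomputed values agree with the reported F-measures to two decimal places.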

Furthermore, the performance of PSUT-ANERsys is benchmarked against other ANER systems.

Main subjects

Information Technology and Computer Science

Number of pages

147

Table of contents

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One: Introduction.

Chapter Two: Research background and literature review.

Chapter Three: Research methodology.

Chapter Four: Experimental setup and discussion.

Chapter Five: Research conclusion and future work.

References.

American Psychological Association (APA) citation style

Bani Bakr, Safiyah. (2018). An automated system for Arabic named entity recognition (Master's thesis). Princess Sumaya University for Technology, Jordan.
https://search.emarefa.net/detail/BIM-833200

Modern Language Association (MLA) citation style

Bani Bakr, Safiyah. An automated system for Arabic named entity recognition. (Master's thesis). Princess Sumaya University for Technology. (2018).
https://search.emarefa.net/detail/BIM-833200

American Medical Association (AMA) citation style

Bani Bakr, Safiyah. (2018). An automated system for Arabic named entity recognition (Master's thesis). Princess Sumaya University for Technology, Jordan.
https://search.emarefa.net/detail/BIM-833200

Text language

English

Data type

Theses

Record number

BIM-833200