An automated system for Arabic named entity recognition

Dissertant

Bani Bakr, Safiyah

Thesis advisor

al-Kuz, Akram

University

Princess Sumaya University for Technology

Faculty

King Hussein Faculty for Computing Sciences

Department

Department of Computer Sciences

University Country

Jordan

Degree

Master

Degree Date

2018

English Abstract

The rapid growth of Arabic textual information on the internet over the past decade has created a need for new, effective techniques to classify, analyze, and process online information.

Recent research has shown that much of the valuable information in a text is tied to Named Entities (NEs).

Named Entity Recognition (NER) is the process of identifying proper names as well as temporal and numeric expressions in open-domain text.

NER plays a key role in Natural Language Processing (NLP) tasks such as Information Retrieval, Machine Translation, Information Extraction and Question Answering.

As a result, NER has become an active research area that has drawn considerable attention from researchers.

The extraction of NEs in Arabic is challenging due to the scarcity of lexical resources for Arabic NEs.

In addition, the linguistic characteristics and peculiarities of Arabic complicate research in this area.

Regrettably, most reliable NER systems for Arabic have been developed commercially, and neither the approaches they use nor their accuracy figures are publicly available for research purposes.

Furthermore, the lack of adequate Arabic lexical resources available for research makes the task of Arabic Named Entity Recognition (ANER) difficult.

In this thesis, we propose an ANER system named PSUT-ANERsys.

PSUT-ANERsys has two components. The first, ANERtagger, is an approach for ANER.

The second, Auto-ANERgazets, is a new approach for building Arabic NE gazetteers automatically.

ANERtagger is proposed specifically for Arabic text.

It processes Arabic Wikipedia, including its infoboxes, through a pipeline of multiple NLP techniques, such as the Stanford Part-of-Speech (POS) Tagger, n-grams, Levenshtein similarity, and Term Frequency-Inverse Document Frequency (TF-IDF).
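The abstract does not say how the n-gram and Levenshtein steps are combined, so the following Python snippet is only a minimal sketch of one plausible arrangement: word n-grams from a sentence are compared with a list of candidate entity names by normalized Levenshtein similarity. The function names, the `titles` list, the `max_n` limit, and the `threshold` value are all illustrative assumptions and are not taken from the thesis; POS tagging and TF-IDF are left out.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insert, delete, substitute each cost 1)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Levenshtein distance normalized to [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def word_ngrams(tokens, n):
    """All contiguous runs of n tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def best_title_match(tokens, titles, max_n=3, threshold=0.8):
    """Return (ngram, title, score) for the highest-scoring pair at or above
    the threshold, or None. `titles` is a hypothetical list of candidate
    entity names, for example page titles; the threshold is an assumed value."""
    best = None
    for n in range(1, max_n + 1):
        for gram in word_ngrams(tokens, n):
            for title in titles:
                score = similarity(gram, title)
                if score >= threshold and (best is None or score > best[2]):
                    best = (gram, title, score)
    return best
```

Because the functions operate on Unicode strings, they can be applied to Arabic tokens directly; in practice the text would usually be normalized first (for instance, removing diacritics), which this sketch does not attempt.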

Auto-ANERgazets collects a large number of NEs by crawling a large number of web documents and then combining the Stanford NER Tagger with the Google Translate API to address the limitations of Arabic lexical resources.

The performance of PSUT-ANERsys is evaluated against a gold-standard evaluation corpus, Benajiba’s ANERCorp.

The preliminary results show that the proposed approaches handle the problems of Arabic NER with high accuracy.

For example, Auto-ANERgazets currently contains more than 25,420 NE tokens, and this number can be increased automatically.

Moreover, evaluated against ANERCorp, Auto-ANERgazets achieves a precision of 90.15%, a recall of 90.56%, and an F-measure of 90.35%.

In addition, ANERtagger achieves a precision of 91.78%, a recall of 91.83%, and an F-measure of 91.80%.

Furthermore, the performance of PSUT-ANERsys is benchmarked against other ANER systems.
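The abstract does not define the F-measure, but the reported figures are consistent with the standard balanced F1 score, the harmonic mean of precision and recall. The short Python sketch below assumes that definition; the function name and the `beta` parameter are illustrative and do not come from the thesis.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta = 1 gives the harmonic mean of precision and recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported in the abstract, in percent.
auto_anergazets_f1 = round(f_measure(90.15, 90.56), 2)  # 90.35
anertagger_f1 = round(f_measure(91.78, 91.83), 2)       # 91.8
```

Both values agree with the F-measures stated in the abstract to two decimal places.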

Main Subjects

Information Technology and Computer Science

Topics

Pattern recognition

No. of Pages

147

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One: Introduction.

Chapter Two: Research background and literature review.

Chapter Three: Research methodology.

Chapter Four: Experimental setup and discussion.

Chapter Five: Research conclusion and future work.

References.

American Psychological Association (APA)

Bani Bakr, Safiyah. (2018). An automated system for Arabic named entity recognition (Master's thesis). Princess Sumaya University for Technology, Jordan.
https://search.emarefa.net/detail/BIM-833200

Modern Language Association (MLA)

Bani Bakr, Safiyah. An automated system for Arabic named entity recognition. Master's thesis. Princess Sumaya University for Technology, 2018.
https://search.emarefa.net/detail/BIM-833200

Language

English

Data Type

Arab Theses

Record ID

BIM-833200
