Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources

Publication Date

2020-08-01

Country of Publication

Egypt

No. of Pages

13

Main Subjects

Mathematics

Abstract EN

In recent years, big data analysis has become a popular way to extract relevant knowledge about real domains from multiple sources that are initially of little value on their own.

However, many big data sources provide unstructured textual data.

Analysing such data properly requires tools that combine big data and text-analysis techniques.

With this in mind, we combined a pipelining framework, BDP4J (Big Data Pipelining For Java), with implementations of a set of text preprocessing techniques to create NLPA (Natural Language Preprocessing Architecture), an extensible open-source plugin whose preprocessing steps can be easily combined into a pipeline.

Additionally, NLPA can generate datasets using either a classical token-based representation or a newer synset-based representation that can be further processed using semantic information (i.e., using ontologies).

This work presents a case study of NLPA in operation, covering the transformation of raw heterogeneous big data into different dataset representations (synsets and tokens) and the use of the Weka application programming interface (API) to launch two well-known classifiers.

American Psychological Association (APA)

Novo-Lourés, María, Pavón, Reyes, Laza, Rosalía, Ruano-Ordas, David, & Méndez, Jose R. 2020. Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources. Scientific Programming, Vol. 2020, no. 2020, pp. 1-13.
https://search.emarefa.net/detail/BIM-1208988

Modern Language Association (MLA)

Novo-Lourés, María, et al. Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources. Scientific Programming No. 2020 (2020), pp. 1-13.
https://search.emarefa.net/detail/BIM-1208988

American Medical Association (AMA)

Novo-Lourés, María, Pavón, Reyes, Laza, Rosalía, Ruano-Ordas, David, & Méndez, Jose R. Using Natural Language Preprocessing Architecture (NLPA) for Big Data Text Sources. Scientific Programming. 2020. Vol. 2020, no. 2020, pp. 1-13.
https://search.emarefa.net/detail/BIM-1208988

Data Type

Journal Articles

Language

English

Notes

Includes bibliographical references

Record ID

BIM-1208988
