Biclustering text documents using frequent term-based association mining

Dissertant

Ghazal, Muhammad Said Salih

Thesis advisor

al-Bahadili, Husayn

Committee Members

al-Sarayirah, Bashshar
al-Hamami, Ala Husayn
Hattab, Izz al-Din Shakir Hasan

University

Arab Academy for Financial and Banking Sciences

Faculty

The Faculty of Information Systems and Technology

Department

Computer information systems

University Country

Jordan

Degree

Ph.D.

Degree Date

2011

English Abstract

Traditional text clustering algorithms such as prototype clustering, hierarchical clustering, density-based clustering, and fuzzy clustering do not address the main problems in text mining, namely high dimensionality, very large database sizes, and lack of understandable cluster descriptions.

Biclustering has the advantage of producing clusters of all sizes that provide more detail about groupings of data in a dataset.

Biclustering has been applied successfully to gene expression data by numerous algorithms, such as the Biclustering MODULE (BIMODULE).

Frequent Term Sets (FTSs), created by association mining algorithms, have the advantage of being readily understandable and can be treated as candidate clusters.
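
As a minimal illustration of how frequent term sets can act as candidate clusters, the following Python sketch enumerates term sets by brute force and maps each one to the documents that contain all of its terms. The function name, the toy documents, and the support threshold are hypothetical and are not taken from the thesis; a dedicated miner would replace the exhaustive enumeration in practice.

```python
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set to the ids of the documents covering it.

    Brute-force enumeration for illustration only (an assumption, not the
    thesis's method); it does not scale to real corpora.
    """
    doc_terms = [set(d) for d in docs]
    vocabulary = sorted(set().union(*doc_terms))
    result = {}
    for size in range(1, max_size + 1):
        for terms in combinations(vocabulary, size):
            # The cover is the set of documents that contain every term.
            cover = frozenset(
                i for i, present in enumerate(doc_terms)
                if present.issuperset(terms)
            )
            if len(cover) >= min_support:
                result[frozenset(terms)] = cover
    return result


# Hypothetical toy corpus: each document is already a list of terms.
docs = [
    ["oil", "price", "opec"],
    ["oil", "price", "market"],
    ["wheat", "crop", "price"],
    ["wheat", "crop", "export"],
]
fts = frequent_term_sets(docs, min_support=2)
for terms, cover in fts.items():
    print(sorted(terms), "->", sorted(cover))
```

Each term set doubles as a readable label for its candidate cluster, and the covers may overlap: document 2 falls under both the "price" and the "wheat" term sets.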

A number of FTS-based clustering algorithms have been developed, such as Frequent Term Clustering (FTC) and BIMODULE.

These algorithms still fall short of providing a full solution to the three problems above.

In this thesis, a new text document biclustering algorithm, which we refer to as “Biclustering Using Frequent Term Association mining” (BUFTA), is proposed.

BUFTA uses the methodologies of biclustering and association mining, together with proper pre-processing steps, to solve the three problems above.

The pre-processing steps (e.g., cleansing, stemming, normalization, and discretization) are necessary to reduce dimensionality and size of the input datasets.
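
The kind of pipeline named above can be sketched in a few lines of Python. This is only an assumed, simplified reading of the four steps: the stop-word list, the crude suffix stripper (a stand-in for a real stemmer), and the count threshold used for discretization are all hypothetical and are not the thesis's actual procedure.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def crude_stem(word):
    """Strip a few common suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text, min_count=1):
    """Turn raw text into a set of terms (a binary document vector)."""
    # Cleansing and normalization: lowercase, keep alphabetic tokens only.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Stemming, after dropping stop words.
    stems = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    # Discretization: a term is "present" if it occurs at least min_count times.
    counts = Counter(stems)
    return {term for term, n in counts.items() if n >= min_count}


print(preprocess("Oil prices and the price of oil!"))
```

Reducing each document to a small set of stemmed terms shrinks both the vocabulary (dimensionality) and the per-document representation (size) before any mining takes place.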

Biclustering helps discover smaller clusters that could be overlapping as well.

Association mining is used to produce understandable cluster-like FTSs.

We use the efficient Linear-time Closed Itemset Miner (LCM) to produce these candidate clusters.

The F-measure was used to validate the accuracy of the BUFTA algorithm and its ability to produce quality clusters.
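
For reference, the usual balanced F-measure of a cluster against a reference class is the harmonic mean of precision and recall. The Python sketch below assumes that standard definition; the abstract does not state which variant or aggregation over clusters the thesis used, and the function name and example sets are hypothetical.

```python
def f_measure(cluster, reference_class):
    """Balanced F-measure (F1) of a cluster against a reference class.

    Both arguments are sets of document ids. This is the standard
    definition, assumed here rather than taken from the thesis.
    """
    overlap = len(cluster & reference_class)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)        # share of the cluster that is correct
    recall = overlap / len(reference_class)   # share of the class that was found
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: 2 of the 4 clustered documents belong to a class of 6.
score = f_measure({1, 2, 3, 4}, {3, 4, 5, 6, 7, 8})
print(round(score, 3))
```

Here precision is 2/4 and recall is 2/6, so the score is 0.4; a cluster that matches its class exactly scores 1.0.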

Performance was also assessed by measuring processing time.

BUFTA’s performance was compared with that of FTC on the same corpora (Reuters news) for an objective comparison, and BUFTA proved superior in all aspects.

BUFTA showed major improvements in terms of cluster quality (precision, recall, and cluster understandability) and complexity of computation.

This superiority is due to BUFTA’s main advantages over the FTC: proper pre-processing steps, the use of a more efficient Apriori variant (the LCM), and the use of biclustering.

Main Subjects

Information Technology and Computer Science

No. of Pages

104

Table of Contents

Table of contents.

Abstract.

Chapter One: Introduction.

Chapter Two: Literature review.

Chapter Three: Biclustering using frequent term association (BUFTA) mining.

Chapter Four: Results and discussion.

Chapter Five: Conclusions and recommendations for future work.

References.

American Psychological Association (APA)

Ghazal, Muhammad Said Salih. (2011). Biclustering text documents using frequent term-based association mining. (Doctoral dissertation). Arab Academy for Financial and Banking Sciences, Jordan.
https://search.emarefa.net/detail/BIM-306756

Modern Language Association (MLA)

Ghazal, Muhammad Said Salih. Biclustering text documents using frequent term-based association mining. (Doctoral dissertation). Arab Academy for Financial and Banking Sciences. (2011).
https://search.emarefa.net/detail/BIM-306756

American Medical Association (AMA)

Ghazal, Muhammad Said Salih. (2011). Biclustering text documents using frequent term-based association mining. (Doctoral dissertation). Arab Academy for Financial and Banking Sciences, Jordan.
https://search.emarefa.net/detail/BIM-306756

Language

English

Data Type

Arab Theses

Record ID

BIM-306756