Biclustering text documents using frequent term-based association mining
Dissertant
Thesis advisor
Committee Members
al-Sarayirah, Bashshar
al-Hamami, Ala Husayn
Hattab, Izz al-Din Shakir Hasan
University
Arab Academy for Financial and Banking Sciences
Faculty
The Faculty of Information Systems and Technology
Department
Computer information systems
University Country
Jordan
Degree
Ph.D.
Degree Date
2011
English Abstract
Traditional text clustering algorithms such as prototype clustering, hierarchical clustering, density-based clustering, and fuzzy clustering do not address the main problems in text mining, namely high dimensionality, very large database sizes, and a lack of understandable cluster descriptions.
Biclustering has the advantage of producing clusters of all sizes that provide more detail about groupings of data in a dataset.
It has been applied successfully to gene expression data by numerous algorithms such as the Biclustering MODULE (BIMODULE).
Frequent Term sets (FTS), created by association mining algorithms, have the advantage of being readily understandable and can be treated as candidate clusters.
A number of FTS-based clustering algorithms were developed such as the Frequent Term Clustering (FTC) and the BIMODULE.
These algorithms are still short of providing a full solution to the above three problems.
In this thesis, a new text document biclustering algorithm, referred to as "Biclustering Using Frequent Term Association mining" (BUFTA), is proposed.
BUFTA combines the methodologies of biclustering and association mining, and uses proper pre-processing steps to solve the above three problems.
The pre-processing steps (e.g., cleansing, stemming, normalization, and discretization) are necessary to reduce dimensionality and size of the input datasets.
Biclustering helps discover smaller clusters, which may also overlap.
Association mining is used to produce understandable cluster-like FTSs.
We use the efficient Linear-time Closed itemset Miner (LCM) to produce these candidate clusters.
The F-measure was used to validate the accuracy of the BUFTA algorithm and its ability to produce quality clusters.
Processing time was also measured to assess performance.
BUFTA's performance was compared to that of the FTC on the same corpus (Reuters news) for an objective comparison, and BUFTA proved superior in all aspects.
BUFTA showed major improvements in terms of cluster quality (precision, recall, and cluster understandability) and complexity of computation.
This superiority is due to BUFTA's main advantages over the FTC: proper pre-processing steps, the use of a more efficient Apriori variant (the LCM), and the use of biclustering.
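The abstract's two central ideas, frequent term sets serving as candidate clusters and the F-measure serving as the quality score, can be illustrated with a minimal sketch. This is not the BUFTA implementation: the enumeration below is brute force rather than LCM, and the documents, the function names `frequent_term_sets` and `f_measure`, and the `min_support` and `max_size` parameters are all hypothetical, chosen only for illustration.

```python
from itertools import combinations


def frequent_term_sets(docs, min_support, max_size=2):
    """Map each frequent term set to its cover (the documents containing it).

    Brute-force enumeration for illustration only; this is NOT LCM.
    """
    vocabulary = sorted(set().union(*docs.values()))
    result = {}
    for size in range(1, max_size + 1):
        for terms in combinations(vocabulary, size):
            # The cover of a term set is a candidate cluster of documents.
            cover = frozenset(
                doc_id for doc_id, words in docs.items() if set(terms) <= words
            )
            if len(cover) >= min_support:
                result[frozenset(terms)] = cover
    return result


def f_measure(cluster, reference):
    """Harmonic mean of precision and recall of a cluster against a reference class."""
    overlap = len(cluster & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(cluster)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical documents, assumed to be already cleansed and stemmed.
docs = {
    "d1": {"oil", "price", "opec"},
    "d2": {"oil", "price", "barrel"},
    "d3": {"wheat", "price", "export"},
    "d4": {"wheat", "export", "grain"},
}

candidates = frequent_term_sets(docs, min_support=2)
oil_class = frozenset({"d1", "d2"})  # hypothetical reference class
score = f_measure(candidates[frozenset({"price"})], oil_class)
```

Each key of `candidates` doubles as a readable cluster description (for example, the term set {oil, price} covers d1 and d2), and covers may overlap, since d1 and d2 also fall under the broader {price} cluster.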
Main Subjects
Information Technology and Computer Science
Topics
No. of Pages
104
Table of Contents
Table of contents.
Abstract.
Chapter One : introduction.
Chapter Two : literature review.
Chapter Three : biclustering using frequent term association (BUFTA) mining.
Chapter Four : results and discussion.
Chapter Five : conclusions and recommendations for future work.
References.
American Psychological Association (APA)
Ghazal, Muhammad Said Salih. (2011). Biclustering text documents using frequent term-based association mining. (Doctoral dissertation). Arab Academy for Financial and Banking Sciences, Jordan
https://search.emarefa.net/detail/BIM-306756
Modern Language Association (MLA)
Ghazal, Muhammad Said Salih. Biclustering text documents using frequent term-based association mining. (Doctoral dissertation). Arab Academy for Financial and Banking Sciences. (2011).
https://search.emarefa.net/detail/BIM-306756
American Medical Association (AMA)
Ghazal, Muhammad Said Salih. (2011). Biclustering text documents using frequent term-based association mining. (Doctoral dissertation). Arab Academy for Financial and Banking Sciences, Jordan
https://search.emarefa.net/detail/BIM-306756
Language
English
Data Type
Arab Theses
Record ID
BIM-306756