Clustering XML documents based on structure

Other Title(s)

تجميع وثائق XML على أساس الهيكل

University

University of Baghdad

Faculty

College of Science

Department

Department of Computer Science

University Country

Iraq

Degree

Master

Degree Date

2014

English Abstract

XML (Extensible Markup Language) has become a gateway for communication between applications, even applications on very different systems, so the number of XML documents on the Web keeps growing. It has therefore become essential to organize these documents effectively in order to retrieve useful information from them.

A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing.

This thesis presents a technique for clustering XML documents by structure. The documents are modeled as rooted, ordered, labeled trees, and structural distance metrics are used in hierarchical clustering algorithms to detect groups of structurally similar XML documents.
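As an illustration of the clustering step, the following is a minimal sketch of agglomerative (hierarchical) clustering over a precomputed distance matrix. The abstract does not say which linkage or stopping rule the thesis uses, so the single-link criterion, the `threshold` parameter, and the function name `single_link_clusters` are assumptions made only for this example.

```python
def single_link_clusters(dist, threshold):
    """Agglomerative clustering on a symmetric distance matrix.

    Assumption for illustration: single linkage, and merging stops once the
    closest pair of clusters is farther apart than `threshold`.
    """
    clusters = [{i} for i in range(len(dist))]  # start with one cluster per document
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # single link: distance between the closest members
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] |= clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters


# Hypothetical structural distances between four documents.
example_dist = [
    [0, 1, 9, 8],
    [1, 0, 7, 9],
    [9, 7, 0, 2],
    [8, 9, 2, 0],
]
example_clusters = single_link_clusters(example_dist, threshold=3)
```

With these made-up distances, documents 0 and 1 end up in one group and documents 2 and 3 in another.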

Structural summaries of the trees are used to improve the performance of the distance calculation while maintaining, or even improving, its accuracy.
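The abstract does not define how a structural summary is built. One simple possibility, shown here purely as an assumption, is to merge sibling elements that share a label so that repeated structure appears only once. Trees are written as `(label, [children])` tuples, and the name `summarize` is hypothetical.

```python
def summarize(tree):
    """Merge same-label siblings, recursively, to get a smaller tree.

    Assumption for illustration only: the thesis may build its summaries
    differently. Siblings keep the order of their first occurrence.
    """
    label, children = tree
    merged = {}  # dicts preserve insertion order (Python 3.7+)
    for child_label, grandchildren in children:
        merged.setdefault(child_label, []).extend(grandchildren)
    return (label, [summarize((lbl, kids)) for lbl, kids in merged.items()])


full_tree = ("library", [
    ("book", [("title", []), ("author", [])]),
    ("book", [("title", [])]),
])
summary_tree = summarize(full_tree)
```

Here the two `book` elements collapse into one, so the summary has four nodes instead of six, which is the kind of size reduction that makes a distance calculation cheaper.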

The work is conducted on real and synthetic XML documents.

Two sets of 1000 documents are generated from 10 real-case and synthetic DTDs (Document Type Definitions), together with three real datasets from the ACM (Association for Computing Machinery).

XML documents are represented as trees using a DOM (Document Object Model) parser, an object-oriented tool available in C#. DOM was preferred to SAX (Simple API for XML), which is also available in C#, because it loads the whole XML document into memory (all nodes and the relations between them), which makes the tree edit distance calculation fast.
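The thesis works in C#. As a stand-in, the sketch below uses Python's standard `xml.dom.minidom` parser to show the same idea: load the whole document into memory through DOM, then keep only element labels and child order. The helper names `dom_to_tree` and `parse_structure` and the sample document are assumptions for illustration.

```python
from xml.dom import minidom


def dom_to_tree(node):
    """Convert a DOM element into a (label, [children]) tuple.

    Only element nodes are kept, in document order; text and attributes are
    ignored because only the structure matters here.
    """
    children = [dom_to_tree(child) for child in node.childNodes
                if child.nodeType == child.ELEMENT_NODE]
    return (node.tagName, children)


def parse_structure(xml_text):
    """Parse an XML string with DOM and return its rooted ordered labeled tree."""
    document = minidom.parseString(xml_text)
    return dom_to_tree(document.documentElement)


sample_xml = (
    "<library>"
    "<book><title>A</title><author>B</author></book>"
    "<book><title>C</title></book>"
    "</library>"
)
sample_tree = parse_structure(sample_xml)
```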

Chawathe's algorithm measures the tree edit distance between two rooted, ordered trees with labeled vertices: it finds the minimum cost of transforming one tree into the other by a sequence of elementary operations, namely deleting existing nodes, relabeling existing nodes, and inserting new nodes.
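To make the idea of a tree edit distance concrete, here is a minimal sketch of a simplified top-down variant (in the style of Selkow's distance), in which nodes are relabeled at unit cost and whole subtrees are inserted or deleted at a cost equal to their size. This is not Chawathe's algorithm; it is a swapped-in, simpler technique chosen to keep the example short. The unit costs and the names `size` and `tree_distance` are assumptions.

```python
def size(tree):
    """Number of nodes in a (label, [children]) tree."""
    _, children = tree
    return 1 + sum(size(child) for child in children)


def tree_distance(t1, t2):
    """Simplified top-down edit distance between two ordered labeled trees.

    Assumptions: relabeling a node costs 1; inserting or deleting a subtree
    costs one unit per node. Not Chawathe's algorithm.
    """
    (label1, kids1), (label2, kids2) = t1, t2
    m, n = len(kids1), len(kids2)
    # d[i][j]: cost of turning the first i children of t1 into the first j of t2
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(kids1[i - 1])  # delete subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(kids2[j - 1])  # insert subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(kids1[i - 1]),
                d[i][j - 1] + size(kids2[j - 1]),
                d[i - 1][j - 1] + tree_distance(kids1[i - 1], kids2[j - 1]),
            )
    relabel_cost = 0 if label1 == label2 else 1
    return relabel_cost + d[m][n]


tree_a = ("a", [("b", []), ("c", [("d", [])])])
tree_b = ("a", [("b", []), ("c", [])])
distance_ab = tree_distance(tree_a, tree_b)
```

The two example trees differ only by the leaf `d`, so a single deletion transforms one into the other.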

With Chawathe's algorithm, the average time to derive the structural summaries of two XML documents and then calculate the structural distance between those summaries is 95% less than the time to calculate the structural distance between the full rooted, ordered, labeled trees of the two documents (that is, without structural summaries).

Chawathe's algorithm with structural summaries achieves high precision (PR = 0.963) and high recall (R = 0.963). Compared with results in similar work, such as “Evaluating Structural Similarity in XML Documents” by Andrew Nierman and H. Jagadish, this indicates good performance and good clustering accuracy.

Main Subjects

Information Technology and Computer Science

No. of Pages

825

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One : Introduction.

Chapter Two : XML structure and methods.

Chapter Three : XML clustering.

Chapter Four : Experimental evaluation.

Chapter Five : Conclusions and future works.

References.

American Psychological Association (APA)

Najm, Yasir Abd al-Hamid. (2014). Clustering XML documents based on structure. (Master's thesis). University of Baghdad, Iraq.
https://search.emarefa.net/detail/BIM-605966

Modern Language Association (MLA)

Najm, Yasir Abd al-Hamid. Clustering XML documents based on structure. (Master's thesis). University of Baghdad. (2014).
https://search.emarefa.net/detail/BIM-605966

Language

English

Data Type

Arab Theses

Record ID

BIM-605966
