Clustering XML documents based on structure

Other titles

تجميع وثائق XML على أساس الهيكل

University

University of Baghdad

College

College of Science

Academic department

Department of Computer Science

University country

Iraq

Degree

Master's

Degree date

2014

English abstract

XML (Extensible Markup Language) has become a gateway for communication between applications, even applications on very different systems, so the number of XML documents on the Web is growing. It is therefore essential to organize these documents effectively in order to retrieve useful information from them.

A possible solution is to cluster the XML documents in order to discover knowledge that promotes effective data management, information retrieval and query processing.

This thesis presents a technique for clustering XML documents by structure: the documents are modeled as rooted, ordered, labeled trees, and structural distance metrics are used in hierarchical clustering algorithms to detect groups of structurally similar XML documents.
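As an illustration of how a structural distance matrix can drive hierarchical clustering, the following is a minimal sketch only. The abstract does not state which linkage or stopping rule the thesis uses, so single-link merging and the `threshold` parameter are assumptions, and the distance values are made up for the example.

```python
def single_link_clusters(dist, threshold):
    """Agglomerative clustering over a symmetric distance matrix.

    Repeatedly merges the two closest clusters (single link: the distance
    between clusters is the smallest pairwise document distance) and stops
    when the closest pair is farther apart than `threshold`.
    """
    clusters = [{i} for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d > threshold:
            break
        clusters[a] |= clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Hypothetical structural distances between four documents.
distances = [
    [0, 1, 6, 7],
    [1, 0, 5, 6],
    [6, 5, 0, 2],
    [7, 6, 2, 0],
]
groups = single_link_clusters(distances, threshold=3)
print(groups)
```

With these values, documents 0 and 1 end up in one group and documents 2 and 3 in another, because the remaining inter-group distance (5) exceeds the threshold.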

Structural summaries of the trees are used to improve the performance of the distance calculation while maintaining or even improving its accuracy.
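The abstract does not say how the summaries are built. One simple possibility, shown here purely as an assumption, is to keep only one copy of structurally identical sibling subtrees, so that repeated elements do not inflate the tree. Trees are written as hypothetical `(label, children)` tuples.

```python
def summarize(tree):
    """Return a reduced copy of a (label, children) tree in which
    structurally identical sibling subtrees are kept only once."""
    label, children = tree
    seen, kept = set(), []
    for child in children:
        reduced = summarize(child)
        if reduced not in seen:  # first occurrence of this substructure
            seen.add(reduced)
            kept.append(reduced)
    return (label, tuple(kept))


def size(tree):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(child) for child in tree[1])


# Hypothetical document: three books, two of them with the same structure.
doc = ("library", (
    ("book", (("title", ()), ("author", ()), ("author", ()))),
    ("book", (("title", ()), ("author", ()))),
    ("book", (("title", ()), ("author", ()))),
))
summary = summarize(doc)
print(size(doc), size(summary))
```

A smaller tree means fewer node pairs for the distance calculation to compare, which is the kind of saving the summaries are meant to provide.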

The work is conducted on real and synthetic XML documents.

Two sets of 1000 documents are generated from 10 real-case and synthetic DTDs (Document Type Definitions), alongside three real datasets from the ACM (Association for Computing Machinery).

XML documents are represented as trees using DOM (Document Object Model), accessed through a C# object-oriented tool. DOM is chosen over SAX (Simple API for XML), which is also available as a C# object-oriented tool, because DOM loads the whole XML document into memory (all nodes and the relations between them), which gives fast access when calculating the tree edit distance.
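The thesis works in C#; as a stand-in, the sketch below uses Python's standard `xml.dom.minidom` to show the same idea of loading a whole document into memory and reading off its element structure as an ordered labeled tree. The helper names `dom_to_tree` and `parse_structure` and the sample document are hypothetical.

```python
from xml.dom import minidom


def dom_to_tree(node):
    """Convert a DOM element into a (label, children) tuple.

    Only element nodes are kept (text and attributes are ignored), and
    children stay in document order, giving a rooted ordered labeled tree.
    """
    children = tuple(
        dom_to_tree(child)
        for child in node.childNodes
        if child.nodeType == child.ELEMENT_NODE
    )
    return (node.tagName, children)


def parse_structure(xml_text):
    """Parse XML text with DOM and return the structure of its root element."""
    document = minidom.parseString(xml_text)
    return dom_to_tree(document.documentElement)


sample = "<paper><title>T</title><author><name>A</name></author></paper>"
tree = parse_structure(sample)
print(tree)
```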

Chawathe's algorithm measures the tree edit distance between two rooted, ordered trees with vertex labels: it finds the minimum cost of transforming one tree into the other by a sequence of elementary operations that delete or relabel existing nodes and insert new nodes.
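To make the notion of tree edit distance concrete, here is a minimal sketch of the general recursive formulation for ordered forests with memoization. It is not Chawathe's algorithm, whose details the abstract does not give; unit costs for delete, insert and relabel are an assumption, and trees are hypothetical `(label, children)` tuples.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Edit distance between two ordered forests (tuples of trees).

    Each step handles the root of the rightmost tree: delete it, insert
    it, or match it against the other rightmost root (relabel if needed).
    """
    if not f and not g:
        return 0
    if not f:
        _, g_kids = g[-1]
        return forest_distance(f, g[:-1] + g_kids) + 1  # insert root of g
    if not g:
        _, f_kids = f[-1]
        return forest_distance(f[:-1] + f_kids, g) + 1  # delete root of f
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    delete = forest_distance(f[:-1] + f_kids, g) + 1
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    match = (forest_distance(f_kids, g_kids)
             + forest_distance(f[:-1], g[:-1])
             + (0 if f_label == g_label else 1))
    return min(delete, insert, match)


def tree_edit_distance(t1, t2):
    """Minimum number of unit-cost edits turning tree t1 into tree t2."""
    return forest_distance((t1,), (t2,))


t_a = ("a", (("b", ()), ("c", ())))
t_b = ("a", (("b", ()), ("d", ())))  # one label differs
t_c = ("a", (("b", ()),))            # one node missing
print(tree_edit_distance(t_a, t_b), tree_edit_distance(t_a, t_c))
```

This plain recursion is only practical for small trees; dedicated algorithms exist precisely to bring the cost down on larger documents.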

With Chawathe's algorithm, the average time to derive the structural summaries of two XML documents plus the time to calculate the structural distance between those summaries is 95% less than the time to calculate the structural distance between the rooted ordered labeled trees of the two documents (without structural summaries).

Chawathe's algorithm with structural summaries achieves high precision (PR = 0.963) and high recall (R = 0.963). Compared with the results of similar works, such as "Evaluating Structural Similarity in XML Documents" by Andrew Nierman and H. Jagadish, this indicates good performance and good clustering accuracy.
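The abstract reports precision and recall without defining them. A common convention for evaluating a cluster against a known class (here, the DTD a document was generated from) is sketched below as an assumption: precision is the share of the cluster that belongs to the class, and recall is the share of the class that landed in the cluster. The function name, document identifiers and labels are hypothetical, and the sketch assumes a non-empty cluster and a class with at least one document.

```python
def precision_recall(cluster, cluster_class, true_class):
    """Precision and recall of one cluster with respect to one class.

    a: documents in the cluster that belong to the class
    b: documents in the cluster that belong to another class
    c: documents of the class that are missing from the cluster
    """
    a = sum(1 for doc in cluster if true_class[doc] == cluster_class)
    b = len(cluster) - a
    c = sum(1 for doc, label in true_class.items()
            if label == cluster_class and doc not in cluster)
    return a / (a + b), a / (a + c)


# Hypothetical ground truth: which DTD each document was generated from.
true_class = {"d1": "A", "d2": "A", "d3": "A", "d4": "B"}
cluster = {"d1", "d2", "d4"}

precision, recall = precision_recall(cluster, "A", true_class)
print(precision, recall)
```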

Main subjects

Information Technology and Computer Science

Number of pages

825

Table of contents

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One : Introduction.

Chapter Two : XML structure and methods.

Chapter Three : XML clustering.

Chapter Four : Experimental evaluation.

Chapter Five : Conclusions and future works.

References.

American Psychological Association (APA) citation style

Najm, Yasir Abd al-Hamid. (2014). Clustering XML documents based on structure (Master's thesis). University of Baghdad, Iraq.
https://search.emarefa.net/detail/BIM-605966

Modern Language Association (MLA) citation style

Najm, Yasir Abd al-Hamid. Clustering XML documents based on structure. Master's thesis. University of Baghdad, 2014.
https://search.emarefa.net/detail/BIM-605966

American Medical Association (AMA) citation style

Text language

English

Data type

Theses

Record number

BIM-605966
