Clustering XML documents based on structure
Other titles
تجميع وثائق XML على أساس الهيكل
Thesis author
Thesis supervisor
Committee members
al-Nayar, Muna Muhammad
Abd Allah, Nada Abd al-Zahrah
Hammud, Haydar Kazim
University
University of Baghdad
College
College of Science
Academic department
Department of Computer Science
University country
Iraq
Degree
Master's
Degree date
2014
English abstract
XML (Extensible Markup Language) has become a gateway for communication between applications, even applications running on very different systems. As the number of XML documents on the Web grows, it becomes essential to organize these documents effectively so that useful information can be retrieved from them.
A possible solution is to apply clustering on the XML documents to discover knowledge that promotes effective data management, information retrieval and query processing.
This thesis presents a technique for clustering XML documents by structure: the documents are modeled as rooted ordered labeled trees, and structural distance metrics are used in hierarchical clustering algorithms to detect groups of structurally similar XML documents.
Structural summaries of the trees are used to improve the performance of the distance calculation while maintaining or even improving its accuracy.
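As a rough illustration of the hierarchical clustering step, the following Python sketch runs single-link agglomerative clustering over a precomputed matrix of structural distances. The function name `single_link_clusters`, the choice of single linkage, the stopping rule (a target number of clusters `k`), and the distance values are all assumptions made for this example; the abstract does not state which linkage or stopping criterion the thesis uses.

```python
def single_link_clusters(dist, k):
    """Agglomerative clustering with single linkage (illustrative sketch).

    dist is a symmetric matrix of pairwise distances between documents.
    The two closest clusters are merged repeatedly until k clusters remain.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > max(k, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single link: distance between the closest pair of members.
                link = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link < best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Hypothetical distances between four documents (not data from the thesis).
dist = [
    [0, 1, 6, 7],
    [1, 0, 5, 6],
    [6, 5, 0, 2],
    [7, 6, 2, 0],
]
result = single_link_clusters(dist, 2)
```

With these made-up distances, documents 0 and 1 end up in one cluster and documents 2 and 3 in the other.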
The work is conducted on real and synthetic XML documents.
Two sets of 1000 documents are generated from 10 real-case and synthetic DTDs (Document Type Definitions), and three real datasets from the ACM (Association for Computing Machinery) are also used.
XML documents are represented as trees using DOM (Document Object Model), accessed through a C# object-oriented tool. DOM was chosen over SAX (Simple API for XML), which is also available as a C# object-oriented tool, because DOM loads the whole XML document into memory (all nodes and the relations between them), which gives fast access when calculating the tree edit distance.
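To make the tree model concrete, here is a minimal Python sketch that loads a document with a DOM parser and keeps only its element structure as a rooted ordered labeled tree. It uses `xml.dom.minidom` from the Python standard library rather than the C# tooling mentioned in the abstract, and the names `dom_to_tree` and `parse_structure` and the `(label, children)` tuple layout are assumptions for illustration.

```python
from xml.dom import minidom


def dom_to_tree(node):
    """Convert a DOM element into (label, [children]), preserving child order.

    Text, comment and attribute content is ignored: only element structure
    is kept, which is what a purely structural comparison needs.
    """
    children = [
        dom_to_tree(child)
        for child in node.childNodes
        if child.nodeType == child.ELEMENT_NODE
    ]
    return (node.tagName, children)


def parse_structure(xml_text):
    """Parse an XML string fully into memory and return its element tree."""
    doc = minidom.parseString(xml_text)
    return dom_to_tree(doc.documentElement)


tree = parse_structure("<a><b><c/></b><b/><d>text</d></a>")
```

Here `tree` is the nested tuple `('a', [('b', [('c', [])]), ('b', []), ('d', [])])`.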
Chawathe's algorithm measures the tree edit distance between two rooted ordered trees with vertex labels: the minimum cost of transforming one tree into the other by a sequence of elementary operations, namely deleting and relabeling existing nodes and inserting new nodes.
On average, the time needed to derive the structural summaries of two XML documents plus the time to calculate the structural distance between those summaries with Chawathe's algorithm is 95% less than the time to calculate the structural distance between the two documents' full rooted ordered labeled trees (without structural summaries).
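The idea of an edit distance between ordered labeled trees can be sketched in a few lines of Python. Note that this is not Chawathe's algorithm: it is a simpler, more restricted variant (Selkow's top-down distance), in which relabeling a node costs 1 and inserting or deleting a whole subtree costs its number of nodes. The unit costs, the `(label, children)` tuple representation, and the function names are assumptions made for this example.

```python
def size(tree):
    """Number of nodes in a (label, children) tree."""
    _, children = tree
    return 1 + sum(size(child) for child in children)


def selkow_distance(t1, t2):
    """Selkow-style edit distance between two ordered labeled trees.

    Roots are matched (relabel cost 1 if labels differ); child sequences are
    aligned with a string-edit-style dynamic program in which deleting or
    inserting a child costs the size of its subtree.
    """
    (label1, kids1), (label2, kids2) = t1, t2
    m, n = len(kids1), len(kids2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    d[0][0] = 0 if label1 == label2 else 1
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + size(kids1[i - 1])      # delete subtree
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + size(kids2[j - 1])      # insert subtree
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + size(kids1[i - 1]),
                d[i][j - 1] + size(kids2[j - 1]),
                d[i - 1][j - 1] + selkow_distance(kids1[i - 1], kids2[j - 1]),
            )
    return d[m][n]


# Two small hypothetical document trees.
tree_a = ("a", [("b", [("c", [])]), ("d", [])])
tree_b = ("a", [("b", []), ("e", [])])
distance = selkow_distance(tree_a, tree_b)
```

For these two trees the distance is 2: delete node `c`, then relabel `d` to `e`.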
Chawathe's algorithm with structural summaries achieves high precision (PR = 0.963) and high recall (R = 0.963). Compared with results in similar work, such as "Evaluating Structural Similarity in XML Documents" by Andrew Nierman and H. V. Jagadish, this indicates good performance and good clustering accuracy.
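The abstract does not define how precision and recall are computed. One common convention for clustering evaluation, assumed here, matches each cluster to a reference class (for example a DTD) and counts, per cluster, the documents correctly assigned (`a`), wrongly assigned to it (`b`), and wrongly left out of it (`c`). The Python sketch below pools these counts over all clusters; the function name and the example counts are hypothetical and are not results from the thesis.

```python
def precision_recall(counts):
    """Pooled precision and recall over clusters.

    counts is a list of (a, b, c) tuples, one per cluster:
      a = documents correctly placed in the cluster,
      b = documents placed in the cluster that belong elsewhere,
      c = documents that belong in the cluster but were placed elsewhere.
    """
    a = sum(item[0] for item in counts)
    b = sum(item[1] for item in counts)
    c = sum(item[2] for item in counts)
    precision = a / (a + b) if a + b else 0.0
    recall = a / (a + c) if a + c else 0.0
    return precision, recall


# Hypothetical counts for two clusters.
pr, r = precision_recall([(9, 1, 1), (8, 2, 0)])
```

With these made-up counts, precision is 17/20 and recall is 17/18.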
Main subjects
Information Technology and Computer Science
Number of pages
825
Table of contents
Table of contents.
Abstract.
Abstract in Arabic.
Chapter One : Introduction.
Chapter Two : XML structure and methods.
Chapter Three : XML clustering.
Chapter Four : Experimental evaluation.
Chapter Five : Conclusions and future works.
References.
American Psychological Association (APA) citation style
Najm, Yasir Abd al-Hamid. (2014). Clustering XML documents based on structure. (Master's thesis). University of Baghdad, Iraq
https://search.emarefa.net/detail/BIM-605966
Modern Language Association (MLA) citation style
Najm, Yasir Abd al-Hamid. Clustering XML documents based on structure. (Master's thesis). University of Baghdad. (2014).
https://search.emarefa.net/detail/BIM-605966
American Medical Association (AMA) citation style
Najm, Yasir Abd al-Hamid. (2014). Clustering XML documents based on structure. (Master's thesis). University of Baghdad, Iraq
https://search.emarefa.net/detail/BIM-605966
Text language
English
Data type
Theses
Record number
BIM-605966