Developing a virtual-machine-based distributed web crawling model

Other titles

تطوير نموذج نظام زحف شبكي موزع معتمدا على آلة افتراضية

Thesis author

al-Qutayshat, Hamzah Mahmud Abd Allah

Thesis supervisor

al-Bahadili, Husayn

Committee members

Kayid, Ahmad K. A.
Naum, Riyad Shakir

University

Middle East University

Faculty

Faculty of Information Technology

Academic department

Department of Computer Information Systems

University country

Jordan

Degree

Master's

Degree date

2012

English abstract

A Web crawler is an important component of a Web search engine.

It demands a large amount of hardware resources (CPU, disk storage, and memory) to crawl data from the daily evolving Web.

Furthermore, the crawling is a continuous process since it works on frequently changing Web pages.

As an immediate solution to the CPU bottleneck in crawling performance, the crawling process can take advantage of recent advances in processor manufacturing technology, which have produced relatively low-cost yet powerful multi-core processors that embed two or more microprocessors (cores) on a single chip.

The system software (operating system, OS) can operate these cores concurrently, which makes programs run faster.

We believe that a gap exists between the system software and the hardware of a computer comprising a multi-core processor, mainly due to OS limitations; as a result, a significant percentage of the processor's hardware resources is not fully utilized.

One way to bridge the gap between the processor hardware resources and the system software, seeking optimum processor utilization, is to apply virtualization, i.e., to divide the processor into a number of virtual machines (VMs).

This thesis presents a description, implementation, and evaluation of a VM-based distributed Web crawling model.

In this model, the multi-core processor is divided into a number of VMs each performing part of the crawling process.

The VMs are configured in a master-slave architecture, where one of the VMs runs as the master while the other machines run as slaves.

Two implementation tools were developed for the crawling process: the first tool (SCrawler) runs on a computer with a multi-core processor and no VMs installed (no virtualization); the second tool (DCrawler) runs on a computer with a multi-core processor and a number of VMs installed, performing distributed crawling.
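The master-slave division of work described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the thesis's SCrawler/DCrawler code) in which worker threads stand in for the slave VMs and the page fetch is simulated:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl(url):
    """Slave task: fetch one page and return its outgoing links.
    The fetch is simulated here: each page yields a single child link."""
    return url, [url + "/0"]

def master(seed_urls, num_slaves, max_pages):
    """Master: owns the URL frontier and dispatches batches of URLs
    to the slaves, collecting crawled pages and newly found links."""
    frontier = list(seed_urls)
    seen = set(frontier)
    crawled = []
    with ThreadPoolExecutor(max_workers=num_slaves) as pool:
        while frontier and len(crawled) < max_pages:
            # Hand one URL to each available slave.
            batch, frontier = frontier[:num_slaves], frontier[num_slaves:]
            for url, links in pool.map(crawl, batch):
                if len(crawled) < max_pages:
                    crawled.append(url)
                for link in links:
                    if link not in seen:  # avoid re-crawling a page
                        seen.add(link)
                        frontier.append(link)
    return crawled

pages = master(["http://example.com"], num_slaves=4, max_pages=10)
print(len(pages))  # 10
```

In the thesis each slave is a full VM running on the same multi-core processor; the threads here merely illustrate the same division of labour, with the frontier owned by the master and the fetches executed by the slaves.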

To evaluate the performance of the VM-based distributed crawling model, extensive crawling experiments were carried out using SCrawler and DCrawler to measure the actual crawling times of equivalent sequential and distributed computations; the speedup achieved by the VM-based crawling system over the same system without virtualization was then computed for various numbers of crawled documents.

The distribution efficiency and average crawling rate in documents per unit time are also computed and compared.

The effect of the number of VMs is also investigated.
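Speedup and distribution efficiency, as used in the evaluation above, are the standard parallel-performance metrics. A minimal sketch of how they are computed, using a roughly 32% CPU-time reduction on four VMs as a hypothetical example (the figures are illustrative, not the thesis's measured data):

```python
def speedup(t_sequential, t_distributed):
    """Speedup S = T_seq / T_dist of the distributed run over the sequential run."""
    return t_sequential / t_distributed

def efficiency(s, num_vms):
    """Distribution efficiency E = S / N for N virtual machines."""
    return s / num_vms

# Hypothetical timings: a 32% CPU-time reduction achieved on 4 VMs.
t_seq, t_dist, n_vms = 100.0, 68.0, 4
s = speedup(t_seq, t_dist)
e = efficiency(s, n_vms)
rate = 500 / t_dist  # average crawling rate (documents per unit time) for 500 documents
print(round(s, 2), round(e, 2))  # 1.47 0.37
```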

The results demonstrate that the VM-based distributed Web crawling model reduces crawling CPU time by nearly 32%, increases speedup and efficiency, utilizes hardware resources better, increases availability and reliability, and yields a significant increase in crawling rate (pages per unit time).

Main subjects

Information Technology and Computer Science

Number of pages

95

Table of contents

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One: Introduction.

Chapter Two: Literature review.

Chapter Three: The VM-based distributed crawling model.

Chapter Four: Results and discussions.

Chapter Five: Conclusions and recommendations for future work.

References.

American Psychological Association (APA) citation style

al-Qutayshat, Hamzah Mahmud Abd Allah. (2012). Developing a virtual-machine-based distributed web crawling model (Master's thesis). Middle East University, Jordan.
https://search.emarefa.net/detail/BIM-694590

Modern Language Association (MLA) citation style

al-Qutayshat, Hamzah Mahmud Abd Allah. Developing a virtual-machine-based distributed web crawling model. Master's thesis, Middle East University, 2012.
https://search.emarefa.net/detail/BIM-694590

American Medical Association (AMA) citation style

al-Qutayshat, Hamzah Mahmud Abd Allah. Developing a virtual-machine-based distributed web crawling model [Master's thesis]. Middle East University; 2012.
https://search.emarefa.net/detail/BIM-694590

Text language

English

Data type

University theses

Record number

BIM-694590