Developing a virtual-machine-based distributed web crawling model

Other Title(s)

تطوير نموذج نظام زحف شبكي موزع معتمدا على آلة افتراضية (Developing a distributed web crawling system model based on a virtual machine)

University

Middle East University

Faculty

Faculty of Information Technology

Department

Department of Computer Information Systems

University Country

Jordan

Degree

Master

Degree Date

2012

English Abstract

A Web crawler is an important component of a Web search engine.

It demands a large amount of hardware resources (CPU, disk storage, and memory) to crawl data from a Web that changes daily.

Furthermore, crawling is a continuous process, because the Web pages it works on change frequently.

As an immediate solution to the CPU bottleneck in crawling performance, the crawling process can take advantage of recent advances in processor manufacturing, which have produced relatively low-cost and powerful multi-core processors that combine two or more microprocessors (cores) on a single chip.

The system software (OS) can operate these cores concurrently, which makes programs run faster.

We believe that a gap exists between the system software and the hardware of a computer with a multi-core processor, mainly due to OS limitations; as a result, a significant percentage of the processor's hardware resources is not fully utilized.

One way to bridge this gap and approach optimum processor utilization is to apply virtualization, i.e., to divide the processor into a number of virtual machines (VMs).

This thesis presents a description, implementation, and evaluation of a VM-based distributed Web crawling model.

In this model, the multi-core processor is divided into a number of VMs, each performing part of the crawling process.

The VMs are configured in a master-slave architecture, in which one VM runs as the master and the others run as slaves.
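
The abstract does not say how the master hands work to the slaves. As a minimal sketch only, one common policy is for the master to split the URL frontier by hashing each URL's host name, so that all pages of one site go to the same slave. The function names `assign_slave` and `partition_frontier`, and the host-hash policy itself, are assumptions for illustration and are not taken from the thesis.

```python
from collections import deque
from urllib.parse import urlparse
import zlib


def assign_slave(url: str, num_slaves: int) -> int:
    """Map a URL to a slave index by hashing its host name (assumed policy)."""
    host = urlparse(url).netloc.lower()
    # crc32 is deterministic across runs, unlike Python's built-in hash() for str.
    return zlib.crc32(host.encode("utf-8")) % num_slaves


def partition_frontier(urls, num_slaves: int):
    """Master-side step: split the URL frontier into one queue per slave VM."""
    if num_slaves < 1:
        raise ValueError("num_slaves must be at least 1")
    queues = [deque() for _ in range(num_slaves)]
    seen = set()
    for url in urls:
        if url in seen:  # skip duplicates so no page is queued twice
            continue
        seen.add(url)
        queues[assign_slave(url, num_slaves)].append(url)
    return queues
```

In a real deployment each queue would be sent to its slave VM over the network; here the queues are plain in-memory objects so the partitioning step can be read on its own.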

Two implementation tools were developed for the crawling process: the first (SCrawler) runs on a computer with a multi-core processor and no VMs installed (no virtualization); the second (DCrawler) runs on a computer with a multi-core processor and a number of VMs installed, and performs distributed crawling.

To evaluate the performance of the VM-based distributed crawling model, extensive crawling experiments were carried out using SCrawler and DCrawler to estimate the actual crawling time for equivalent sequential and distributed computations. The speedup achieved by the VM-based crawling system over the same system with no virtualization was then computed for various numbers of crawled documents.

The distribution efficiency and average crawling rate in documents per unit time are also computed and compared.
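
The abstract names these metrics without giving formulas. The sketch below uses the standard definitions from parallel computing: speedup as sequential time over distributed time, efficiency as speedup divided by the number of VMs, and crawling rate as documents per unit time. Whether the thesis defines them exactly this way is an assumption, and the function names are illustrative.

```python
def speedup(t_sequential: float, t_distributed: float) -> float:
    """Speedup S = T_s / T_d (standard definition, assumed here)."""
    if t_sequential <= 0 or t_distributed <= 0:
        raise ValueError("crawling times must be positive")
    return t_sequential / t_distributed


def efficiency(t_sequential: float, t_distributed: float, num_vms: int) -> float:
    """Efficiency E = S / N, where N is the number of VMs (assumed definition)."""
    if num_vms < 1:
        raise ValueError("num_vms must be at least 1")
    return speedup(t_sequential, t_distributed) / num_vms


def crawling_rate(num_documents: int, elapsed: float) -> float:
    """Average crawling rate in documents per unit time."""
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return num_documents / elapsed
```

For example, with made-up timings of 100 s for the sequential run and 50 s for a run on 4 VMs, `speedup(100.0, 50.0)` gives 2.0 and `efficiency(100.0, 50.0, 4)` gives 0.5.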

The effect of the number of VMs is also investigated.

The results demonstrate that the VM-based distributed Web crawling model reduces crawling CPU time by nearly 32%, increases speedup and efficiency, makes better use of hardware resources, improves availability and reliability, and significantly increases the crawling rate (pages per unit time).
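
As a rough reading of that figure, assume the reduction of nearly 32% is measured relative to the non-virtualized crawling time $T_s$, and that speedup is defined as $S = T_s / T_d$ (both are assumptions, since the abstract does not state them). The implied speedup is then:

```latex
T_d = (1 - 0.32)\,T_s = 0.68\,T_s
\quad\Longrightarrow\quad
S = \frac{T_s}{T_d} = \frac{1}{0.68} \approx 1.47
```

Under those assumptions, the distributed crawler would be roughly one and a half times as fast as the sequential one.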

Main Subjects

Information Technology and Computer Science

No. of Pages

Table of contents.

Abstract.

Abstract in Arabic.

Chapter One: Introduction.

Chapter Two: Literature review.

Chapter Three: The VM-based distributed crawling model.

Chapter Four: Results and discussions.

Chapter Five: Conclusions and recommendations for future work.

References.

American Psychological Association (APA)

al-Qutayshat, Hamzah Mahmud Abd Allah. (2012). Developing a virtual-machine-based distributed web crawling model (Master's thesis). Middle East University, Jordan.
https://search.emarefa.net/detail/BIM-694590

Modern Language Association (MLA)

al-Qutayshat, Hamzah Mahmud Abd Allah. Developing a virtual-machine-based distributed web crawling model. Master's thesis. Middle East University, 2012.
https://search.emarefa.net/detail/BIM-694590

American Medical Association (AMA)

Language

English

Data Type

Arab Theses

Record ID

BIM-694590
