Evaluation of a machine learning algorithm for a development of a web crawling system

Dissertant

Hamdushi, Mounia

Thesis advisor

Bin Said, Amini

University

Al Akhawayn University

Faculty

School of Science and Engineering

Department

Computer Science

University Country

Morocco

Degree

Master

Degree Date

2001

English Abstract

Search engines are widely used and increasingly popular for finding specific information or applications on the World Wide Web.

Unfortunately, while these search engines offer high coverage, they often provide low precision.

In order to obtain more relevant pages first, and to concentrate on links that lead to documents of interest, this project investigates the use of machine learning techniques for the development of an intelligent web crawling system that populates the database of a domain-specific search engine with relevant documents.

A machine learning tool was then employed to produce an efficient web crawler and to automate Web page classification.

A learning algorithm is used to classify pages as belonging or not belonging to one of a number of predetermined topics or categories.

In our learning experiments, a Naive Bayes classifier is applied to text data using a bag-of-words document representation.
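
As an illustration only, a Naive Bayes classifier over a bag-of-words representation might be sketched as follows. The record does not describe the thesis implementation, so the multinomial model, the add-one (Laplace) smoothing, the tokenizer, and the names `tokenize` and `NaiveBayes` are all assumptions made for this sketch.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (bag of words)."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (an assumed variant)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)          # documents per class
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        self.total_docs = len(labels)
        return self

    def log_scores(self, doc):
        """Return log P(class) + sum of log P(word | class) for each class."""
        scores = {}
        vocab_size = len(self.vocab)
        tokens = [w for w in tokenize(doc) if w in self.vocab]  # skip unseen words
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / self.total_docs)
            for w in tokens:
                score += math.log((self.word_counts[c][w] + 1) / (total + vocab_size))
            scores[c] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)
```

Working in log space avoids numeric underflow when many small word probabilities are multiplied together.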

The Naive Bayes classifier is used to rate links in a page being scanned for their potential to lead to documents of interest.
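
One way such link ratings could drive a crawler is a best-first frontier, in which the highest-rated link is always visited next. The sketch below assumes that strategy; the record does not state how the thesis orders its crawl. `crawl_best_first`, `get_links`, and `score` are hypothetical names, and `score` stands in for whatever rating the classifier produces.

```python
import heapq
from itertools import count


def crawl_best_first(seeds, get_links, score, max_pages):
    """Visit pages in order of decreasing link score; return the visit order.

    get_links(url) returns the outgoing links of a page and score(url)
    returns a rating where higher means more promising (both hypothetical).
    """
    tie = count()  # keeps insertion order among links with equal scores
    # heapq is a min-heap, so scores are negated to pop the best link first.
    frontier = [(-score(u), next(tie), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score(link), next(tie), link))
    return visited
```

In a toy link graph, pages whose links are rated highly are reached before lower-rated pages discovered earlier.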

As the application of interest to the user, I chose the purchase of a laptop, and I collected from the web and hand-labeled the necessary example pages.

Our experimental evaluation consists of cross-validation experiments.
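
A k-fold cross-validation loop can be sketched as below. The number of folds and the fold assignment used in the thesis are not given in this record, so the interleaved split and the helper names `k_fold_indices`, `cross_val_accuracy`, `train_fn`, and `predict_fn` are assumptions for illustration.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs; fold i tests every k-th item starting at i."""
    for i in range(k):
        test = list(range(i, n, k))
        train = [j for j in range(n) if j % k != i]
        yield train, test


def cross_val_accuracy(items, labels, k, train_fn, predict_fn):
    """Train on k-1 folds, test on the held-out fold, and pool the accuracy."""
    correct = 0
    for train, test in k_fold_indices(len(items), k):
        model = train_fn([items[j] for j in train], [labels[j] for j in train])
        correct += sum(predict_fn(model, items[j]) == labels[j] for j in test)
    return correct / len(items)
```

Every example is used for testing exactly once, so the pooled accuracy is computed over the whole labeled set.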

We have experimented with two different data representations: (1) the text contained in the URL address, (2) the full text in the Web page corresponding to the URL address.

Our intelligent crawling approach gives better estimates of the potential of a link when scanning the whole text in the Web page corresponding to the URL than when using only the text contained in its URL address.

Additionally, for the second type of data representation, we reduce (in an automated way) the number of words selected to represent our documents (up to 95% vocabulary reduction).

We experiment with two different approaches to selecting words: (1) elimination of infrequent words, and (2) use of a feature selection method that determines the relative usefulness of features. In this work we have used the following two methods: term frequency-inverse document frequency and information gain.
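
The two approaches can be illustrated with a short sketch: a document-frequency cutoff for infrequent words, and information gain computed from the presence or absence of a term. The thresholds, the binary-presence formulation, and the function names are assumptions; the record does not give the exact definitions used in the thesis, and the TF-IDF scoring variant is omitted here for that reason.

```python
import math
from collections import Counter


def drop_infrequent(docs, min_df):
    """Keep terms that occur in at least min_df documents (docs are token sets)."""
    df = Counter(t for d in docs for t in d)
    return {t for t, n in df.items() if n >= min_df}


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(term, docs, labels):
    """Reduction in class entropy from knowing whether a document contains term."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(with_t) / n) * entropy(with_t) \
        + (len(without_t) / n) * entropy(without_t)
    return entropy(labels) - conditional


def select_top_terms(docs, labels, keep):
    """Rank the vocabulary by information gain and return the best `keep` terms."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(vocab, key=lambda t: information_gain(t, docs, labels),
                    reverse=True)
    return ranked[:keep]
```

A term that appears in all documents of one class and none of the other receives the maximum gain, while a term spread evenly across classes scores near zero.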

Our learning algorithm achieves better results with smaller numbers of words, while for larger subsets the performance drops substantially (by removing 80% of the words from our vocabulary we achieve 75% accuracy, whereas by using the whole vocabulary we achieve 59%).

Finally, we show that the feature scoring measure called Term Frequency-Inverse Document Frequency achieved the better performance of the two feature scoring measures used.

Main Subjects

Information Technology and Computer Science

No. of Pages

66

Table of Contents

Table of contents.

Abstract.

Chapter One : Introduction.

Chapter Two : Software project requirements.

Chapter Three : Crawling model.

Chapter Four : Text classification using machine learning.

Chapter Five : Feature selection.

Chapter Six : Experiments.

Chapter Seven : Experimental results and analysis.

Chapter Eight : Summary and conclusions.

References.

American Psychological Association (APA)

Hamdushi, Mounia. (2001). Evaluation of a machine learning algorithm for a development of a web crawling system. (Master's thesis). Al Akhawayn University, Morocco.
https://search.emarefa.net/detail/BIM-645957

Modern Language Association (MLA)

Hamdushi, Mounia. Evaluation of a machine learning algorithm for a development of a web crawling system. (Master's thesis). Al Akhawayn University. (2001).
https://search.emarefa.net/detail/BIM-645957

American Medical Association (AMA)

Hamdushi, Mounia. (2001). Evaluation of a machine learning algorithm for a development of a web crawling system. (Master's thesis). Al Akhawayn University, Morocco.
https://search.emarefa.net/detail/BIM-645957

Language

English

Data Type

Arab Theses

Record ID

BIM-645957