Evaluation of a machine learning algorithm for a development of a web crawling system
Dissertant
Thesis advisor
University
Al Akhawayn University
Faculty
School of Science and Engineering
Department
Computer Science
University Country
Morocco
Degree
Master
Degree Date
2001
English Abstract
Search engines are widely used to find specific information or applications on the World Wide Web.
Unfortunately, while these search engines offer high coverage, they often provide low precision.
To retrieve the most relevant pages first and to concentrate on links that lead to documents of interest, this project investigates the use of machine learning techniques to develop an intelligent web crawling system that populates the database of a domain-specific search engine with relevant documents.
A machine learning tool was then employed to produce an efficient web crawler and to automate Web page classification.
A learning algorithm is used to classify pages as belonging or not belonging to one of a number of predetermined topics or categories.
In our learning experiments, a Naive Bayes classifier is used on text data using bag of words document representation.
The Naive Bayes classifier is used to rate links in a page being scanned for their potential to lead to documents of interest.
As the application of interest to the user, I chose the purchase of a laptop, and I collected and hand-labeled the necessary example pages from the web.
Our experimental evaluation consists of cross-validation experiments.
We have experimented with two different data representations: (1) the text contained in the URL address, (2) the full text in the Web page corresponding to the URL address.
Our intelligent crawling approach gives better estimates of the potential of a link when scanning the whole text in the Web page corresponding to the URL than when using only the text contained in its URL address.
Additionally, for the second type of data representation, we reduce (in an automated way) the number of words selected to represent our documents (up to 95% vocabulary reduction).
We experiment with two different approaches to selecting words: (1) elimination of infrequent words, and (2) use of a feature selection method that determines the relative usefulness of features; in this work we have used the following two methods: term frequency-inverse document frequency and information gain.
Our learning algorithm achieves better results with smaller numbers of words, while for larger subsets the performance drops substantially (by removing 80% of the words from our vocabulary we achieve 75% accuracy, whereas by using the whole vocabulary we achieve 59%).
Finally, we show that the feature scoring measure called Term Frequency-Inverse Document Frequency achieved the better performance of the two feature scoring measures used.
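The pipeline described in the abstract (bag-of-words representation, vocabulary reduction by a feature scoring measure, and a Naive Bayes classifier) can be illustrated with a minimal Python sketch. This is not the thesis's implementation: the function names, the toy documents, the class labels "relevant" and "other", and the use of add-one (Laplace) smoothing are all assumptions made for illustration, and only information gain is shown as the scoring measure.

```python
import math
from collections import Counter

def tokenize(text):
    """Bag-of-words tokens: lowercase, split on non-alphanumeric characters."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(docs, labels, word):
    """Reduction in label entropy from knowing whether `word` occurs in a document."""
    present = [y for d, y in zip(docs, labels) if word in d]
    absent = [y for d, y in zip(docs, labels) if word not in d]
    n = len(labels)
    return (entropy(labels)
            - (len(present) / n) * entropy(present)
            - (len(absent) / n) * entropy(absent))

def select_vocabulary(docs, labels, keep_fraction):
    """Keep the top `keep_fraction` of words ranked by information gain."""
    vocab = sorted({w for d in docs for w in d})
    ranked = sorted(vocab, key=lambda w: information_gain(docs, labels, w), reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return set(ranked[:k])

def train_naive_bayes(docs, labels, vocab):
    """Count documents per class and in-vocabulary word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = {c: Counter() for c in class_counts}
    for d, y in zip(docs, labels):
        word_counts[y].update(w for w in d if w in vocab)
    return class_counts, word_counts

def classify(text, model, vocab):
    """Return the class with the highest log posterior (add-one smoothing assumed)."""
    class_counts, word_counts = model
    total_docs = sum(class_counts.values())
    tokens = [w for w in tokenize(text) if w in vocab]
    best_class, best_score = None, -math.inf
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        score = math.log(class_counts[c] / total_docs)
        for w in tokens:
            score += math.log((word_counts[c][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical hand-labeled examples standing in for the collected pages.
raw_pages = [
    ("cheap laptop for sale buy now", "relevant"),
    ("buy laptop notebook price deals", "relevant"),
    ("football match results today", "other"),
    ("weather forecast rain today", "other"),
]
docs = [tokenize(text) for text, _ in raw_pages]
labels = [label for _, label in raw_pages]

# Keep 20% of the vocabulary, mirroring the 80% reduction mentioned in the abstract.
vocab = select_vocabulary(docs, labels, keep_fraction=0.2)
model = train_naive_bayes(docs, labels, vocab)
print(classify("laptop price", model, vocab))
```

In a crawler of the kind the abstract describes, the same scoring could in principle be applied either to the text of a URL or to the full text of the linked page, and the resulting score used to decide which links to follow first; how the thesis actually wires this into the crawl is not stated in this record.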
Main Subjects
Information Technology and Computer Science
Topics
No. of Pages
66
Table of Contents
Table of contents.
Abstract.
Chapter One : Introduction.
Chapter Two : Software project requirements.
Chapter Three : Crawling model.
Chapter Four : Text classification using machine learning.
Chapter Five : Feature selection.
Chapter Six : Experiments.
Chapter Seven : Experimental results and analysis.
Chapter Eight : Summary and conclusions.
References.
American Psychological Association (APA)
Hamdushi, Mounia. (2001). Evaluation of a machine learning algorithm for a development of a web crawling system. (Master's thesis). Al Akhawayn University, Morocco
https://search.emarefa.net/detail/BIM-645957
Modern Language Association (MLA)
Hamdushi, Mounia. Evaluation of a machine learning algorithm for a development of a web crawling system. (Master's thesis). Al Akhawayn University. (2001).
https://search.emarefa.net/detail/BIM-645957
American Medical Association (AMA)
Hamdushi, Mounia. (2001). Evaluation of a machine learning algorithm for a development of a web crawling system. (Master's thesis). Al Akhawayn University, Morocco
https://search.emarefa.net/detail/BIM-645957
Language
English
Data Type
Arab Theses
Record ID
BIM-645957