Evaluation of a machine learning algorithm for a development of a web crawling system

Thesis author

Hamdushi, Mounia

Thesis advisor

Bin Said, Amini

University

Al Akhawayn University

Faculty

Faculty of Engineering and Science

Academic department

Computer Science

University country

Morocco

Degree

Master's

Degree date

2001

English abstract

Search engines are widely used and increasingly popular for finding specific information or applications on the World Wide Web.

Unfortunately, while these search engines offer high coverage, they often provide low precision.

To obtain the most relevant pages first, and to concentrate on links that lead to documents of interest, this project investigates the use of machine learning techniques for the development of an intelligent web crawling system that populates the database of a domain-specific search engine with relevant documents.

A machine learning tool is then employed to produce an efficient web crawler and to automate Web page classification.

A learning algorithm is used to classify pages as belonging or not belonging to one of a number of predetermined topics or categories.

In our learning experiments, a Naive Bayes classifier is applied to text data using a bag-of-words document representation.
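As a rough illustration of the technique named above, the following sketch trains a multinomial Naive Bayes classifier on bag-of-words counts. The tokenizer, the add-one (Laplace) smoothing, the class labels, and the toy documents are all assumptions made for illustration; the abstract does not specify them.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words counts (add-one smoothing)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, doc):
        """Return the unnormalized log posterior of each class."""
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(doc):
                if word in self.vocab:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)


# Toy, made-up training pages (not the thesis data).
docs = [
    "buy laptop notebook price battery",
    "laptop deals cheap notebook screen",
    "football match score league",
    "weather forecast rain tomorrow",
]
labels = ["relevant", "relevant", "other", "other"]
model = NaiveBayes().fit(docs, labels)
prediction = model.predict("cheap laptop with long battery life")
```

Working in log space avoids numeric underflow when many word probabilities are multiplied together.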

The Naive Bayes classifier is used to rate links in a page being scanned for their potential to lead to documents of interest.
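One way to act on such link ratings is a best-first crawl frontier, where the highest-rated link is fetched next. The sketch below is an assumption about how this could be organized, not the thesis's design: `keyword_score` is a hypothetical stand-in for the classifier's relevance estimate, and the example URLs are invented.

```python
import heapq


class Frontier:
    """Best-first crawl frontier: the highest-scored link is popped first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url, score):
        if url in self._seen:
            return  # never queue the same URL twice
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def keyword_score(text):
    """Hypothetical stand-in for a classifier's relevance estimate."""
    words = text.lower().split()
    hits = sum(w in {"laptop", "notebook", "price"} for w in words)
    return hits / len(words) if words else 0.0


frontier = Frontier()
links = {  # invented URLs mapped to the text used to rate each link
    "http://example.com/laptop-prices": "laptop price comparison",
    "http://example.com/sports": "football results today",
    "http://example.com/notebooks": "notebook laptop reviews and news",
}
for url, context in links.items():
    frontier.push(url, keyword_score(context))

crawl_order = [frontier.pop()[0] for _ in range(len(frontier))]
```

In a real crawler, the scoring function would be replaced by the trained classifier, and each fetched page would push its own outgoing links back onto the frontier.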

As the application of interest to the user, I chose the purchase of a laptop, and I collected from the Web and hand-labeled the necessary example pages.

Our experimental evaluation consists of cross-validation experiments.
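For readers unfamiliar with the procedure, here is a minimal sketch of k-fold cross-validation. The number of folds, the toy learner `train_keyword_rule`, and the example pages are assumptions for illustration; the abstract does not state how the folds were built. Shuffling and stratification are omitted for brevity.

```python
def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(examples, labels, k, train):
    """Return the mean accuracy over k folds.

    `train(examples, labels)` must return a function that maps one
    example to a predicted label.
    """
    accuracies = []
    for test_idx in k_fold_indices(len(examples), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(examples) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        predict = train(train_x, train_y)
        correct = sum(predict(examples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k


def train_keyword_rule(examples, labels):
    """Toy learner: remember words that occur only in 'relevant' pages."""
    pairs = list(zip(examples, labels))
    relevant = {w for x, y in pairs if y == "relevant" for w in x.split()}
    other = {w for x, y in pairs if y != "relevant" for w in x.split()}
    keywords = relevant - other
    return lambda x: "relevant" if keywords & set(x.split()) else "other"


# Toy, made-up pages (not the thesis data).
pages = [
    "laptop price", "football score", "laptop battery",
    "weather rain", "cheap laptop", "football weather",
]
page_labels = ["relevant", "other", "relevant", "other", "relevant", "other"]
mean_accuracy = cross_validate(pages, page_labels, 3, train_keyword_rule)
```

Each page is used for testing exactly once and for training in the remaining folds, so the accuracy estimate does not depend on a single train/test split.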

We have experimented with two different data representations: (1) the text contained in the URL address, (2) the full text in the Web page corresponding to the URL address.
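The first representation can be pictured as follows. This is a sketch under assumptions: the abstract does not say how URL text was tokenized, so splitting on non-alphanumeric characters is a guess, and the example URL is invented.

```python
import re
from collections import Counter
from urllib.parse import urlparse


def url_tokens(url):
    """Split a URL's host, path, and query into lowercase word tokens."""
    parts = urlparse(url)
    text = " ".join([parts.netloc, parts.path, parts.query])
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


# Invented example URL.
tokens = url_tokens(
    "http://www.example.com/shop/laptops/cheap-notebook.html?sort=price"
)
bag = Counter(tokens)  # bag-of-words counts for the URL text
```

The second representation would apply the same kind of tokenization to the full text of the fetched page instead of the URL string.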

Our intelligent crawling approach gives better estimates of the potential of a link when scanning the whole text in the Web page corresponding to the URL than when using only the text contained in its URL address.

Additionally, for the second type of data representation, we reduce (in an automated way) the number of words selected to represent our documents (up to 95% vocabulary reduction).

We experiment with two different approaches to select words: (1) elimination of infrequent words, and (2) use of a feature selection method that determines the relative usefulness of features; in this work we have used the following two methods: term frequency-inverse document frequency and information gain.

Our learning algorithm achieves better results with smaller numbers of words, while for larger subsets the performance drops substantially (by removing 80% of the words from our vocabulary we achieve 75% accuracy, whereas using the whole vocabulary we achieve 59%).

Finally, we show that the feature scoring measure called Term Frequency-Inverse Document Frequency achieved the better performance of the two feature scoring measures used.
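The three word-selection ideas mentioned here can be sketched as below. The formulas are common textbook forms and may differ from those used in the thesis. In particular, summing tf × log(N/df) over documents is only one possible way to turn TF-IDF into a per-word score, and the `min_df` threshold and toy documents are invented.

```python
import math
from collections import Counter


def drop_infrequent(docs, min_df):
    """Keep only words that appear in at least `min_df` documents."""
    df = Counter(w for d in docs for w in set(d))
    return {w for w, c in df.items() if c >= min_df}


def entropy(counts):
    """Shannon entropy (in bits) of a label count distribution."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)


def information_gain(docs, labels, word):
    """Reduction in label entropy from knowing whether `word` occurs."""
    with_word = Counter(y for d, y in zip(docs, labels) if word in d)
    without = Counter(y for d, y in zip(docs, labels) if word not in d)
    gain = entropy(Counter(labels).values())
    for part in (with_word, without):
        size = sum(part.values())
        if size:
            gain -= size / len(docs) * entropy(part.values())
    return gain


def tf_idf_score(docs, word):
    """Assumed scoring: total term frequency times log(N / df)."""
    df = sum(word in d for d in docs)
    if df == 0:
        return 0.0
    tf = sum(d.count(word) for d in docs)
    return tf * math.log(len(docs) / df)


# Toy, made-up tokenized pages (not the thesis data).
docs = [
    ["laptop", "price", "the"],
    ["laptop", "battery", "the"],
    ["football", "score", "the"],
    ["weather", "rain", "the"],
]
labels = ["relevant", "relevant", "other", "other"]

vocab = {w for d in docs for w in d}
frequent = drop_infrequent(docs, min_df=2)
best_by_gain = max(vocab, key=lambda w: information_gain(docs, labels, w))
```

In this toy data, a word such as "the" that occurs in every page receives zero information gain and a zero TF-IDF score, which is the kind of uninformative word that feature selection is meant to discard.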

Main subjects

Information Technology and Computer Science

Topics

Number of pages

66

Table of contents

Table of contents.

Abstract.

Chapter One : Introduction.

Chapter Two : Software project requirements.

Chapter Three : Crawling model.

Chapter Four : Text classification using machine learning.

Chapter Five : Feature selection.

Chapter Six : Experiments.

Chapter Seven : Experimental results and analysis.

Chapter Eight : Summary and conclusions.

References.

American Psychological Association (APA) citation style

Hamdushi, Mounia. (2001). Evaluation of a machine learning algorithm for a development of a web crawling system. (Master's thesis). Al Akhawayn University, Morocco
https://search.emarefa.net/detail/BIM-645957

Modern Language Association (MLA) citation style

Hamdushi, Mounia. Evaluation of a machine learning algorithm for a development of a web crawling system. (Master's thesis). Al Akhawayn University. (2001).
https://search.emarefa.net/detail/BIM-645957

American Medical Association (AMA) citation style

Hamdushi, Mounia. (2001). Evaluation of a machine learning algorithm for a development of a web crawling system. (Master's thesis). Al Akhawayn University, Morocco
https://search.emarefa.net/detail/BIM-645957

Text language

English

Data type

Theses

Record number

BIM-645957