Website Classification
Mgr. Juraj Hreško`s thesis
presented by Jaromír Navrátil
7.2.2013
Synopsis
•Task
•Possible solutions
•Solution
•Rare classes
•Possible improvements
•Rewriting to C++
The Task
•create application to classify czech websites
•61 classes
•multi-labeling (1-3 classes for each document)
•real-time classification
•be able to adjust the algorithm to maximize precision or recall
Occurrences Class
678 Insurance
1170 Job / Career
6003 Kids / Toys / Family
1059 Military / Guns
1974 Mobile Phones / Operators
11826 Music / Radio / Cinema / TV
3477 News / Magazines
54 Peer-to-peer
10002 Personal / Dating / Lifestyle
2049 Politics / Law
4077 Pornography
4227 Portals / Search Engines
90 Proxies
2475 Real Estate
6966 Regional
1803 Religious / Spirituality
6405 Sale / Auctions
6 Sects
48 Sex Education
42240 Shopping
288 Social Networks
14913 Sports
120 Streaming / Broadcasting
951 Swimwear / Intimate
384 Translation Services
24537 Travelling / Vacation
1788 Uploading / Downloading
816 Warez / Piracy
135 Web Based Mail
888 Web Hosting
1110 Money / Financial
Occurrences Class
3159 Advertisement
1281 Alcohol / Tobacco
2442 Arts
9756 Cars / Vehicles
1590 Banking
450 Brokers
27066 Building / Home
15045 Business
16998 Chats / Blogs / Forums
1068 Communications
72 Crime
11805 Education
2613 Entertainment
5553 Environment
1575 Erotic / Adult / Nudity
459 Extreme / Hate / Violence
13302 Fashion / Beauty
12708 Food / Restaurants
2298 Foundations / Charity / Social Services
135 Gambling
3090 Games
6108 Government
18 Hacking / Phishing / Fraud
9225 Health / Medicine
13794 Hobbies
2376 Humour / Cool
13995 IT / Hardware / Software
5163 IT Services / Internet
195 Illegal Drugs
90 Instant Messaging
Classes
Possible Approaches
•Web structure mining - links
•Web content mining - text, html, multimedia
•Web usage mining - access logs
•combining first two approaches would be ideal, but mining from structure is computationally
difficult
Multi-label Classification
•Algorithms from WEKA are not able to process multi-label data, thus we have to transform the
problem or adapt the algorithm
•Transforming the problem:
-choose one class for each example, forgetting others
-delete all examples with more than one class
-change every combination of classes into one new class
-use one classificator for each class
Categories
Occurence
0
0.41%
1
64.45%
2
31.75%
3
3.38%
4-6
0.01%
Components of Classifier
•downloader
•modeller
•feature selection
•machine learning
DB
Downloader
Modeller
Feature
Selection
Machine
Learning
Classifier Rules
URL
label
content
model
terms
class features list
knowledge representation
Downloader
•download website using wget
•get language coding (mostly Windows-1250, ISO 8859-2 or UTF-8)
•transfer to UTF-8 using Enca
Modeller - source to vertical
•transfer text to so-called vertical
•delete HTML tags, scripts, parts of CSS, interpunction
•divide words by spaces and convert them to lower-case
Interesting article.
The article
This is the main part of the article.
vertical
word
Tag
interesting
title
article
title
the
h1
article
h1
this
none
is
none
the
none
main
none
part
none
of
none
article
none
page source
Modeller - vertical to vector
•trasfer vertical into vector model using Structure-oriented Weighting Technique
•delete words with high frequency across classes – not used
•stemming (lemmatization) – not used
ek
title
h1
h2
h3
none
w(ek)
10
5
3
2
1
vertical
word
Tag
interesting
title
article
title
the
h1
article
h1
this
none
is
none
the
none
main
none
part
none
of
none
article
none
vector model
word
weight
article
16
interesting
10
the
6
this
1
is
1
main
1
part
1
of
1
article
1
Feature Selection
•eliminate attributes with fewer than 50 ocurrences, lessening number of words in dictionary from 1
263 296 to 63 121
•compute information gain for each term
•choose 2000 best terms
Choosing Classifier
•choose 5 categories with average number of positive and negative examples
Category
Precision
Recall
F1 Measure
Arts
0.792
0.790
0.790
Entertainment
0.762
0.759
0.758
Foundations
0.761
0.758
0.758
Games
0.798
0.797
0.796
HW-SW
0.809
0.807
0.807
Mean
0.784
0.782
0.782
Category
Precision
Recall
F1 Measure
Arts
0.831
0.814
0.812
Entertainment
0.793
0.768
0.763
Foundations
0.811
0.792
0.789
Games
0.787
0.768
0.765
HW-SW
0.767
0.764
0.763
Mean
0.798
0.781
0.778
Category
Precision
Recall
F1 Measure
Arts
0.812
0.810
0.810
Entertainment
0.767
0.766
0.766
Foundations
0.766
0.764
0.764
Games
0.814
0.811
0.811
HW-SW
0.782
0.782
0.781
Mean
0.788
0.787
0.786
Category
Precision
Recall
F1 Measure
Arts
0.851
0.848
0.847
Entertainment
0.817
0.815
0.815
Foundations
0.821
0.821
0.821
Games
0.851
0.851
0.851
HW-SW
0.843
0.842
0.841
Mean
0.837
0.835
0.835
Category
Precision
Recall
F1 Measure
Arts
0.762
0.762
0.759
Entertainment
0.765
0.761
0.760
Foundations
0.742
0.742
0.741
Games
0.763
0.744
0.740
HW-SW
0.792
0.782
0.780
Mean
0.765
0.758
0.756
SVM - linear
J48
SVM - sigmoid
Random forest
Naive Bayes
Random Forest
System Evaluation
•cross-validation
–training : testing data set to 1:4
–precision 59.68%
•second approach took each class as one problem
Complications
•classes with very low number of positive examples
•some pages stopped existing
•system cannot handle HTTPS protocol, nor redirection
•existing solution was very slow when it came to classifying multiple webpages
Rare Classes
•task is to examine two classes with low occurence
•Illegal Drugs (418 URLs)
–some pages do not exist anymore, are redirected or requires confirmation
–only 96 pages (23%) were classified correctly
•Alcohol / Tobacco (5631 URLs)
–some websites caused utility wget to enter infinite loop
–2289 pages (41%) classified correctly
–category Shopping assigned many times, along with Social Networks
Rare Classes - data
Category
Times Assigned
Illegal Drugs
96
Shopping
56
Health / Medicine
43
Social Networks
39
Chats / Blogs / Forums
19
Alcohol / Tobacco
11
News / Magazines
10
Streaming / Broadcasting
10
other (classified < 10 times)
122
empty pages
59
Illegal Drugs (418 examples)
Category
Times Assigned
Alcohol / Tobacco
2289
Shopping
648
Social Networks
461
Food / Restaurants
316
Health / Medicine
203
Travelling / Vacacion
167
Chats / Blogs / Forums
105
Streaming / Broadcasting
51
other (classified < 50 times)
375
empty pages
402
Alcohol / Tobacco (5631 examples)
•classification of six thousand pages runned for about 18 hours (it would be much longer if SSD was
not used)
Possible Improvements
•remove obstacles preventing downloading some pages, such as use of HTTPS, redirection, age prompt
•relearn forests using verified data
•use faster classifier or parallelize Random Forest
•rewrite system from Python and Bash to C++
•improve feature selection
Forest classifying rare class Sects
inkjet
not in class
in class
>= 4
< 4
inkjet
not in class
in class
>= 4
< 4
inkjet
not in class
in class
>= 4
< 4
...
•rapid increase of speed (now 42 examples per min., was 5.5)
•somehow different results using same URLs
Rewriting to C++
Category
Times Assigned
Illegal Drugs
96
Shopping
56
Health / Medicine
43
Social Networks
39
Chats / Blogs / Forums
19
Alcohol / Tobacco
11
News / Magazines
10
Streaming / Broadcasting
10
other (classified < 10 times)
122
Category
Times Assigned
Social Networks
128
Illegal Drugs
81
Health / Medicine
33
Shopping
31
Chats / Blogs / Forums
17
Alcohol / Tobacco
9
Web Based Mail
7
Streaming / Broadcasting
7
other (classified < 7 times)
58
former solution (~1h 15min)
C++ version (9min 50s)
Conclusion
•rewriting system to C++ made it viable for real-time application
•the main problem is preprocessing now
–downloading webpage takes most time
–using more pages from same domain could improve accuracy
–utility wget enters infinite loop on some sites
•classifier itself could be improved as well
–independent list of attributes for each class
–another algorithm can be tried (e.g. Bayesian classifier)
•dividing program into parts operating independently would slightly improve speed
Sources
•Thesis of Mgr. Juraj Hreško, Masaryk University, Faculty of Informatics, Brno, 2012
•http://weka.wikispaces.com/
•http://www.ide.bth.se/~hgr/Papers/cuda-rf_mcc10_v1.0_crc.pdf
•http://www.cplusplus.com/