Website Classification Mgr. Juraj Hreško`s thesis presented by Jaromír Navrátil 7.2.2013 Synopsis •Task •Possible solutions •Solution •Rare classes •Possible improvements •Rewriting to C++ The Task •create application to classify czech websites •61 classes •multi-labeling (1-3 classes for each document) •real-time classification •be able to adjust the algorithm to maximize precision or recall Occurrences Class 678 Insurance 1170 Job / Career 6003 Kids / Toys / Family 1059 Military / Guns 1974 Mobile Phones / Operators 11826 Music / Radio / Cinema / TV 3477 News / Magazines 54 Peer-to-peer 10002 Personal / Dating / Lifestyle 2049 Politics / Law 4077 Pornography 4227 Portals / Search Engines 90 Proxies 2475 Real Estate 6966 Regional 1803 Religious / Spirituality 6405 Sale / Auctions 6 Sects 48 Sex Education 42240 Shopping 288 Social Networks 14913 Sports 120 Streaming / Broadcasting 951 Swimwear / Intimate 384 Translation Services 24537 Travelling / Vacation 1788 Uploading / Downloading 816 Warez / Piracy 135 Web Based Mail 888 Web Hosting 1110 Money / Financial Occurrences Class 3159 Advertisement 1281 Alcohol / Tobacco 2442 Arts 9756 Cars / Vehicles 1590 Banking 450 Brokers 27066 Building / Home 15045 Business 16998 Chats / Blogs / Forums 1068 Communications 72 Crime 11805 Education 2613 Entertainment 5553 Environment 1575 Erotic / Adult / Nudity 459 Extreme / Hate / Violence 13302 Fashion / Beauty 12708 Food / Restaurants 2298 Foundations / Charity / Social Services 135 Gambling 3090 Games 6108 Government 18 Hacking / Phishing / Fraud 9225 Health / Medicine 13794 Hobbies 2376 Humour / Cool 13995 IT / Hardware / Software 5163 IT Services / Internet 195 Illegal Drugs 90 Instant Messaging Classes Possible Approaches •Web structure mining - links •Web content mining - text, html, multimedia •Web usage mining - access logs •combining first two approaches would be ideal, but mining from structure is computationally difficult Multi-label Classification •Algorithms from WEKA are not able to process multi-label data, thus we have to transform the problem or adapt the algorithm •Transforming the problem: -choose one class for each example, forgetting others -delete all examples with more than one class -change every combination of classes into one new class -use one classificator for each class Categories Occurence 0 0.41% 1 64.45% 2 31.75% 3 3.38% 4-6 0.01% Components of Classifier •downloader •modeller •feature selection •machine learning DB Downloader Modeller Feature Selection Machine Learning Classifier Rules URL label content model terms class features list knowledge representation Downloader •download website using wget •get language coding (mostly Windows-1250, ISO 8859-2 or UTF-8) •transfer to UTF-8 using Enca Modeller - source to vertical •transfer text to so-called vertical •delete HTML tags, scripts, parts of CSS, interpunction •divide words by spaces and convert them to lower-case Interesting article.<title> </head> <body> <h1>The article</h1> This is the main part of the article. </body> </html> vertical word Tag interesting title article title the h1 article h1 this none is none the none main none part none of none article none page source Modeller - vertical to vector •trasfer vertical into vector model using Structure-oriented Weighting Technique •delete words with high frequency across classes – not used •stemming (lemmatization) – not used ek title h1 h2 h3 none w(ek) 10 5 3 2 1 vertical word Tag interesting title article title the h1 article h1 this none is none the none main none part none of none article none vector model word weight article 16 interesting 10 the 6 this 1 is 1 main 1 part 1 of 1 article 1 Feature Selection •eliminate attributes with fewer than 50 ocurrences, lessening number of words in dictionary from 1 263 296 to 63 121 •compute information gain for each term •choose 2000 best terms Choosing Classifier •choose 5 categories with average number of positive and negative examples Category Precision Recall F1 Measure Arts 0.792 0.790 0.790 Entertainment 0.762 0.759 0.758 Foundations 0.761 0.758 0.758 Games 0.798 0.797 0.796 HW-SW 0.809 0.807 0.807 Mean 0.784 0.782 0.782 Category Precision Recall F1 Measure Arts 0.831 0.814 0.812 Entertainment 0.793 0.768 0.763 Foundations 0.811 0.792 0.789 Games 0.787 0.768 0.765 HW-SW 0.767 0.764 0.763 Mean 0.798 0.781 0.778 Category Precision Recall F1 Measure Arts 0.812 0.810 0.810 Entertainment 0.767 0.766 0.766 Foundations 0.766 0.764 0.764 Games 0.814 0.811 0.811 HW-SW 0.782 0.782 0.781 Mean 0.788 0.787 0.786 Category Precision Recall F1 Measure Arts 0.851 0.848 0.847 Entertainment 0.817 0.815 0.815 Foundations 0.821 0.821 0.821 Games 0.851 0.851 0.851 HW-SW 0.843 0.842 0.841 Mean 0.837 0.835 0.835 Category Precision Recall F1 Measure Arts 0.762 0.762 0.759 Entertainment 0.765 0.761 0.760 Foundations 0.742 0.742 0.741 Games 0.763 0.744 0.740 HW-SW 0.792 0.782 0.780 Mean 0.765 0.758 0.756 SVM - linear J48 SVM - sigmoid Random forest Naive Bayes Random Forest System Evaluation •cross-validation –training : testing data set to 1:4 –precision 59.68% •second approach took each class as one problem Complications •classes with very low number of positive examples •some pages stopped existing •system cannot handle HTTPS protocol, nor redirection •existing solution was very slow when it came to classifying multiple webpages Rare Classes •task is to examine two classes with low occurence •Illegal Drugs (418 URLs) –some pages do not exist anymore, are redirected or requires confirmation –only 96 pages (23%) were classified correctly •Alcohol / Tobacco (5631 URLs) –some websites caused utility wget to enter infinite loop –2289 pages (41%) classified correctly –category Shopping assigned many times, along with Social Networks Rare Classes - data Category Times Assigned Illegal Drugs 96 Shopping 56 Health / Medicine 43 Social Networks 39 Chats / Blogs / Forums 19 Alcohol / Tobacco 11 News / Magazines 10 Streaming / Broadcasting 10 other (classified < 10 times) 122 empty pages 59 Illegal Drugs (418 examples) Category Times Assigned Alcohol / Tobacco 2289 Shopping 648 Social Networks 461 Food / Restaurants 316 Health / Medicine 203 Travelling / Vacacion 167 Chats / Blogs / Forums 105 Streaming / Broadcasting 51 other (classified < 50 times) 375 empty pages 402 Alcohol / Tobacco (5631 examples) •classification of six thousand pages runned for about 18 hours (it would be much longer if SSD was not used) Possible Improvements •remove obstacles preventing downloading some pages, such as use of HTTPS, redirection, age prompt •relearn forests using verified data •use faster classifier or parallelize Random Forest •rewrite system from Python and Bash to C++ •improve feature selection Forest classifying rare class Sects inkjet not in class in class >= 4 < 4 inkjet not in class in class >= 4 < 4 inkjet not in class in class >= 4 < 4 ... •rapid increase of speed (now 42 examples per min., was 5.5) •somehow different results using same URLs Rewriting to C++ Category Times Assigned Illegal Drugs 96 Shopping 56 Health / Medicine 43 Social Networks 39 Chats / Blogs / Forums 19 Alcohol / Tobacco 11 News / Magazines 10 Streaming / Broadcasting 10 other (classified < 10 times) 122 Category Times Assigned Social Networks 128 Illegal Drugs 81 Health / Medicine 33 Shopping 31 Chats / Blogs / Forums 17 Alcohol / Tobacco 9 Web Based Mail 7 Streaming / Broadcasting 7 other (classified < 7 times) 58 former solution (~1h 15min) C++ version (9min 50s) Conclusion •rewriting system to C++ made it viable for real-time application •the main problem is preprocessing now –downloading webpage takes most time –using more pages from same domain could improve accuracy –utility wget enters infinite loop on some sites •classifier itself could be improved as well –independent list of attributes for each class –another algorithm can be tried (e.g. Bayesian classifier) •dividing program into parts operating independently would slightly improve speed Sources •Thesis of Mgr. Juraj Hreško, Masaryk University, Faculty of Informatics, Brno, 2012 •http://weka.wikispaces.com/ •http://www.ide.bth.se/~hgr/Papers/cuda-rf_mcc10_v1.0_crc.pdf •http://www.cplusplus.com/