Spam detection using IP geolocation
O-talk
Andriy Stetsko

Outline
● Possible sources of geolocation data
● Overview of available geolocation databases
● Results of tests of selected geoip databases
    ● performance test
    ● coverage & accuracy tests
● Results of tests of Bayes classifier detection accuracy when
    ● geolocation information IS NOT used
    ● geolocation information IS used

Sources of geolocation data (1)
● Public WHOIS database
    ● maps IPs to real-world entities
    ● organization address ≠ geolocation of IP
● Users
    ● answer to location questions on web-site
    ● entries can be wrong
● Applications
    ● HTTP Accept-Charset header sent by browser
    ● not always available
    ● can be falsified

Sources of geolocation data (2)
● Round trip time to landmark
    ● ICMP echo message
    ● not all target hosts respond to ICMP echo
    ● target host may consider it an attack
    ● poor results for hosts with high-latency connections
    ● target host can delay its replies
● ISP
    ● obtain (purchase) network topology

Overview of geolocation databases (1)
● IP2Location
    ● 18 databases
● MaxMind
    ● GeoIP Country, GeoIP City
    ● GeoLite Country, GeoLite City
● Quova
● Digital-element
● IPligence
    ● Lite, Basic, Max
    ● Lite: free

Overview of geolocation databases (2)
● Geobytes
● IPInfoDB (compiled from GeoLite City)
    ● Country
    ● City
        ● 3 IP digits precision
        ● 4 IP digits precision
● Software77
● IP::Country::Fast
● IP::Country::DB_File
    ● built from publicly available statistics files of Regional Internet Registries

Databases to test

Performance test
● Test measures the time needed to process 1,000,000 requests
● Test was repeated 10 times for each database

Coverage & accuracy tests (1)
● Test was done for 840 IP addresses
● Coverage = (840 - #Uncovered) / 840
● Accuracy = (840 - #Incorrect) / 840

Coverage & accuracy tests (2)
[Figure: two bar charts, one for Database #3 and one for Database #4; x-axis: DE, NL, GB, AT, SK, ES, EU; y-axis: 0 to 35]

Bayes detection accuracy (1)
● SpamAssassin v. 3.3.1 running on Perl v. 5.10.1
● RelayCountry plugin
    ● analyses “Received” headers
    ● adds “X-Relay-Countries” header
● Database of e-mails
    ● 10762 spam & 9072 ham
    ● e-mails contain headers added by spam detection software

Bayes detection accuracy (2)
(E = enabled, D = disabled)

Test no.  Train (%)  Test (%)  Auto-learning (E/D)  Auto-expiration (E/D)
1         50         50        D                    E
2         70         30        D                    E
3         70         30        D                    D
4         70         30        D                    D

● Test #4: no difference in the detection accuracy of the Bayes classifier between runs with and without the RelayCountry plugin

Bayes detection accuracy (3)
● Test #1 and #2:
    ● introducing geolocation info increased detection accuracy, but only slightly
    ● ham recognition accuracy was the same in both cases
    ● detection accuracy in Test #2 was worse than in Test #1
● Test #3:
    ● detection accuracy was higher than in Tests #1 and #2

Conclusion & future work
● Geolocation info increases detection accuracy of the SpamAssassin Bayes classifier
    ● increase weights of tokens containing geolocation info
● Token expiration has a great impact on detection accuracy of the Bayes classifier
    ● propose a more effective expiration policy
● Explore correlation between e-mail charset and country code (returned by RelayCountry) of e-mail sender
● Explore correlation between TLD of sender e-mail and country code (returned by RelayCountry) of e-mail sender
● Configuration tool for RelayCountry plugin

Thank you for your attention!
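
The coverage and accuracy definitions from the "Coverage & accuracy tests (1)" slide can be sketched in Python as below. This is only an illustration, not the test harness used for the talk: the `Lookup` record, the function names, and the sample addresses (taken from the reserved documentation ranges) are all hypothetical. It also assumes that #Incorrect counts only addresses for which a database returned a wrong country, so an uncovered address lowers coverage but not accuracy; the slides do not state this explicitly.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Lookup:
    """One test address (hypothetical record layout)."""
    ip: str
    expected: str            # known country code for this address
    returned: Optional[str]  # database answer; None means "not covered"


def coverage(results: List[Lookup]) -> float:
    # Coverage = (N - #Uncovered) / N
    uncovered = sum(1 for r in results if r.returned is None)
    return (len(results) - uncovered) / len(results)


def accuracy(results: List[Lookup]) -> float:
    # Accuracy = (N - #Incorrect) / N
    # Assumption: only wrong answers count as incorrect, not missing ones.
    incorrect = sum(
        1 for r in results
        if r.returned is not None and r.returned != r.expected
    )
    return (len(results) - incorrect) / len(results)


# Made-up sample data, for illustration only.
sample = [
    Lookup("192.0.2.1", "DE", "DE"),     # correct
    Lookup("192.0.2.2", "NL", "DE"),     # incorrect
    Lookup("198.51.100.7", "SK", None),  # uncovered
    Lookup("203.0.113.9", "AT", "EU"),   # incorrect
]

print(coverage(sample))  # 0.75
print(accuracy(sample))  # 0.5
```

With the 840 addresses used in the talk, `len(results)` would be 840 and the two functions reduce to the formulas on the slide.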