0-day Security Detections at Scale
Zdenek Letko
Motivation
Source: researchcenter.paloaltonetworks.com
Detection Process
[Diagram: Internet; New App; Detection (Feature Vector Extraction, Model); outcomes Malware / Goodware / ?; Database; Model training; Model evaluation; Enterprise Mobility Management Tool]
Agenda
Malware
ML-based Detection
ML-based Detection as a Service
Malware
Types of Mobile Malware
• Ransomware
• Spyware
• Adware
• Trojan
• Rooting
• SMS-fraud
• Cryptojacking
• Banker
Training
[Diagram: Internet; Database; Model training; Model evaluation; Model; Enterprise Mobility Management Tool]
Possible Data Sources
● Static analysis
○ App code (API calls, strings, domains, obfuscation, patterns, …)
○ App packaging (author, source, certificates, permissions, …)
● Dynamic analysis
○ Communication destinations IP/domain
○ Communication security / content / patterns
○ Device behaviour (battery drain, …)
● Manual inspection by threat ops team -- possible trusted labels for training data
● Internet databases and providers (free or paid, of varying quality)
○ IP/Domain blacklists
○ App analysis results, research reports, … -- possible labels for the training data
Feature Vector Encoding
● Unique app identifiers -- SHA / MD5 hashes
● Feature vector
○ Very sparse binary vector (over a million elements and growing)
● Categorical features
● Sparse feature domain
Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. Vishwanathan:
Hash kernels for structured data. Journal of Machine Learning Research, 10(Nov):2615–2637, 2009
Feature Hashing
• Effective representation and encoding -- Feature Hashing
• Handles increasing size of the binary vector
• Fixed size of final feature vector
• Pros
• No dictionary (mapping)
• Preserves sparsity
• Cons
• No inverse mapping
• Hash collisions
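A minimal sketch of the idea, not the implementation used in the talk: each categorical feature string is hashed straight to an index in a fixed-size binary vector, so no dictionary is needed and the vector length does not grow with new features. The function name `hash_features`, the feature strings, the use of MD5 as the hash, and the bucket count are all assumptions made for illustration. This is the unsigned binary variant; the signed variant from the hash kernels paper is not shown.

```python
import hashlib


def hash_features(features, n_buckets=2 ** 18):
    """Map feature strings to the active indices of a fixed-size binary vector.

    Only the indices set to 1 are returned, which preserves sparsity.
    Hypothetical helper: name, hash choice and default size are assumptions.
    """
    active = set()
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        # Take 8 bytes of the digest as an integer and fold it into the vector.
        index = int.from_bytes(digest[:8], "big") % n_buckets
        # Two different features may land on the same index (hash collision).
        active.add(index)
    return sorted(active)


# Hypothetical features extracted from one app.
app_features = ["perm:SEND_SMS", "api:getDeviceId", "domain:example.com"]
vector = hash_features(app_features, n_buckets=1024)

# A feature never seen before needs no dictionary update.
extended = hash_features(app_features + ["perm:CAMERA"], n_buckets=1024)
```

The cons listed above are visible here: nothing maps an index back to the feature that set it, and a smaller `n_buckets` makes collisions more likely.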
Hashing function - Number of Features after Hashing
Classification Algorithms
Model Training Loop
● Supervised learning
○ Balanced training set
● Logistic regression
○ Limited memory Broyden–Fletcher–Goldfarb–Shanno (LBFGS) algorithm
● Training Process & Termination
○ No improvement in recent iterations: training loss has stabilised
○ Overfitting: test loss increases
○ Maximum number of iterations reached
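The three termination criteria can be sketched as a small training loop. This is an illustration under stated assumptions, not the talk's training code: plain batch gradient descent is swapped in for L-BFGS to keep the example short and dependency-free, and the names `train`, `tol`, `patience`, the learning rate and the toy data are all hypothetical.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(weights, x):
    return sigmoid(sum(w * v for w, v in zip(weights, x)))


def log_loss(weights, rows, labels):
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for x, y in zip(rows, labels):
        p = predict(weights, x)
        total -= y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
    return total / len(rows)


def train(train_x, train_y, test_x, test_y,
          lr=0.5, max_iter=500, tol=1e-6, patience=5):
    """Logistic regression by batch gradient descent (stand-in for L-BFGS)."""
    weights = [0.0] * len(train_x[0])
    prev_train = float("inf")
    best_test = float("inf")
    bad_rounds = 0
    for iteration in range(1, max_iter + 1):
        grad = [0.0] * len(weights)
        for x, y in zip(train_x, train_y):
            err = predict(weights, x) - y
            for j, v in enumerate(x):
                grad[j] += err * v
        weights = [w - lr * g / len(train_x) for w, g in zip(weights, grad)]

        train_loss = log_loss(weights, train_x, train_y)
        test_loss = log_loss(weights, test_x, test_y)
        # Criterion 1: training loss has stabilised.
        if abs(prev_train - train_loss) < tol:
            return weights, iteration, "training loss stabilised"
        # Criterion 2: test loss has not improved for `patience` rounds.
        if test_loss < best_test:
            best_test, bad_rounds = test_loss, 0
        else:
            bad_rounds += 1
            if bad_rounds >= patience:
                return weights, iteration, "test loss increases"
        prev_train = train_loss
    # Criterion 3: iteration limit reached.
    return weights, max_iter, "iteration limit"


# Toy balanced data: first column is a bias term, second column decides the label.
train_x = [[1, 1, 0], [1, 1, 1], [1, 0, 0], [1, 0, 1]]
train_y = [1, 1, 0, 0]
test_x = [[1, 1, 0], [1, 0, 1]]
test_y = [1, 0]

weights, iterations, reason = train(train_x, train_y, test_x, test_y)
```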
Trained Model Evaluation -- QA
● Cross-validation -- accuracy, precision, recall, ...
● Impact estimation -- validation on all previously classified samples
● Top 50 -- manually crafted sample set
Source: wikipedia
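For reference, the three metrics named above follow directly from the confusion matrix. The sketch below assumes binary labels with 1 = malware and 0 = goodware; the function name `evaluate` and the sample labels are hypothetical.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, precision and recall for binary labels (1 = malware)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        # Share of apps flagged as malware that really are malware.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of real malware that was flagged.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels for eight apps.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
metrics = evaluate(y_true, y_pred)
```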
Online vs. Offline Learning
Online learning is still challenging in our environment
Offline learning with possibly high frequency of training
Malware Detection as a Service
[Diagram: Internet; New App; Detection (Feature Vector Extraction, Model); outcomes Malware / Goodware / ?; Enterprise Mobility Management Tool]
Clear Separation of ML & Production
Technical Background - Full Automation
PMML
● Header
● Data Dictionary
● Mining Schema
● Data Transformations
● Model
● Targets
● Output
● Model Explanation
● Model verification
http://dmg.org/pmml/
Pre-processing
Post-processing
Scoring Service
● Model agnostic micro-service written in Java
○ Loads PMML (XML) and executes the model for given inputs
● “Low” resource requirements
● Super fast for small models
● Work distribution (data science vs. engineering)
Monitoring
• Metrics
• Logs
• Is it enough for DS?
Lesson Learnt: Overfitting Investigation 1/2
● Problem: Discrepancy between cross-validation results and production results (accuracy, precision, recall)
● Contributing factors
○ Duplicates in training data (different app identifiers)
○ Size of feature space (hashing setting or model)
○ Data distribution 50/50 (goodware/malware)
Lesson Learnt: Overfitting Investigation 2/2
● Actions
○ Remove duplicates after hashing
○ Increase the regularisation parameter of the logistic regression
○ Make the training data distribution reflect production
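The first action can be sketched as follows. Two apps with different identifiers may still produce the same hashed feature vector, and such duplicates can leak between cross-validation folds. The sample layout `(app_id, active_indices, label)` and the name `deduplicate` are assumptions for illustration; vectors with conflicting labels are kept as separate entries here, which is a design choice rather than something stated in the talk.

```python
def deduplicate(samples):
    """Keep the first sample for each distinct (hashed vector, label) pair.

    `samples` is an iterable of (app_id, active_indices, label) tuples,
    where active_indices are the non-zero positions after feature hashing.
    """
    unique = {}
    for app_id, indices, label in samples:
        key = (frozenset(indices), label)
        # setdefault keeps the first app seen for this key and ignores repeats.
        unique.setdefault(key, (app_id, sorted(indices), label))
    return list(unique.values())


# Hypothetical samples: the first two apps differ only in their identifier.
samples = [
    ("sha-aaa", [3, 17, 512], 1),
    ("sha-bbb", [512, 3, 17], 1),
    ("sha-ccc", [8, 99], 0),
]
training_set = deduplicate(samples)
```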
Other Lessons Learnt
● Hashing is super useful
● Overfitting / Underfitting -- Check accuracy, precision, recall, …
● Data can become really huge
● Check Openscoring vs. Spark results -- Look for implementation bugs
● Automate everything and document your decisions
● Data scientists and threat ops like to see context in logs and they like to query those data (marketing and users love stories and visualisations)
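The consistency check between the two scoring paths can be as simple as comparing scores for the same apps within a tolerance. The sketch below assumes both systems' outputs have already been exported as dictionaries from app identifier to score; `find_mismatches`, the tolerance and the sample values are hypothetical, and no Openscoring or Spark API is called.

```python
import math


def find_mismatches(reference, candidate, tol=1e-6):
    """List apps whose scores differ by more than `tol` or are missing."""
    mismatches = []
    for app_id, ref_score in reference.items():
        other = candidate.get(app_id)
        if other is None or not math.isclose(ref_score, other,
                                             rel_tol=0.0, abs_tol=tol):
            mismatches.append((app_id, ref_score, other))
    return mismatches


# Hypothetical exported scores from the training side and the scoring service.
spark_scores = {"sha-aaa": 0.91234567, "sha-bbb": 0.10000000, "sha-ccc": 0.5}
service_scores = {"sha-aaa": 0.91234568, "sha-bbb": 0.25000000}

problems = find_mismatches(spark_scores, service_scores)
```

An empty result on a representative sample set gives some confidence that the exported model and the production scorer agree; any entry is a candidate implementation bug.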
Thank You
Wandera Ecosystem
[Diagram: Internet; Mobile Gateway; RESTful Microservices; Databases; In Backend ML; Enterprise Mobility Management Tool]