0-day Security Detections at Scale
Zdenek Letko

Motivation
Source: researchcenter.paloaltonetworks.com

Detection Process
[Diagram: Internet, New App Detection, Feature Vector Extraction, Malware / Goodware ?, Model, Database, Model training, Model evaluation, Enterprise Mobility Management Tool]

Agenda
● Malware
● ML-based Detection
● ML-based Detection as a Service

Malware
[Diagram: Internet]

Types of Mobile Malware
• Ransomware
• Spyware
• Adware
• Trojan
• Rooting
• SMS-fraud
• Cryptojacking
• Banker

Training
[Diagram: Internet, Model, Database, Model training, Model evaluation, Enterprise Mobility Management Tool]

Possible Data Sources
● Static analysis
  ○ App code (API calls, strings, domains, obfuscation, patterns, …)
  ○ App packaging (author, source, certificates, permissions, …)
● Dynamic analysis
  ○ Communication destinations IP/domain
  ○ Communication security / content / patterns
  ○ Device behaviour (battery drain, …)
● Manual inspection by threat ops team -- possible trusted labels for training data
● Internet databases and providers (free or paid = various quality)
  ○ IP/Domain blacklists
  ○ App analysis results, research reports, … -- possible labels for the training data

Feature Vector Encoding
● Unique app identifiers -- SHA / MD5 hashes
● Feature vector
  ○ Very sparse binary vector (over a million elements and growing)
● Categorical features
● Sparse feature domain

Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, and S. Vishwanathan: Hash kernels for structured data. Journal of Machine Learning Research, 10(Nov):2615–2637, 2009

Feature Hashing
• Effective representation and encoding -- Feature Hashing
• Handles increasing size of the binary vector
• Fixed size of final feature vector
• Pros
  • No dictionary (mapping)
  • Preserves sparsity
• Cons
  • No inverse mapping
  • Hash collisions

Hashing function - Number of Features after Hashing

Classification Algorithms

Model Training Loop
● Supervised learning
  ○ Balanced training set
● Logistic regression
  ○ Limited-memory Broyden–Fletcher–Goldfarb–Shanno (LBFGS) algorithm
● Training Process & Termination
  ○ No improvement in recent iterations: training loss stabilised
  ○ Overfitting: test loss increases
  ○ Number of iterations

Trained Model Evaluation -- QA
● Cross validation -- accuracy, precision, recall, ...
● Impact estimation -- validation on all previously classified samples
● Top 50 -- manually crafted sample set
Source: wikipedia

Online vs. Offline Learning
● Online learning is still challenging in our environment
● Offline learning with possibly high frequency of training

Malware Detection as a Service
[Diagram: Internet, New App Detection, Feature Vector Extraction, Malware / Goodware ?, Model, Enterprise Mobility Management Tool]

Clear Separation of ML & Production

Technical Background - Full Automation

PMML
● Header
● Data Dictionary
● Mining Schema
● Data Transformations
● Model
● Targets
● Output
● Model Explanation
● Model verification
http://dmg.org/pmml/
Pre-processing / Post-processing

Scoring Service
● Model-agnostic micro-service written in Java
  ○ Loads PMML (XML) and executes the model for given inputs
● “Low” resource requirements
● Super fast for small models
● Work distribution (data science vs. engineering)

Monitoring
• Metrics
• Logs
• Is it enough for DS?
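
The encoding from the Feature Hashing slide can be illustrated with a minimal Python sketch. It is not the implementation used in the talk: the choice of MD5 as the hash function, the bucket count `n_buckets`, the helper name `hash_features`, and the example feature strings are all assumptions made for illustration. Each categorical feature string is hashed straight to an index in a fixed-size binary vector, so no dictionary is needed and only the active indices are stored.

```python
import hashlib


def hash_features(features, n_buckets=2**18):
    """Map categorical feature strings to the sorted active indices of a
    fixed-size sparse binary vector (hypothetical helper, not from the talk)."""
    indices = set()
    for feature in features:
        # Assumption: MD5 is used only as a stable, well-mixed hash here.
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        # Reduce the first 8 bytes of the digest to a bucket index.
        indices.add(int.from_bytes(digest[:8], "big") % n_buckets)
    # Colliding features share one index, so the result can be shorter
    # than the input; the original strings cannot be recovered from it.
    return sorted(indices)


# Hypothetical features extracted from one app.
app_features = ["permission:SEND_SMS", "api:getDeviceId", "domain:example.com"]
active = hash_features(app_features, n_buckets=1024)
```

A new, previously unseen feature still maps into the same fixed index range, which is how the encoding copes with a growing feature domain. The price is the pair of cons listed on the slide: there is no inverse mapping, and a smaller `n_buckets` produces more collisions.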
Lesson Learnt: Overfitting Investigation 1/2
● Problem: Discrepancy between cross validation results and production results (accuracy, precision, recall)
● Contributing factors
  ○ Duplicates in training data (different app identifiers)
  ○ Size of feature space (hashing setting or model)
  ○ Data distribution 50/50 (goodware/malware)

Lesson Learnt: Overfitting Investigation 2/2
● Actions
  ○ Remove duplicates after hashing
  ○ Increase the regularisation parameter (a training parameter of logistic regression)
  ○ Make the training data distribution reflect production

Other Lessons Learnt
● Hashing is super useful
● Overfitting / Underfitting -- Check accuracy, precision, recall, …
● Data can become really huge
● Check Openscoring vs. Spark results -- Look for implementation bugs
● Automate everything and document your decisions
● Data scientists and threat ops like to see context in logs and they like to query that data (marketing and users love stories and visualisations)

Thank You

Wandera Ecosystem
[Diagram: Internet, RESTful Microservices, Databases, Mobile Gateway, In Backend ML, Enterprise Mobility Management Tool]
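
The "remove duplicates after hashing" action from the overfitting investigation can be sketched as follows. This is an illustrative Python sketch, not the code from the talk: the tuple layout `(app_id, active_indices, label)`, the helper name `deduplicate`, and the sample data are assumptions. The point is that two apps with different identifiers can produce the same hashed feature vector, so the duplicate check has to use the vector rather than the identifier.

```python
def deduplicate(samples):
    """Keep the first sample for each distinct (hashed vector, label) pair.

    samples: iterable of (app_id, active_indices, label) tuples, where
    active_indices lists the non-zero positions of the hashed binary vector.
    """
    seen = set()
    unique = []
    for app_id, indices, label in samples:
        # The key ignores app_id on purpose: repackaged apps differ in their
        # SHA / MD5 identifier but may share the exact same feature vector.
        key = (frozenset(indices), label)
        if key not in seen:
            seen.add(key)
            unique.append((app_id, indices, label))
    return unique


# Hypothetical training samples; label 1 = malware, 0 = goodware.
samples = [
    ("sha-aaa", [3, 17, 42], 1),
    ("sha-bbb", [42, 3, 17], 1),  # same vector as sha-aaa, different identifier
    ("sha-ccc", [3, 17], 0),
]
unique = deduplicate(samples)
```

Vectors that are identical but carry conflicting labels are kept as separate entries in this sketch. Whether to drop them or send them for manual review is a separate decision that the slides do not cover.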