DISTRIBUTEDEVENT-DRIVENMONITORING INTRO DanielTovarňák LAB OF SOFTWAREARCHITECTURES AND INFORMATION SYSTEMS FACULTY OF INFORMATICS MASARYK UNIVERSITY Monitoring(ofdistributedinfrastructure) • Continuous and systematic collection, analysis, and evaluation of data related to the state and behavior of respective constituents of said infrastructure. • Enterprise networks • Internet ofThings • Smart Grid (energy grid) • Cloud infrastructure P P P P P P P P P P P P P P P P P P C P P P P P P P P ? MonitoringinGeneral Goal –intelligentbehaviormonitoring • Detection of (known) behavior patterns in the produced monitoring data in real-time • Dictionary attack, DDoS detection, Job state • Monitoring information: User Bob has logged in • Pattern: User X failed to log in 1000 times within 1 minute • Low overhead imposed on monitored machines and network • Several problems hinder achievement of such a goal MonitoringofCloudinfrastructure • Huge volumes of data produced by many distributed producers (virtual machines) • High variability of monitoring data • Hardware, OS, Middle-ware,Web server, Application-level • The entity of interest is usually spread across many computing nodes • Hadoop job, Custom distributed algorithm, Replicated DB • Specific trust model Problems • Technical • mainly with respect to the monitoring data production • e.g. logging in natural language • Conceptual • related to 3V of Big Data • e.g. scalability, and query expressiveness/complexity Monitoringdatacollection • Huge volumes of data (up to 1MB/s perVM) • typically 100-1000 producers • Centralized • Limited scalability • Selective (eg. Publish-subscribe) • Still centralized (data-wise) • Distributed (eg. Hadoop Distributed File System) • Possible solution, in combination with pub-sub Distributedprocessing • Traditional DBMSs (distributed or not) are not very suitable for continuous queries (from the performance perspective) • Solutions based on distributed collection and batch processing (MapReduce) have high latency (~mins) • Off-line vs. On-line algorithms DistributedEvent-drivenMonitoringModel • Stream (online) processing of monitoring data in the form of events – everything is an event • Techniques and algorithms for complex event processing • Fully distributed processing using special variant of publish-subscribe (pattern-based) Event-driven • We consider everything to be an event • Measurement/metric (it is a predefined change) • State (its change) • Event (duh…) • Complex Event Processing • simple events are composed into more complex ones • final complex event = detected pattern DistributedEvent-drivenMonitoringModel PA PA PA PA C C C C P 1 3 P 2 2 P 1 2 PA SUBSCRIBEPUBLISH P 3 2 behavior & state events complex events complex queries sub-queries 1 1 2 3 1|2|3 – access control pub sub DistributedEvent-drivenMonitoringModel PA PA PA PA C C C C P 1 3 P 2 2 P 1 2 PA SUBSCRIBEPUBLISH P 3 2 behavior & state events complex events complex queries sub-queries 1 1 2 3 1|2|3 – access control pub sub {'SimpleEvent':{ 'occurrenceTime':'2012-04-11T08:25:13', 'hostname':'lykomedes.fi.muni.cz', 'entity':'org.openssh.sshd.SERVER', 'type':'org.openssh.LOGIN', 'http://openssh.com/v6.1/events.jsch':{ 'user':'bob', 'success':'false', 'sourceIP': 147.165.0.86, 'port':22 } }} DistributedEvent-drivenMonitoringModel PA PA PA PA C C C C P 1 3 P 2 2 P 1 2 PA SUBSCRIBEPUBLISH P 3 2 behavior & state events complex events complex queries sub-queries 1 1 2 3 1|2|3 – access control pub sub Subscribe for DISTR_DICT_ATTACK= select count(*) as hostsNumber from RepeatedLoginEvent.win:time(2 min) where hostsNumber > 10 group by hostname AND REPEATED_LOGIN= select hostname, username, success, count(*) as attempts from LoginEvent.win:time(60 sec) where attempts > 1000, success=false group by hostname, username DistributedEvent-drivenMonitoringModel PA PA PA PA C C C C P 1 3 P 2 2 P 1 2 PA SUBSCRIBEPUBLISH P 3 2 behavior & state events complex events complex queries sub-queries 1 1 2 3 1|2|3 – access control pub sub REPEATED_LOGIN DISTR_DICT_ATTACK LOGIN LOGIN 3 DistributedEvent-drivenMonitoringModel PA PA PA PA C C C C P 1 3 P 2 2 P 1 2 PA SUBSCRIBEPUBLISH P 3 2 behavior & state events complex events complex queries sub-queries 1 1 2 3 1|2|3 – access control pub sub REPEATED_LOGIN DISTR_DICT_ATTACK LOGIN LOGIN DistributedEvent-drivenMonitoringModel PA PA PA PA C C C C P 1 3 P 2 2 P 1 2 PA SUBSCRIBEPUBLISH P 3 2 behavior & state events complex events complex queries sub-queries 1 1 2 3 1|2|3 – access control pub sub REPEATED_LOGIN DISTR_DICT_ATTACK {'ComplexEvent':{ 'id':19058906, 'occurrenceTime':'2012-04-11T08:25:13.129Z', 'hostname':'processing-agent-14.fi.muni.cz', 'entity':‘cloud1-group', 'type':'cz.muni.fi.ngmon.DISTR_DICT_ATTACK', 'http://ngmon.fi.muni.cz/v1.0/cplxevents.jsch':{ 'hostnames':[aisa.fi, ... , lykomedes.fi], 'hostsNumber': 19, 'users':[xtovarn, tomp] } }} C Differentrepresentationofthemodel EventProcessingAgents • Processing agent performs one or more processing functions -- operators • Filter • Time window • sliding-tuple, sliding, tumble • Aggregation (+ group by) • sum, count, stdev, min, max • Sequence detection • Multi-way JOIN Box-And-ArrowsQueries P P P WIN 60 secs AGG count FILTER c > 1000 WIN 2 mins FILTER succ = false AGG count FILTER c > 10 C GB user, hostname GB hostname Box-And-ArrowsQueries P P P WIN 60 secs AGG count FILTER c > 1000 WIN 2 mins FILTER succ = false AGG count FILTER c > 10 C GB user, hostname GB hostname ProcessingAgent 1 ProcessingAgent 2 Box-And-ArrowsQueries P P P WIN 60 secs AGG count FILTER c > 1000 WIN 2 mins FILTER succ = false AGG count FILTER c > 10 C GB user, hostname GB hostname ProcessingAgents 1..N ProcessingAgents N+1, N+2 Models • Event Processing Algebra • simple EP operator algebra • time and space complexity of each operator • Distributed monitoring (meta?)model (static, dyn.) • best operators distribution • (w.r.t. available nodes, bandwidth, ever) • latency (minimize) • throughput (maximize) • What data (from where) are needed to detect the pattern? • which producers, what events? PrototypeImplementation–Currentstate • Prototype of distributed variant (simple static deployment with known patterns) • as the number of monitored nodes grows, new monitoring nodes can be added – almost linear scalability • Typical CEP engine is able to process 50k-100k events per second • Distributed engine/algorithm under development • Lightweigth engine (limited set of operators for monitoring) • Erlang is used – scalability, reliability, robustness Summary-DEDMM • Our goal is behavior monitoring of many distributed producers in real-time • The model introduces paradigm shift towards online data processing utilizing complex event processing and detection • We aim at fully-distributed event processing ExtensiontoSmartGrid • Considerable volumes of data produced by relatively static set of producers • Moderate variability of monitoring data • primarily measurements • Unreliable and slow communication channels • GPRS (EDGE) SimulationenvironmentforSmartGrid • Joint collaboration of Mycroft Mind, CERIT-SC MU, ČEZ, and Lasaris FI MU • 3,500,000 smart meters simulated in CERIT Cloud (unique project in Europe) • Several concepts presented today were used for the simulation environment monitoring