Mining logs to predict system errors
prof. Barbara Russo, Free University of Bozen-Bolzano, Italy
SwSE - Software and Systems Engineering
Masaryk University, November 10th 2016

Today's systems
• Computational power increases
• (2015) Tianhe-2, China - 3,120,000 cores
• The larger the system, the more frequent the critical events
• Lower overall system utilisation
• Hardware failures, software failures, and user errors

Manage failures …
• Crashes
  • immediately stop the system
  • are easily identifiable (e.g., disk failure)
  • but can give rise to a large number of events spread across components
• Deviations from the expected output
  • let the system run
  • are revealed only at completion of system tasks

To better manage failures …
… we need information on system behaviour to make failure predictions.

Systems generate big data
• Part of such data traces the change in behaviour of the system and its sub-components
• Logging services store the state changes of a system in archives, the logs

Mining logs
• How can we exploit log data to model and predict system behaviour?

Log events
• A log event represents a change in a system state
• [Figure: an XML log event]

System misbehaviour
• Some events can tell about undesirable system behaviour

Error events act as alerts
• Events in an error state (error events) act as alerts of system failures:
  • interpretation of event data might be hard
  • they originate from a series of preceding events

Logs can be cryptic
• [Figure: excerpt of a cryptic SAP log]

Interpretation
• Failure, but the program exited cleanly:
  YY-MM-DD-HH:MM:SS NULL ZZZ MYHOST FAILURE xxx exited normally with exit code 0
• If the system administrator was doing maintenance on the machine, this message is a harmless artefact of their actions
• If it was generated during normal machine operation, this message indicates that all running jobs on the computer were undesirably killed

Operational context
• We need to understand the operational context

Originated from a series of preceding events
• System changes may have introduced errors much earlier than the error manifests in the logs

The classification problem
• [Diagram: features extracted from the data sets feed a classifier that separates G1 = Faulty from G2 = Non-Faulty]

Sequences
• Event sequence: a set of events ordered by their timestamp and occurring within a given time window
• A sequence abstraction is a representation of such a sequence (e.g., a vector) that can be used to feed classifiers (features)

Building features
A. Isolating sequences
  • Identify the sequence length
  • Characterise the sequence information
B. Build a sequence abstraction (e.g., a vector)
C. Build features (a code sketch of steps A–C follows the examples below)

Isolating sequences
• [Figure: raw sequences have different lengths and different event types]

Sequence abstraction
• μi – number of events of type i in a sequence (multiplicity)
• sv = [μ1, …, μn] – vector of event multiplicities

Example – sequence abstraction
• Event types: {General, Log In, Performance, Systems}
• sv1 = [0,1,0,1]
• sv2 = [2,1,1,0]

Multiple sequences and users
• [Table: sequences s2, s7, s10, s14, s30 with the same length and the same event types share the multiplicities µ1 … µn, so they map onto the same abstraction]

Features
• v = [sv, μ(sv), ν(sv)] – feature
• μ(sv) = # sequences mapping onto sv
• ν(sv) = average # of users in the sequences mapping onto sv
• ρ(sv) = number of errors in the sequences mapping onto sv
• v is a faulty feature if at least one event in one sequence is in an error state, i.e., ρ(sv) > 0

Example – features
• v1 = [0,1,0,1;1,1], sv1 = [0,1,0,1], μ(sv1) = 1, ν(sv1) = 1, ρ(sv1) = 0
• v2 = [2,1,1,0;1,2], sv2 = [2,1,1,0], μ(sv2) = 1, ν(sv2) = 2, ρ(sv2) = 2
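A minimal sketch of steps A–C in Python, assuming hypothetical event records with ts (timestamp in seconds), type, user, and state fields; the event-type universe, the window length, and all function names are assumptions for illustration, not the original tooling:

    from collections import Counter, defaultdict

    # Assumed event-type universe, taken from the slide example.
    EVENT_TYPES = ["General", "Log In", "Performance", "Systems"]

    def isolate_sequences(events, window_s=300):
        # Step A: order events by timestamp and cut them into sequences,
        # one per fixed-width time window (window length is assumed).
        events = sorted(events, key=lambda e: e["ts"])
        windows = defaultdict(list)
        for e in events:
            windows[e["ts"] // window_s].append(e)
        return list(windows.values())

    def sequence_abstraction(events, event_types=EVENT_TYPES):
        # Step B: map a sequence onto its multiplicity vector
        # sv = [mu_1, ..., mu_n], where mu_i counts events of type i.
        counts = Counter(e["type"] for e in events)
        return tuple(counts.get(t, 0) for t in event_types)

    def build_features(sequences):
        # Step C: group sequences by their abstraction sv and derive the
        # feature v = [sv, mu(sv), nu(sv)] plus the error count rho(sv).
        groups = defaultdict(list)
        for seq in sequences:
            groups[sequence_abstraction(seq)].append(seq)
        features = []
        for sv, seqs in groups.items():
            mu = len(seqs)  # mu(sv): number of sequences mapping onto sv
            nu = sum(len({e["user"] for e in s}) for s in seqs) / mu  # nu(sv)
            rho = sum(1 for s in seqs for e in s
                      if e.get("state") == "error")  # rho(sv)
            features.append({"sv": sv, "mu": mu, "nu": nu, "rho": rho})
        return features

    # A sequence with one Log In and one Systems event maps onto sv1
    # from the slide example:
    seq = [{"ts": 0, "type": "Log In", "user": "u1", "state": "ok"},
           {"ts": 9, "type": "Systems", "user": "u1", "state": "ok"}]
    assert sequence_abstraction(seq) == (0, 1, 0, 1)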
Which models do we use to predict system behaviour?

The classification problem
• [Diagram: features extracted from the data sets feed a classifier that separates G1 = Faulty from G2 = Non-Faulty]
• The two groups have different ex-ante distributions (faulty, non-faulty)
• The ex-post classification differs for different classifier thresholds
• The problem varies depending on how many errors we allow in the system
• c – cut-off value, i.e., the number of errors allowed in a feature
• Categories:
  • G1(c) = {v = [sv, μ(sv), ν(sv)] | ρ(sv) ≥ c} – faulty
  • G2(c) = {v = [sv, μ(sv), ν(sv)] | ρ(sv) < c} – non-faulty
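A minimal sketch of the cut-off categorisation, reusing the assumed feature dictionaries from the sketch above; the helper name is hypothetical:

    def categorise(features, c):
        # G1(c): faulty features with rho(sv) >= c;
        # G2(c): non-faulty features with rho(sv) < c.
        g1 = [v for v in features if v["rho"] >= c]
        g2 = [v for v in features if v["rho"] < c]
        return g1, g2

    # The two groups give labelled training data for any standard binary
    # classifier; e.g., with scikit-learn (an assumption, not the
    # original tooling):
    #   X = [[*v["sv"], v["mu"], v["nu"]] for v in features]
    #   y = [int(v["rho"] >= c) for v in features]
    #   clf = LogisticRegression().fit(X, y)
    # Moving the decision threshold on clf.predict_proba changes the
    # ex-post classification, as the slide on thresholds notes.

With c = 1, G1(c) coincides with the faulty features of the earlier slide, i.e., those with ρ(sv) > 0.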