Mining logs to predict system errors
prof. Barbara Russo
Free University of Bozen-Bolzano, Italy
SwSE - Software and Systems Engineering
Masaryk University, November 10th 2016
Today's systems
•Computational power increases
•(2015) Tianhe-2, China - 3,120,000 cores
•The larger the system, the more frequent the critical events
•Lower overall system utilisation
•Hardware failures, software failures, and user errors
Manage failures …
•Crashes
•immediately stop the system
•easily identifiable (e.g., disk failure)
•but can generate a large number of events spread across components
•Deviations from the expected output
•let the system run
•reveal themselves only at the completion of system tasks
To better manage failures …
… we need information on system behaviour to make failure predictions
Systems generate big data
•Part of such data traces the changes in behaviour of the system and its sub-components
•Logging services store the state changes of a system in archives: logs
Mining logs
•How can we exploit log data to model and predict system behaviour?
Log events
•A log event represents a change in a system state
XML log event
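What might such an event look like? A minimal sketch in Python, parsing a hypothetical XML-encoded log event (the tag and attribute names here are illustrative, not taken from any specific logging service):

```python
import xml.etree.ElementTree as ET

# Hypothetical XML log event; tags, attributes, and values are illustrative.
raw_event = """
<event timestamp="2016-11-10T14:32:07" host="MYHOST" severity="ERROR">
    <component>scheduler</component>
    <state>job_queue_full</state>
    <message>Job 4711 rejected: queue limit reached</message>
</event>
"""

event = ET.fromstring(raw_event)
print(event.get("timestamp"), event.get("severity"))          # when and how bad
print(event.findtext("component"), "->", event.findtext("state"))  # state change
```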
System misbehaviour
•Some events can indicate undesirable system behaviour
Error events act as alerts
•Events in error state (error events) act as alerts of system failures:
•interpretation of event data might be hard
•they originate from a series of preceding events
Logs can be cryptic
[Figure: excerpt of a cryptic SAP log]
Interpretation
YY-MM-DD-HH:MM:SS NULL ZZZ MYHOST FAILURE xxx exited normally with exit code 0
•A failure is reported, yet the program exited cleanly
Interpretation
•If the system administrator was doing maintenance on the machine, this message is a harmless artefact of his actions
•If it was generated during normal machine operation, this message indicates that all running jobs on the computer were undesirably killed
Operational context
•We need to understand the operational context
Originated from a series of preceding events
•System changes may have introduced errors much earlier than an error manifests in the logs
The classification problem
[Diagram: data sets → features → classifier → classes G1 = Faulty, G2 = Non-Faulty]
Sequences
•Event sequence: a set of events, ordered by their timestamps, occurring within a given time window
•A sequence abstraction is a representation of such a sequence (e.g., a vector) that can be used to feed classifiers (features)
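A minimal sketch of how such sequences might be isolated from a timestamped event stream (Python; the fixed-width window policy is an assumption, the slides do not prescribe a specific windowing rule):

```python
from datetime import datetime, timedelta

def isolate_sequences(events, window=timedelta(minutes=5)):
    """Split (timestamp, event_type) pairs into sequences: a new
    sequence starts whenever the gap to the first event of the
    current window exceeds `window`."""
    sequences, current, window_start = [], [], None
    for ts, etype in sorted(events):
        if window_start is None or ts - window_start > window:
            if current:
                sequences.append(current)
            current, window_start = [], ts
        current.append(etype)
    if current:
        sequences.append(current)
    return sequences

events = [
    (datetime(2016, 11, 10, 14, 0, 5), "Log In"),
    (datetime(2016, 11, 10, 14, 2, 0), "Systems"),
    (datetime(2016, 11, 10, 14, 30, 0), "General"),
]
print(isolate_sequences(events))  # [['Log In', 'Systems'], ['General']]
```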
Building features
A. Isolating sequences
•Identify sequence length
•Characterise sequence information
B. Build a sequence abstraction (e.g., a vector)
C. Build features
Isolating sequences
[Figure: raw sequences have different lengths and different event types]
Sequence abstraction
•μi – number of events of type i in a sequence (multiplicity)
•sv = [μ1, …, μn] – vector of event multiplicities
Example – sequence abstraction
{General, Log In, Performance, Systems}
sv1 = [0,1,0,1]
sv2 = [2,1,1,0]
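The mapping from the example to multiplicity vectors, as a small Python sketch (the event-type alphabet is fixed up front):

```python
from collections import Counter

EVENT_TYPES = ["General", "Log In", "Performance", "Systems"]

def abstract(sequence):
    """Map an event sequence to its multiplicity vector sv = [mu_1, ..., mu_n]."""
    counts = Counter(sequence)
    return [counts[t] for t in EVENT_TYPES]

print(abstract(["Log In", "Systems"]))                            # sv1 = [0, 1, 0, 1]
print(abstract(["General", "General", "Log In", "Performance"]))  # sv2 = [2, 1, 1, 0]
```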
Multiple sequences and users
[Table: sequences s2, s7, s10, s14, s30 mapped onto multiplicity vectors (μ1, …, μn) — same length, same types]
Features
•v = [sv, μ(sv), ν(sv)] – feature
•μ(sv) = number of sequences mapping onto sv
•ν(sv) = average number of users in sequences mapping onto sv
•ρ(sv) = number of errors in sequences mapping onto sv
•v is a faulty feature if at least one event in one of its sequences is in an error state, i.e., ρ(sv) > 0
Example – features
v1 = [0,1,0,1; 1,1], sv1 = [0,1,0,1]
μ(sv1) = 1, ν(sv1) = 1, ρ(sv1) = 0
v2 = [2,1,1,0; 1,2], sv2 = [2,1,1,0]
μ(sv2) = 1, ν(sv2) = 2, ρ(sv2) = 2
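A sketch of how these features could be computed from per-sequence data (Python; the assumption is that each sequence arrives with its abstraction sv, a user count, and an error count):

```python
from collections import defaultdict

def build_features(sequences):
    """Each input item is (sv, n_users, n_errors) for one sequence.
    Returns one feature per distinct sv:
    mu(sv) = number of sequences mapping onto sv,
    nu(sv) = average number of users in those sequences,
    rho(sv) = total number of errors in those sequences."""
    groups = defaultdict(list)
    for sv, n_users, n_errors in sequences:
        groups[tuple(sv)].append((n_users, n_errors))
    features = []
    for sv, rows in groups.items():
        mu = len(rows)
        nu = sum(u for u, _ in rows) / mu
        rho = sum(e for _, e in rows)
        features.append((list(sv), mu, nu, rho))
    return features

seqs = [([0, 1, 0, 1], 1, 0), ([2, 1, 1, 0], 2, 2)]
for sv, mu, nu, rho in build_features(seqs):
    print(sv, mu, nu, rho)   # matches v1 and v2 above
```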
Which models do we use to predict system behaviour?
The classification problem
[Diagram: data sets → features → classifier → classes G1 = Faulty, G2 = Non-Faulty]
•Different ex-ante distributions: (faulty, non-faulty)
•Ex-post classification differs with different classifier thresholds
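The threshold effect can be sketched with any probabilistic classifier; scikit-learn's LogisticRegression here is an assumption, since the slides do not name a specific model. The same fitted model assigns more or fewer features to the faulty class as the decision threshold moves:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))               # synthetic feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)     # 1 = faulty, 0 = non-faulty

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)[:, 1]          # estimated P(faulty)

# Ex-post classification depends on the chosen threshold.
for threshold in (0.3, 0.5, 0.7):
    n_faulty = (proba >= threshold).sum()
    print(f"threshold={threshold}: {n_faulty} features classified as faulty")
```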
•The problem varies depending on how many errors we allow in the system
•c – cut-off value, i.e., the number of errors allowed in a feature
•Categories:
•G1(c) = {v = [sv, μ(sv), ν(sv)] | ρ(sv) ≥ c} – faulty
•G2(c) = {v = [sv, μ(sv), ν(sv)] | ρ(sv) < c} – non-faulty
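A small Python sketch of the two categories, reusing the feature tuples built earlier (rho is the error count stored with each feature):

```python
def split_by_cutoff(features, c):
    """Partition features v = (sv, mu, nu, rho) by the cut-off c:
    G1(c) = features with rho(sv) >= c (faulty),
    G2(c) = features with rho(sv) <  c (non-faulty)."""
    g1 = [v for v in features if v[3] >= c]
    g2 = [v for v in features if v[3] < c]
    return g1, g2

features = [([0, 1, 0, 1], 1, 1.0, 0), ([2, 1, 1, 0], 1, 2.0, 2)]
g1, g2 = split_by_cutoff(features, c=1)
print(len(g1), len(g2))   # 1 faulty, 1 non-faulty
```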