Mining Time Series
L. Torgo & T. Rudolecky
Time Series Introduction
A Definition
Definition
A time series is a set of observations of a variable thatare
ordered by time.
E.g.,
x1, x2, ,x 1, xt ,xt+1, ,xn
where xt is the observation of variable X at time t .
A multivariate time series is a set of observations of a set of
variables over a certain period of time.
Time Series 2 / 45
Goals
Explanation
Obtaining a Time Series Model help us to have
a Deeper Understanding of the Mechanism
that Generated the Observed Time Series Data.
Time Series 3 / 45
Goals
Forecasting
The Goal of Time Series Forecasting
The Past!
Given:
x1, x2, ,x 1, xt
Obtain:
a time series model
Which is able to make predictions concerning:
xt+1, ,xn The Future!
Time Series 4 / 45
Goals
Time Series Data Mining
Main Time Series Data Mining Tasks
Indexing (Query by Content)
Given a query time series Q and a similarity measure D(Q, X)
find the most similar time series in a database D
Clustering
Find the natural goupings of a set of time series in a database D
using some similarity measure D(Q, X)
Classification
Given an unlabelled time series Q, assign it a label C from a set of
pre-defined labels (classes)
Time Series 5 / 45
Exploratory Analysis Summaries
Summaries of Time Series Data
Standard descriptive statistics (mean, standard deviation, etc.) do
not allways work with time series (TS) data.
TS may contain trends, seasonality and some other systematic
components, making these stats misleading.
So, for proving summaries of TS data we will be interested in
concepts like trend, seasonality and correlation between
sucessive observations of the TS.
Time Series 6 / 45
Exploratory Analysis Variation
Types of Variation
Seasonal Variation
Some time series exhibit a variation that is annual in period, e.g.
demand for ice cream.
Other Cyclic Variation
Some time series have periodic variations that are not related to
seasons but to other factors, e.g. some economic time series.
Trends
A trend is a long-term change in the mean level of the time series.
Time Series 7 / 45
Exploratory Analysis Stationarity
Stationarity
An Informal Definition
A time series is said to be stationary if
there is no systematic change in mean (no trend),
if there is no systematic change in variance and
if strictly periodic variations have been removed.
Note that in these cases statistics like mean, standard deviation,
variance, etc., bring relevant information!
Time Series 8 / 45
Exploratory Analysis Stationarity
Stationarity
Time Series 9 / 45
Exploratory Analysis Time Plots
Time Plots
flow.vatflow.jokprec
0100204060820060100140103050
1972.0 1972.5 1973.0 1974.0 1974.5 1975.0
temp
1973.5
Time
ice.river
Ploting the time series
values against time is
one of the most
important tools for
analysing its behaviour.
Time plots show
important features like
trends, seasonality,
outliers and
discontinuities.
Time Series 10 / 45
Exploratory Analysis Transformations - I
Transformations - I
Plotting the data may suggest transformations :
To stabilize the variance
Symptoms: trend with the variance increasing with the mean.
Solution: logarithmic transformation.
To make the seasonal effects additive
Symptoms: there is a trend and the size of the seasonal effect
increases with the mean(multiplicative seasonality).
Solution: logarithmic transformation.
t t t
To remove trend
Symptoms: there is systematic change on the mean.
Solution 1: first order differentiation ( Xt = Xt X 1).
Solution 2: model the trend and subtractit from the original series
(Y = Xt rt).
Time Series 11 / 45
Exploratory Analysis Transformations - an example (1)
Transformations - a simple example (1)
0 20 40 60 80 100
01000200030004000
time
xt
An example time series with
trend and a multiplicative
seasonality effect.
Time Series 12 / 45
Exploratory Analysis Transformations - an example (2)
Transformations - a simple example (2)
0 20 40 60 80 100
020004000
time
xt
0 20 40 60 80 100
yt xt 7.708 42.521 t
time
yt
0500
0 20 40 60 80 100
7.07.47.8
zt log10
~yt
zt
0 20 40 60 80 100
1
wt
zt z
z
time time
wt
0
Time Series 13 / 45
Exploratory Analysis Randomness
Tests of Randomness
Frequently we want to test the hypothesis that the observed time
series is random.
A possible way is to inspect the correlogram.
An alternative, which is frequently used, is the runs test.
This test basically checks for things like the number of times the value
of xt is above (below) the median value of the series, or the number of
times there is a sucession of monotonically increasing (decreasing)
values of the series and so on.
Time Series 18 / 45
Exploratory Analysis Check List
Handling Real World Data
A Check List of Common Sense Things to Do
(taken from Chatfield, 2004)
Do you understand the context? Have the right variables been
measured?
Have all the time series been plotted?
Are there missing values? If so, what should be done about them?
Are there any outliers? If so, what should be done about them?
Are there any discontinuities? If so, what do they mean?
Does it make sense to transform the variables?
Is trend present? If so, what should be done about it?
Is seasonality present? If so, what should be done about it?
Time Series 19 / 45
Why?
Measuring Similarity
Why?
Most time series data mining tasks require the similarity between
series to be asserted (e.g. indexing, clustering, classification, etc.).
Types of matching
There are essentially two variants of similarity matching:
Whole matching
where the query time series is matched (as a whole) against all
time series in the data base.
Subsequence matching
where all time series in the data base are searched for a
subsection match against the query subsequence.
Time Series 20 / 45
Distance Measures
Defining a Distance Function
Distance (or dissimilarity) functions
Given any two time series s1 and s2 their distance (or dissimilarity) is
denoted by D(s1, s2).
Desirable Properties of a Distance Function
Symmetry
D(X, Y) = D(Y, X)
Constancy of Self-Similarity
D(X, X) = 0
Positivity
D(X, Y) = 0 iff X = Y
Triangular Inequality
D(X, Y) D(X, Z) + D(Y, Z)
Time Series 21 / 45
Distance Measures
Types of Distance Functions
Metric - satisfy all properties
e.g. Euclidean, correlation, etc.
Non-metric - do not satisfy any of the properties
e.g. time warping, LCSS, etc.
Time Series 22 / 45
Distance Measures Minkowski Metrics
The Minkowski Metrics
City Block (p = 1)
Euclidean (p = 2)
Time Series 23 / 45
Distance Measures Correlation
Correlation between two time series
Time Series 24 / 45
Distance Measures Dynamic Time Warping
Dynamic Time Warping - introduction
Dynamic Time Warping (DTW) is a non-metric distance function.
Main Ideas of DTW
Allow for local deformations (stretch and shrink) along the time
axis.
Able to handle series of different lengths
X
Y
Time Series 25 / 45
Distance Measures Dynamic Time Warping
Dynamic Time Warping - how to calculate?
implemented using dynamic programming techniques.
Time Series 26 / 45
Distance Measures Dynamic Time Warping
Dynamic Time Warping - how to speed-up
calculations?
A
B
We now only fill only a small
portion of the matrix
Minimum
Bounding
Envelope
(MBE)
taken from Michalis Vlachos tutorial on Time Series
Restrict the set of paths (warping paths) that are considered to find the
Several methods exist to carry out this restriction (e.g.
Sakoe-Chiba band, Itakura parallelogram, etc.)
Time Series 27 / 45
Distance Measures Dynamic Time Warping
LCSS Longest common subsequence
Time Series 28 / 45
Distance Measures Dynamic Time Warping
LCSS Longest common subsequence
Time Series 29 / 45
Distance Measures Dynamic Time Warping
Comparison of distance functions
The graphs represent the results of associations when computing (A)
and (C) distance based on the longest common subsequence (LCSS).
Time Series 30 / 45
Goals
Goals of an Evaluation Method
The golden rule:
The data used for evaluating (or comparing) any models cannot
be seen during model development.
The goal of any evaluation procedure:
Obtain a reliable estimate of some evaluation measure.
High probability of achieving the same score on other samples of
the same population.
Evaluation Measures
Predictive accuracy.
Model size.
Computational complexity.
Time Series 31 / 45
Reliable Estimates
Obtaining Reliable Estimates
The usual techniques for model evaluation revolve around
resampling.
Simulating the reality.
Obtain an evaluation estimate for unseen data.
Examples of Resampling-based Methods
Holdout.
Cross-validation.
Bootstrap.
Time Series Data Are Special!
Any form of resampling changes the natural order of the data!
Time Series 32 / 45
Evaluation Methodology
Correct Evalution of Time Series Models
General Guidelines
Do not the time tags of the observations.
Do not evaluate a model on past data.
A possible method
Divide the existing data in two time windows
Past data (observations till a time t).
t).
Use one of these three learn-test alternatives
Fixed learning window.
Growing window.
Sliding window.
Time Series 33 / 45
Evaluation Methodology
Learn-Test Strategies
tinit tfinal
t
Fixed
Grow
Slide
ws
wv
Fixed Window
A single model is obtained with the available data, and
applied to all test period.
Growing Window
Every wv test cases a new model is obtained using all data available
till then.
observations of the time series.
Sliding Window
Every wv test cases a new model is obtained using the previous ws
Time Series 34 / 45
Evaluation Measures
Some Metrics for Evaluating Predictive Performance
Time Series 35 / 45
Assumptions
Assumptions Approaches
Linearity
The model of the time series behaviour is linear on its inputs.
Stationarity
The underlying equations governing the behaviour of the system
do not change with time.
Most approaches assume stationary time series, thus one
usually needs to transform non-stationary time series into stationary
ones before using these tools.
Time Series 36 / 45
Moving Averages
Moving Average Models
Definition
A moving average of order q, MA(q),is a time series given by
lh
1.52.02.53.03.5
series
MA(5)
MA(15)
0 10 20 30 40
Time
Time Series 37 / 45
Exponential MAs
Exponential Moving Average Models
Definition
An exponential moving average is a series given by
Yt = Xt + (1 ) EMA (X 1)
Y1 = X1
where [0..1] is a smoothing parameter.
lh
1.52.02.53.03.5
series
EMA(0.25)
EMA(0.5)
0 10 20 30 40
Time
Time Series 38 / 45
Autoregressive AR
Autoregressive (AR) Models
Definition
An autoregressive model of order p is a series given by
lh
1.52.02.53.03.5
series
AR(1)
AR(3)
0 10 20 30 40
Time
Time Series 39 / 45
Autoregressive ARMA
Mixed Autoregressive and Moving Average Models
Definition
A mixed ARMA model of order p, q is a series given by
lh
1.52.02.53.03.5
series
ARMA(2,1)
ARMA(1,3)
0 10 20 30 40
Time
Time Series 40 / 45
Autoregressive ARIMA
Integrated ARMA (or ARIMA) Models
Definition
An integrated ARMA (or ARIMA) model of order p,d, q is a series
given by
Time Series 41 / 45
Case Dependecies
Clustering
Time Series 42 / 45
[1]
Case Dependecies
Clustering
Time Series 43 / 45
Whole time-series clustering is considered as clustering of a set of
individual time-series with respect to their similarity. Here, clustering
means applying conventional (usually) clustering on discrete objects, where
objects are time-series.
Subsequence clustering means clustering on a set of subsequences of a
time-series that are extracted via a sliding window, that is, clustering of
segments from a single long time-series.
Time point clustering is another category of clustering which is seen in
some papers. It is clustering of time points based on a combination of their
temporal proximity of time points and the similarity of the
corresponding values. This approach is similar to time-series segmentation.
However, it is different from segmentation as all points do not need to be
assigned to clusters, i.e., some of them are considered as noise.
[5]
Case Dependecies
Clustering of time series subsequences
Time Series 44 / 45
Subsequence Clustering: Given a single time series, sometimes in the
form of streaming time series, individual time series (subsequences) are
extracted with a sliding window. Clustering is then performed on the
extracted time series subsequences.
[6]
Case Dependecies
Discretisation: SAX, PAA, TVA
Time Series 45 / 45
[2]
Case Dependecies
Resources
Time Series 44 / 45
[1] Selina Chu, Eamonn Keogh, David Hart and Michael Pazzani. Iterative
Deepening Dynamic Time Warping for Time Series
[2] Bilal Esmael, Arghad Arnaout, Rudolf K. Fruhwirth and Gerhard
Thonhauser. Multivariate Time Series Classification by Combining TrendBased
and Value-Based Approximations
[3] Nuno Castro and Paulo Azevedo. Multiresolution Motif Discovery in
Time Series
[4] Florence Duchene, Catherine Garbay and Vincent Rialle. Mining
Heterogeneous Multivariate Time-Series for Learning Meaningful Patterns:
Application to Home Health Telecare
[5] Saeed Aghabozorgi, Ali Seyed Shirkhorshidi and Teh Ying Wah. Timeseries
clustering - A decade review
[6] Eamonn Keogh and Jessica Lin. Clustering of Time Series Subsequences
is Meaningless: Implications for Previous and Future Researc