Mining Time Series L. Torgo & T. Rudolecky Time Series Introduction A Definition Definition A time series is a set of observations of a variable thatare ordered by time. E.g., x1, x2, ,x 1, xt ,xt+1, ,xn where xt is the observation of variable X at time t . A multivariate time series is a set of observations of a set of variables over a certain period of time. Time Series 2 / 45 Goals Explanation Obtaining a Time Series Model help us to have a Deeper Understanding of the Mechanism that Generated the Observed Time Series Data. Time Series 3 / 45 Goals Forecasting The Goal of Time Series Forecasting The Past! Given: x1, x2, ,x 1, xt Obtain: a time series model Which is able to make predictions concerning: xt+1, ,xn The Future! Time Series 4 / 45 Goals Time Series Data Mining Main Time Series Data Mining Tasks Indexing (Query by Content) Given a query time series Q and a similarity measure D(Q, X) find the most similar time series in a database D Clustering Find the natural goupings of a set of time series in a database D using some similarity measure D(Q, X) Classification Given an unlabelled time series Q, assign it a label C from a set of pre-defined labels (classes) Time Series 5 / 45 Exploratory Analysis Summaries Summaries of Time Series Data Standard descriptive statistics (mean, standard deviation, etc.) do not allways work with time series (TS) data. TS may contain trends, seasonality and some other systematic components, making these stats misleading. So, for proving summaries of TS data we will be interested in concepts like trend, seasonality and correlation between sucessive observations of the TS. Time Series 6 / 45 Exploratory Analysis Variation Types of Variation Seasonal Variation Some time series exhibit a variation that is annual in period, e.g. demand for ice cream. Other Cyclic Variation Some time series have periodic variations that are not related to seasons but to other factors, e.g. some economic time series. Trends A trend is a long-term change in the mean level of the time series. Time Series 7 / 45 Exploratory Analysis Stationarity Stationarity An Informal Definition A time series is said to be stationary if there is no systematic change in mean (no trend), if there is no systematic change in variance and if strictly periodic variations have been removed. Note that in these cases statistics like mean, standard deviation, variance, etc., bring relevant information! Time Series 8 / 45 Exploratory Analysis Stationarity Stationarity Time Series 9 / 45 Exploratory Analysis Time Plots Time Plots flow.vatflow.jokprec 0100204060820060100140103050 1972.0 1972.5 1973.0 1974.0 1974.5 1975.0 temp 1973.5 Time ice.river Ploting the time series values against time is one of the most important tools for analysing its behaviour. Time plots show important features like trends, seasonality, outliers and discontinuities. Time Series 10 / 45 Exploratory Analysis Transformations - I Transformations - I Plotting the data may suggest transformations : To stabilize the variance Symptoms: trend with the variance increasing with the mean. Solution: logarithmic transformation. To make the seasonal effects additive Symptoms: there is a trend and the size of the seasonal effect increases with the mean(multiplicative seasonality). Solution: logarithmic transformation. t t t To remove trend Symptoms: there is systematic change on the mean. Solution 1: first order differentiation ( Xt = Xt X 1). Solution 2: model the trend and subtractit from the original series (Y = Xt rt). Time Series 11 / 45 Exploratory Analysis Transformations - an example (1) Transformations - a simple example (1) 0 20 40 60 80 100 01000200030004000 time xt An example time series with trend and a multiplicative seasonality effect. Time Series 12 / 45 Exploratory Analysis Transformations - an example (2) Transformations - a simple example (2) 0 20 40 60 80 100 020004000 time xt 0 20 40 60 80 100 yt xt 7.708 42.521 t time yt 0500 0 20 40 60 80 100 7.07.47.8 zt log10 ~yt zt 0 20 40 60 80 100 1 wt zt z z time time wt 0 Time Series 13 / 45 Exploratory Analysis Randomness Tests of Randomness Frequently we want to test the hypothesis that the observed time series is random. A possible way is to inspect the correlogram. An alternative, which is frequently used, is the runs test. This test basically checks for things like the number of times the value of xt is above (below) the median value of the series, or the number of times there is a sucession of monotonically increasing (decreasing) values of the series and so on. Time Series 18 / 45 Exploratory Analysis Check List Handling Real World Data A Check List of Common Sense Things to Do (taken from Chatfield, 2004) Do you understand the context? Have the right variables been measured? Have all the time series been plotted? Are there missing values? If so, what should be done about them? Are there any outliers? If so, what should be done about them? Are there any discontinuities? If so, what do they mean? Does it make sense to transform the variables? Is trend present? If so, what should be done about it? Is seasonality present? If so, what should be done about it? Time Series 19 / 45 Why? Measuring Similarity Why? Most time series data mining tasks require the similarity between series to be asserted (e.g. indexing, clustering, classification, etc.). Types of matching There are essentially two variants of similarity matching: Whole matching where the query time series is matched (as a whole) against all time series in the data base. Subsequence matching where all time series in the data base are searched for a subsection match against the query subsequence. Time Series 20 / 45 Distance Measures Defining a Distance Function Distance (or dissimilarity) functions Given any two time series s1 and s2 their distance (or dissimilarity) is denoted by D(s1, s2). Desirable Properties of a Distance Function Symmetry D(X, Y) = D(Y, X) Constancy of Self-Similarity D(X, X) = 0 Positivity D(X, Y) = 0 iff X = Y Triangular Inequality D(X, Y) D(X, Z) + D(Y, Z) Time Series 21 / 45 Distance Measures Types of Distance Functions Metric - satisfy all properties e.g. Euclidean, correlation, etc. Non-metric - do not satisfy any of the properties e.g. time warping, LCSS, etc. Time Series 22 / 45 Distance Measures Minkowski Metrics The Minkowski Metrics City Block (p = 1) Euclidean (p = 2) Time Series 23 / 45 Distance Measures Correlation Correlation between two time series Time Series 24 / 45 Distance Measures Dynamic Time Warping Dynamic Time Warping - introduction Dynamic Time Warping (DTW) is a non-metric distance function. Main Ideas of DTW Allow for local deformations (stretch and shrink) along the time axis. Able to handle series of different lengths X Y Time Series 25 / 45 Distance Measures Dynamic Time Warping Dynamic Time Warping - how to calculate? implemented using dynamic programming techniques. Time Series 26 / 45 Distance Measures Dynamic Time Warping Dynamic Time Warping - how to speed-up calculations? A B We now only fill only a small portion of the matrix Minimum Bounding Envelope (MBE) taken from Michalis Vlachos tutorial on Time Series Restrict the set of paths (warping paths) that are considered to find the Several methods exist to carry out this restriction (e.g. Sakoe-Chiba band, Itakura parallelogram, etc.) Time Series 27 / 45 Distance Measures Dynamic Time Warping LCSS Longest common subsequence Time Series 28 / 45 Distance Measures Dynamic Time Warping LCSS Longest common subsequence Time Series 29 / 45 Distance Measures Dynamic Time Warping Comparison of distance functions The graphs represent the results of associations when computing (A) and (C) distance based on the longest common subsequence (LCSS). Time Series 30 / 45 Goals Goals of an Evaluation Method The golden rule: The data used for evaluating (or comparing) any models cannot be seen during model development. The goal of any evaluation procedure: Obtain a reliable estimate of some evaluation measure. High probability of achieving the same score on other samples of the same population. Evaluation Measures Predictive accuracy. Model size. Computational complexity. Time Series 31 / 45 Reliable Estimates Obtaining Reliable Estimates The usual techniques for model evaluation revolve around resampling. Simulating the reality. Obtain an evaluation estimate for unseen data. Examples of Resampling-based Methods Holdout. Cross-validation. Bootstrap. Time Series Data Are Special! Any form of resampling changes the natural order of the data! Time Series 32 / 45 Evaluation Methodology Correct Evalution of Time Series Models General Guidelines Do not the time tags of the observations. Do not evaluate a model on past data. A possible method Divide the existing data in two time windows Past data (observations till a time t). t). Use one of these three learn-test alternatives Fixed learning window. Growing window. Sliding window. Time Series 33 / 45 Evaluation Methodology Learn-Test Strategies tinit tfinal t Fixed Grow Slide ws wv Fixed Window A single model is obtained with the available data, and applied to all test period. Growing Window Every wv test cases a new model is obtained using all data available till then. observations of the time series. Sliding Window Every wv test cases a new model is obtained using the previous ws Time Series 34 / 45 Evaluation Measures Some Metrics for Evaluating Predictive Performance Time Series 35 / 45 Assumptions Assumptions Approaches Linearity The model of the time series behaviour is linear on its inputs. Stationarity The underlying equations governing the behaviour of the system do not change with time. Most approaches assume stationary time series, thus one usually needs to transform non-stationary time series into stationary ones before using these tools. Time Series 36 / 45 Moving Averages Moving Average Models Definition A moving average of order q, MA(q),is a time series given by lh 1.52.02.53.03.5 series MA(5) MA(15) 0 10 20 30 40 Time Time Series 37 / 45 Exponential MAs Exponential Moving Average Models Definition An exponential moving average is a series given by Yt = Xt + (1 ) EMA (X 1) Y1 = X1 where [0..1] is a smoothing parameter. lh 1.52.02.53.03.5 series EMA(0.25) EMA(0.5) 0 10 20 30 40 Time Time Series 38 / 45 Autoregressive AR Autoregressive (AR) Models Definition An autoregressive model of order p is a series given by lh 1.52.02.53.03.5 series AR(1) AR(3) 0 10 20 30 40 Time Time Series 39 / 45 Autoregressive ARMA Mixed Autoregressive and Moving Average Models Definition A mixed ARMA model of order p, q is a series given by lh 1.52.02.53.03.5 series ARMA(2,1) ARMA(1,3) 0 10 20 30 40 Time Time Series 40 / 45 Autoregressive ARIMA Integrated ARMA (or ARIMA) Models Definition An integrated ARMA (or ARIMA) model of order p,d, q is a series given by Time Series 41 / 45 Case Dependecies Clustering Time Series 42 / 45 [1] Case Dependecies Clustering Time Series 43 / 45 Whole time-series clustering is considered as clustering of a set of individual time-series with respect to their similarity. Here, clustering means applying conventional (usually) clustering on discrete objects, where objects are time-series. Subsequence clustering means clustering on a set of subsequences of a time-series that are extracted via a sliding window, that is, clustering of segments from a single long time-series. Time point clustering is another category of clustering which is seen in some papers. It is clustering of time points based on a combination of their temporal proximity of time points and the similarity of the corresponding values. This approach is similar to time-series segmentation. However, it is different from segmentation as all points do not need to be assigned to clusters, i.e., some of them are considered as noise. [5] Case Dependecies Clustering of time series subsequences Time Series 44 / 45 Subsequence Clustering: Given a single time series, sometimes in the form of streaming time series, individual time series (subsequences) are extracted with a sliding window. Clustering is then performed on the extracted time series subsequences. [6] Case Dependecies Discretisation: SAX, PAA, TVA Time Series 45 / 45 [2] Case Dependecies Resources Time Series 44 / 45 [1] Selina Chu, Eamonn Keogh, David Hart and Michael Pazzani. Iterative Deepening Dynamic Time Warping for Time Series [2] Bilal Esmael, Arghad Arnaout, Rudolf K. Fruhwirth and Gerhard Thonhauser. Multivariate Time Series Classification by Combining TrendBased and Value-Based Approximations [3] Nuno Castro and Paulo Azevedo. Multiresolution Motif Discovery in Time Series [4] Florence Duchene, Catherine Garbay and Vincent Rialle. Mining Heterogeneous Multivariate Time-Series for Learning Meaningful Patterns: Application to Home Health Telecare [5] Saeed Aghabozorgi, Ali Seyed Shirkhorshidi and Teh Ying Wah. Timeseries clustering - A decade review [6] Eamonn Keogh and Jessica Lin. Clustering of Time Series Subsequences is Meaningless: Implications for Previous and Future Researc