M7777 Applied Functional Data Analysis
3. From Data to Functions: Smoothing
Jan Koláček (kolacek@math.muni.cz)
Dept. of Mathematics and Statistics, Faculty of Science, Masaryk University, Brno
Jan Koláček (SCI MUNI), M7777 Applied FDA, Fall 2019

From Data to Functions
How do we go from data to functions?
[Figure: two panels titled "St. Johns", each plotted against Days: the observed data, then a smooth function.]

Basis Expansions
We consider
$$y_i = x(t_i) + \varepsilon_i, \qquad \varepsilon_i \sim \text{i.i.d.},$$
and
$$x(t_i) = \sum_{j=1}^{K} c_j \phi_j(t_i). \qquad (1)$$
Let us denote
• $y = (y_1, \dots, y_N)'$, $x = (x(t_1), \dots, x(t_N))'$
• $\Phi$ ... an $N \times K$ matrix containing the values $\phi_j(t_i)$
• $c = (c_1, \dots, c_K)'$ ... basis coefficients
We can write (1) as $x = \Phi c$.

Least Squares
How to find $c$? Minimize the sum of squared errors
$$\mathrm{SSE}(c) = \sum_{i=1}^{N} \left(y_i - x(t_i)\right)^2 = (y - \Phi c)'(y - \Phi c).$$
This is just linear regression! The SSE is minimized by the ordinary least squares estimate
$$\hat{c} = (\Phi'\Phi)^{-1}\Phi' y.$$
Thus we have the estimate
$$\hat{x}(t) = \phi(t)'(\Phi'\Phi)^{-1}\Phi' y, \qquad \phi(t) = (\phi_1(t), \dots, \phi_K(t))'.$$
Standard model: suppose the $\varepsilon_i$ are i.i.d. with mean zero and variance $\sigma^2$ $\Rightarrow$ $E(y) = \Phi c$, $\mathrm{Var}(y) = \sigma^2 I_N$.

Weighted Least Squares
Practical problems
• heteroscedastic data
• autocorrelated data
Partial solution: Weighted Least Squares
$$\mathrm{WSSE}(c) = \sum_{i=1}^{N} w_i^2 \left(y_i - x(t_i)\right)^2 = (y - \Phi c)'W'W(y - \Phi c)$$
with $W = \mathrm{diag}\{w_1, \dots, w_N\}$, so that $W'W = \mathrm{diag}\{w_1^2, \dots, w_N^2\}$. Minimizing WSSE gives the estimate
$$\hat{x}(t) = \phi(t)'(\Phi'W'W\Phi)^{-1}\Phi'W'W y$$
and the fitted values
$$\hat{y} = \Phi(\Phi'W'W\Phi)^{-1}\Phi'W'W y = S y.$$
$S$ ... smoothing matrix.

Choosing the Number of Basis Functions
How many basis functions?
• A small number of basis functions means little flexibility.
• A larger number of basis functions adds flexibility, but may overfit.
• For monomial and Fourier bases, just add functions to the collection.
• Spline bases: adding knots or increasing the order changes the basis.
Trade-off:
• Too many basis functions overfit the data and reflect measurement error.
• Too few basis functions fail to capture interesting features of the curves.

Bias and Variance Tradeoff
Measures of quality of $\hat{x}(t)$
• Bias: $\mathrm{Bias}[\hat{x}(t)] = E[\hat{x}(t)] - x(t)$
• Variance: $\mathrm{Var}[\hat{x}(t)] = E\{\hat{x}(t) - E[\hat{x}(t)]\}^2$
• Too many basis functions $\Rightarrow$ small bias, large sampling variance.
• Too few basis functions $\Rightarrow$ small sampling variance, large bias.
• Mean Squared Error: $\mathrm{MSE}[\hat{x}(t)] = E\left[\{\hat{x}(t) - x(t)\}^2\right] = \mathrm{Bias}^2[\hat{x}(t)] + \mathrm{Var}[\hat{x}(t)]$
• Integrated Mean Squared Error: $\mathrm{IMSE}[\hat{x}(t)] = \int \mathrm{MSE}[\hat{x}(t)]\,dt$

Choosing the Number of Basis Functions
St. Johns Precipitation: 3 Fourier Bases
[Figure: left panel "St. Johns", data and fit against Days; right panel "MISE for St. Johns", the MISE, Bias and Var terms against # Fourier Basis.]

Choosing the Number of Basis Functions
St. Johns Precipitation: 5 Fourier Bases
[Figure]

Choosing the Number of Basis Functions
St. Johns Precipitation: 7 Fourier Bases
[Figure: left panel "St. Johns", data and fit against Days; right panel "MISE for St. Johns", the MISE, Bias and Var terms against # Fourier Basis.]

Choosing the Number of Basis Functions
St. Johns Precipitation: 9 Fourier Bases
[Figure: left panel "St. Johns", data and fit against Days; right panel "MISE for St. Johns", the MISE, Bias and Var terms against # Fourier Basis.]
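The least-squares estimate $\hat{c} = (\Phi'\Phi)^{-1}\Phi' y$ and the effect of the number of basis functions $K$ can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the data are synthetic (not the St. Johns series), the basis is an unnormalized Fourier basis, and the helper names `fourier_basis` and `ls_fit` are hypothetical, not functions of the fda package.

```python
import numpy as np


def fourier_basis(t, K, period):
    """Return the N x K matrix Phi: a constant column, then sin/cos pairs.

    K is assumed odd (1 constant + (K - 1) / 2 harmonic pairs).
    """
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    for j in range(1, (K - 1) // 2 + 1):
        w = 2.0 * np.pi * j / period
        cols.extend([np.sin(w * t), np.cos(w * t)])
    return np.column_stack(cols)


def ls_fit(Phi, y):
    """Least-squares coefficients; equal to (Phi'Phi)^{-1} Phi'y for full column rank."""
    c_hat, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return c_hat


# Synthetic "daily" data (illustrative assumption, not real measurements).
rng = np.random.default_rng(0)
t = np.arange(365) + 0.5
x_true = 3.0 + 1.5 * np.sin(2 * np.pi * t / 365) + 0.5 * np.cos(4 * np.pi * t / 365)
y = x_true + rng.normal(scale=1.0, size=t.size)

# The SSE on the observed data cannot increase as K grows (nested bases),
# so the SSE alone cannot tell us when we start to overfit.
sse = {}
for K in (3, 5, 9, 35):
    Phi = fourier_basis(t, K, period=365.0)
    c_hat = ls_fit(Phi, y)
    sse[K] = float(np.sum((y - Phi @ c_hat) ** 2))
    print(K, round(sse[K], 2))
```

Using `np.linalg.lstsq` rather than forming $(\Phi'\Phi)^{-1}$ explicitly is a numerical design choice; both give the same $\hat{c}$ when $\Phi$ has full column rank.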
Choosing the Number of Basis Functions
St. Johns Precipitation: 15 Fourier Bases
[Figure]

Choosing the Number of Basis Functions
St. Johns Precipitation: 35 Fourier Bases
[Figure: panel "St. Johns", data and fit against Days.]

Choosing the Number of Basis Functions
St. Johns Precipitation: 75 Fourier Bases
[Figure: panel "St. Johns", data and fit against Days.]

Choosing the Number of Basis Functions
St. Johns Precipitation: 105 Fourier Bases
[Figure: panel "St. Johns", data and fit against Days.]

Choosing the Number of Basis Functions
St. Johns Precipitation: 365 Fourier Bases
[Figure: panel "St. Johns", data and fit against Days.]

Choosing the Number of Basis Functions
Cross-Validation
• Leave out one observation $(t_i, y_i)$ and construct an estimate $\hat{x}_{-i}(t)$ from the remaining data.
• Choose $K$ to minimize the ordinary cross-validation score
$$\mathrm{OCV}(K) = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{x}_{-i}(t_i)\right)^2.$$
For a linear smooth $\hat{y} = S y$:
$$\mathrm{OCV}(K) = \frac{1}{N}\sum_{i=1}^{N} \frac{\left(y_i - \hat{x}(t_i)\right)^2}{(1 - S_{ii})^2}.$$

Choosing the Number of Basis Functions
St. Johns Precipitation: Cross-Validated Error
[Figure: "OCV for St. Johns", the OCV score against # Fourier Basis (1 to 39).]
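The two forms of the ordinary cross-validation score can be compared numerically. The sketch below assumes an unweighted least-squares smoother with a Fourier basis and synthetic data; the function names (`fourier_basis`, `ocv_shortcut`, `ocv_leave_one_out`) are hypothetical helpers, not part of the fda package. It computes OCV once from the diagonal of the smoothing matrix $S$ and once by actually refitting with each observation left out.

```python
import numpy as np


def fourier_basis(t, K, period):
    """N x K matrix: constant column plus sin/cos pairs (K assumed odd)."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    for j in range(1, (K - 1) // 2 + 1):
        w = 2.0 * np.pi * j / period
        cols.extend([np.sin(w * t), np.cos(w * t)])
    return np.column_stack(cols)


def ocv_shortcut(Phi, y):
    """OCV from a single fit, using the diagonal of S = Phi (Phi'Phi)^{-1} Phi'."""
    S = Phi @ np.linalg.pinv(Phi)
    resid = y - S @ y
    return float(np.mean((resid / (1.0 - np.diag(S))) ** 2))


def ocv_leave_one_out(Phi, y):
    """OCV by brute force: refit N times, each time without observation i."""
    N = y.size
    errs = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i
        c, *_ = np.linalg.lstsq(Phi[keep], y[keep], rcond=None)
        errs[i] = y[i] - Phi[i] @ c
    return float(np.mean(errs ** 2))


# Synthetic data on a 5-day grid (illustrative assumption).
rng = np.random.default_rng(1)
t = np.arange(73) * 5.0 + 2.5
y = 3.0 + 1.5 * np.sin(2 * np.pi * t / 365) + rng.normal(scale=0.5, size=t.size)

# Score every odd K from 1 to 21 and keep the minimizer.
candidates = list(range(1, 22, 2))
scores = {K: ocv_shortcut(fourier_basis(t, K, 365.0), y) for K in candidates}
best_K = min(scores, key=scores.get)
print("K minimizing OCV:", best_K)
```

The shortcut needs one fit per candidate $K$ instead of $N$ fits, which is why it is preferred for linear smoothers.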
Pointwise Confidence Bands
Estimating the Variance
• The standard model assumes $\mathrm{Var}[y] = \sigma^2 I_N$.
• An unbiased estimate (can be more sophisticated for correlated residuals):
$$\hat{\sigma}^2 = \frac{1}{N - K}\,\mathrm{SSE}(\hat{c}).$$
• For a linear smooth $\hat{y} = S y$:
$$\mathrm{Var}[\hat{y}] = \sigma^2 S S'.$$
• More generally, for $\mathrm{Var}[y] = \Sigma$:
$$\mathrm{Var}[\hat{y}] = S \Sigma S'.$$

Pointwise Confidence Bands
Pointwise Confidence Intervals
• For each point we calculate lower and upper bands for $\hat{y}$ by
$$\hat{y} \pm u_{0.975}\sqrt{\mathrm{Var}[\hat{y}]}.$$
• These bands are not confidence bands for the entire curve, but only for the value of the curve at a fixed point.
• They ignore the bias in the estimated curve.
• They provide an impression of how well the curve is estimated.

Choosing the Number of Basis Functions
Fitted St. Johns Precipitation Data with 9 Fourier Bases and Confidence Bands
[Figure: panel "St. Johns", data, fit and bands against Days.]

Problems to solve
1. Melanoma Data
• Load the variable melanoma from the fda package and plot it.
• Fit these data with a Fourier basis, choosing the number of basis functions by minimizing the gcv value returned by smooth.basis. Plot the cross-validation function and the final fit (see Figures 1, 2).
• Try removing a linear trend from these data first, by looking at the residuals after a call to lm. Repeat the steps above; does the optimal number of basis functions change?
• Re-fit the data using a GCV-optimal B-spline basis. Plot the CV function for this basis and the final fit (see Figures 3, 4).
• Plot the previous fit with its pointwise 95% confidence bands (see Figure 5). What is the observed value of incidence in 1950? What is the estimated confidence band for the mean in this year?
[2.2, (2.49, 2.96)]

Problems to solve
2. Canadian Weather Data
• Load the variable CanadianWeather from the fda package and select the precipitation in St. Johns.
• Fit these data using a B-spline basis with 5 basis functions.
• What is the observed value of precipitation in St. Johns on January 23? What is the estimated confidence band for the mean on this day?
[4.4, (4.37, 4.98)]

Problems to solve
Figure 1. [Cross-validation function against # Fourier Basis.]
Figure 2. [Final fit with the Fourier basis.]
Figure 3. [Cross-validation function against # B-spline Basis.]
Figure 4. [Final fit with the B-spline basis.]
Figure 5. [B-spline fit with pointwise 95% confidence bands against Year, 1940 to 1970.]
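The pointwise bands asked for in the problems follow the recipe from the confidence-band slides: estimate $\sigma^2$ by $\mathrm{SSE}/(N-K)$, take $\mathrm{Var}[\hat{y}] = \hat{\sigma}^2 S S'$, and form $\hat{y} \pm u_{0.975}\sqrt{\mathrm{Var}[\hat{y}]}$. The following is a minimal NumPy sketch of that recipe only. It assumes an unweighted fit, a Fourier basis instead of the B-spline basis used in the exercises, and synthetic data; `fourier_basis` and `pointwise_bands` are hypothetical names, and the fda package is not used.

```python
import numpy as np


def fourier_basis(t, K, period):
    """N x K matrix: constant column plus sin/cos pairs (K assumed odd)."""
    t = np.asarray(t, dtype=float)
    cols = [np.ones_like(t)]
    for j in range(1, (K - 1) // 2 + 1):
        w = 2.0 * np.pi * j / period
        cols.extend([np.sin(w * t), np.cos(w * t)])
    return np.column_stack(cols)


def pointwise_bands(Phi, y, u=1.959964):
    """Fitted values with pointwise bands y_hat +/- u * sqrt(diag(sigma2_hat * S S')).

    u defaults to the 0.975 quantile of the standard normal distribution.
    """
    N, K = Phi.shape
    S = Phi @ np.linalg.pinv(Phi)            # smoothing matrix of the LS fit
    y_hat = S @ y
    sigma2_hat = float(np.sum((y - y_hat) ** 2)) / (N - K)
    se = np.sqrt(sigma2_hat * np.diag(S @ S.T))
    return y_hat, y_hat - u * se, y_hat + u * se, sigma2_hat


# Synthetic daily data with noise variance 0.25 (illustrative assumption).
rng = np.random.default_rng(2)
t = np.arange(365) + 0.5
y = 3.0 + 1.5 * np.sin(2 * np.pi * t / 365) + rng.normal(scale=0.5, size=t.size)

Phi = fourier_basis(t, 9, 365.0)
y_hat, lower, upper, sigma2_hat = pointwise_bands(Phi, y)

# Observed value and band at one fixed argument, e.g. day index 22 (January 23).
i = 22
print(round(float(y[i]), 2), (round(float(lower[i]), 2), round(float(upper[i]), 2)))
```

As the slides note, such bands describe the uncertainty of the fitted mean at each fixed point and ignore bias, so an individual observation can easily fall outside them.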