M7777 Applied Functional Data Analysis
3. From Data to Functions: Smoothing
Jan Koláček (kolacek@math.muni.cz)
Dept. of Mathematics and Statistics, Faculty of Science, Masaryk University, Brno
Jan Koláček (SCI MUNI), M7777 Applied FDA, Fall 2019

From Data to Functions
How do we go from data to functions?
[Figure: two plots titled "St. Johns" with x-axis "Days", one showing the data and one the function]

Basis Expansions
We consider
$$y_i = x(t_i) + \varepsilon_i, \quad \varepsilon_i \text{ i.i.d.},$$
and
$$x(t_i) = \sum_{j=1}^{K} c_j \phi_j(t_i). \tag{1}$$
Let us denote
• $y = (y_1, \dots, y_N)'$, $x = (x(t_1), \dots, x(t_N))'$
• $\Phi$ ... an $N \times K$ matrix containing the values $\phi_j(t_i)$
• $c = (c_1, \dots, c_K)'$ ... basis coefficients
We can write (1) as $x = \Phi c$.

Least Squares
How to find $c$? Minimize the sum of squared errors
$$\mathrm{SSE}(c) = \sum_{i=1}^{N} (y_i - x(t_i))^2 = (y - \Phi c)'(y - \Phi c).$$
This is just linear regression! The SSE is minimized by the ordinary least squares estimate
$$\hat{c} = (\Phi'\Phi)^{-1}\Phi' y.$$
Thus we have the estimate
$$\hat{x}(t) = \phi(t)'(\Phi'\Phi)^{-1}\Phi' y, \quad \phi(t) = (\phi_1(t), \dots, \phi_K(t))'.$$
Standard model: suppose i.i.d. $\varepsilon_i$ $\Rightarrow$ $E(y) = \Phi c$, $\mathrm{Var}(y) = \sigma^2 I_N$.

Weighted Least Squares
Practical problems
• heteroscedastic data
• autocorrelated data
Partial solution: Weighted Least Squares
$$\mathrm{WSSE}(c) = \sum_{i=1}^{N} w_i (y_i - x(t_i))^2 = (y - \Phi c)' W (y - \Phi c)$$
with $W = \mathrm{diag}\{w_1, \dots, w_N\}$. We get the estimate
$$\hat{x}(t) = \phi(t)'(\Phi' W \Phi)^{-1}\Phi' W y$$
and the fitted values
$$\hat{y} = \Phi(\Phi' W \Phi)^{-1}\Phi' W y = S y.$$
$S$ ... smoothing matrix.

Choosing the Number of Basis Functions
How many basis functions?
• A small number of basis functions means little flexibility.
• A larger number of basis functions adds flexibility, but may "overfit".
• For monomial and Fourier bases, just add functions to the collection.
• Spline bases: adding knots or increasing the order changes the basis.
Trade-off:
• Too many basis functions overfit the data and reflect measurement error.
• Too few basis functions fail to capture interesting features of the curves.

Bias and Variance Tradeoff
Measures of the quality of $\hat{x}(t)$
• Bias
$$\mathrm{Bias}[\hat{x}(t)] = E[\hat{x}(t)] - x(t)$$
• Variance
$$\mathrm{Var}[\hat{x}(t)] = E\{\hat{x}(t) - E[\hat{x}(t)]\}^2$$
• Too many basis functions $\Rightarrow$ small bias, large sampling variance.
• Too few basis functions $\Rightarrow$ small sampling variance, large bias.
• Mean Squared Error
$$\mathrm{MSE}[\hat{x}(t)] = E\left[\{\hat{x}(t) - x(t)\}^2\right] = \mathrm{Bias}^2[\hat{x}(t)] + \mathrm{Var}[\hat{x}(t)]$$
• Integrated Mean Squared Error
$$\mathrm{IMSE}[\hat{x}(t)] = \int \mathrm{MSE}[\hat{x}(t)]\,dt$$

Choosing the Number of Basis Functions
St. Johns Precipitation: 3 Fourier Bases
[Figure: panel "St. Johns" (x-axis: Days) and panel "MISE for St. Johns" showing MISE, Bias and Var against the number of Fourier basis functions]

Choosing the Number of Basis Functions
St. Johns Precipitation: 5 Fourier Bases
[Figure: panel "St. Johns" (x-axis: Days) and panel "MISE for St. Johns" showing MISE, Bias and Var against the number of Fourier basis functions]

Choosing the Number of Basis Functions
St. Johns Precipitation: 7 Fourier Bases
[Figure: panel "St. Johns" (x-axis: Days) and panel "MISE for St. Johns" showing MISE, Bias and Var against the number of Fourier basis functions]

Choosing the Number of Basis Functions
St. Johns Precipitation: 9 Fourier Bases
[Figure: panel "St. Johns" (x-axis: Days) and panel "MISE for St. Johns" showing MISE, Bias and Var against the number of Fourier basis functions]
St. Johns Precipitation: 15 Fourier Bases
[Figure: panel "St. Johns" (x-axis: Days) and panel "MISE for St. Johns" showing MISE, Bias and Var against the number of Fourier basis functions]
St. Johns Precipitation: 35 Fourier Bases
[Figure: panel "St. Johns" (x-axis: Days)]
St. Johns Precipitation: 75 Fourier Bases
[Figure: panel "St. Johns" (x-axis: Days)]
St. Johns Precipitation: 105 Fourier Bases
[Figure: panel "St. Johns" (x-axis: Days)]
St. Johns Precipitation: 365 Fourier Bases
[Figure: panel "St. Johns" (x-axis: Days)]

Choosing the Number of Basis Functions
Cross-Validation
• Leave out one observation $(t_i, y_i)$ and construct an estimate $\hat{x}_{-i}$
from the remaining data.
• Choose $K$ to minimize the ordinary cross-validation score
$$\mathrm{OCV}(\hat{x}) = \frac{1}{N}\sum_{i=1}^{N} \left(y_i - \hat{x}_{-i}(t_i)\right)^2.$$
For a linear smooth $\hat{y} = S y$, with $s_{ii}$ the diagonal elements of $S$,
$$\mathrm{OCV}(\hat{x}) = \frac{1}{N}\sum_{i=1}^{N} \frac{\left(y_i - \hat{x}(t_i)\right)^2}{(1 - s_{ii})^2}.$$

Choosing the Number of Basis Functions
St. Johns Precipitation: Cross-Validated Error
[Figure: "OCV for St. Johns", OCV against the number of Fourier basis functions (odd values 1 to 39)]

Pointwise Confidence Bands
Estimating the Variance
• The standard model assumes $\mathrm{Var}[y] = \sigma^2 I_N$.
• An unbiased estimate (can be more sophisticated for correlated residuals) is
$$\hat{\sigma}^2 = \frac{1}{N - K}\,\mathrm{SSE}(\hat{c}).$$
• For a linear smooth $\hat{y} = S y$,
$$\mathrm{Var}[\hat{y}] = \sigma^2 S S'.$$
• More generally, for $\mathrm{Var}[y] = \Sigma$,
$$\mathrm{Var}[\hat{y}] = S \Sigma S'.$$

Pointwise Confidence Bands
Pointwise Confidence Intervals
• For each point we calculate lower and upper bands for $\hat{y}$ by
$$\hat{y} \pm u_{0.975}\sqrt{\mathrm{Var}[\hat{y}]},$$
where $u_{0.975}$ is the 0.975 quantile of the standard normal distribution and the square root is applied to the diagonal elements of $\mathrm{Var}[\hat{y}]$.
• These bands are not confidence bands for the entire curve, but for the value of the curve at a fixed point.
• They ignore bias in the estimated curve.
• They provide an impression of how well the curve is estimated.

Choosing the Number of Basis Functions
Fitted St. Johns Precipitation Data with 9 Fourier Bases and Confidence Bands
[Figure: panel "St. Johns" (x-axis: Days)]

Problems to solve
1. Melanoma Data
• Load the variable melanoma from the fda package and plot it.
• Fit these data with a Fourier basis, choosing the number of basis functions by minimizing the gcv value returned by smooth.basis. Plot the cross-validation function and the final fit (see Figures 1, 2).
• Try removing a linear trend from these data first, by looking at the residuals after a call to lm.
Repeat the steps above; does the optimal number of basis functions change?
• Re-fit the data using a gcv-optimal B-spline basis. Plot the CV function for this basis and the final fit (see Figures 3, 4).
• Plot the previous fit with its pointwise 95% confidence bands (see Figure 5). What is the observed value of incidences in 1950? What is the estimated confidence band for the mean in this year? [2.9, (2.49, 2.96)]

Problems to solve
2. Canadian Weather Data
• Load the variable CanadianWeather from the fda package and select the precipitation in St. Johns.
• Fit these data using a B-spline basis with 5 basis functions.
• What is the observed value of precipitation in St. Johns on January 23? What is the estimated confidence band for the mean on this day? [4.4, (4.37, 5.07)]

Figure 1. [Plot; x-axis: # Fourier Basis]
Figure 2.
Figure 3. [Plot; x-axis: # B-spline Basis]
Figure 4.
Figure 5.
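
The least-squares smoother $S = \Phi(\Phi'\Phi)^{-1}\Phi'$ and the cross-validation shortcut for a linear smooth can be illustrated with a minimal sketch. It rests on several assumptions: it uses plain Python rather than the fda package from the exercises, the data are synthetic (not the St. Johns or melanoma data), and every function name (`fourier_basis`, `solve`, `smoother_matrix`, `ocv`) is hypothetical and chosen only for this illustration.

```python
import math
import random


def fourier_basis(t, K, period):
    """Values of the first K Fourier basis functions at t: 1, sin, cos, sin(2.), ..."""
    w = 2.0 * math.pi / period
    row = [1.0]
    j = 1
    while len(row) < K:
        row.append(math.sin(j * w * t))
        if len(row) < K:
            row.append(math.cos(j * w * t))
        j += 1
    return row


def solve(A, B):
    """Solve A X = B (B may have several columns) by Gauss-Jordan elimination."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))  # partial pivoting
        M[col], M[piv] = M[piv], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0.0:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]


def smoother_matrix(times, K, period):
    """S = Phi (Phi'Phi)^{-1} Phi' for the N x K basis matrix Phi."""
    Phi = [fourier_basis(t, K, period) for t in times]
    N = len(times)
    PtP = [[sum(r[a] * r[b] for r in Phi) for b in range(K)] for a in range(K)]
    Pt = [[Phi[i][a] for i in range(N)] for a in range(K)]  # Phi', K x N
    X = solve(PtP, Pt)                                      # (Phi'Phi)^{-1} Phi'
    return [[sum(Phi[i][a] * X[a][j] for a in range(K)) for j in range(N)]
            for i in range(N)]


def ocv(times, y, K, period):
    """OCV via the shortcut (1/N) * sum(((y_i - yhat_i) / (1 - s_ii))^2)."""
    S = smoother_matrix(times, K, period)
    N = len(y)
    total = 0.0
    for i in range(N):
        fitted = sum(S[i][j] * y[j] for j in range(N))
        total += ((y[i] - fitted) / (1.0 - S[i][i])) ** 2
    return total / N


# Synthetic example (assumption): a smooth periodic signal plus i.i.d. noise.
random.seed(0)
period = 60.0
times = [i + 0.5 for i in range(60)]
w = 2.0 * math.pi / period
y = [3.0 + math.sin(w * t) + 0.5 * math.cos(2 * w * t) + random.gauss(0.0, 0.3)
     for t in times]

# Evaluate OCV for odd numbers of basis functions and pick the minimizer.
scores = {K: ocv(times, y, K, period) for K in range(1, 16, 2)}
best_k = min(scores, key=scores.get)
print("OCV-optimal number of Fourier basis functions:", best_k)
```

The shortcut needs only one fit per candidate $K$ instead of $N$ separate leave-one-out fits, which is why it is convenient for any linear smooth. With weights, the same code would apply after replacing $\Phi'\Phi$ and $\Phi' y$ by their weighted counterparts.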