Advanced Topics in Applied Regression
Day 3: Semi-parametric models

Constantin Manuel Bosancianu
Wissenschaftszentrum Berlin, Institutions and Political Inequality unit
manuel.bosancianu@wzb.eu
October 1, 2017

Non-parametric specifications

When transformations cause more harm than good.

[Figure: scatterplots of Outcome vs. Predictor and of log(Outcome) vs. Predictor.]

Why go nonparametric ...

Linearity of specification is essentially the default for most empirical testing. Not much thought is given to the possibility of nonlinear relationships, because we don't really have developed theories about functional forms in the social sciences.

Linear in parameters: $Y_i = a + b_1 X_{1i} + b_2 X_{2i} + e_i$.

Linear in parameters, but not in variables: $Y_i = a + b_1 X_{1i} + b_2 X_{2i} + b_3 X_{2i}^2 + e_i$.

... when you have power transformations?

$$\text{Turnout}_i = a + b_1 \text{Age}_i + b_2 \text{Age}_i^2 + e_i \quad (1)$$

However, a power transformation changes the shape of the relationship globally, not only in a specific section. It is also the case that, usually, the choice of which power transformation to use is arbitrary: $X^2$, $X^3$, $\sqrt{X}$, ... Choosing based on a model fit criterion is not guaranteed to result in the proper model being selected.
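For concreteness, here is a minimal sketch of the specification in Equation 1; the simulated data, sample size, and coefficient values are invented for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
age = rng.uniform(18, 90, 500)                      # simulated ages
turnout = 10 + 2.0 * age - 0.02 * age**2 + rng.normal(0, 8, 500)

# Linear in parameters, nonlinear in the variable: Age^2 enters as a column.
X = sm.add_constant(np.column_stack([age, age**2]))
fit = sm.OLS(turnout, X).fit()
print(fit.params)                                   # estimates of a, b1, b2
```

Note that the fitted quadratic bends everywhere, which is exactly the global behavior criticized above.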
Smoothers

Advantages:
- faster, and easier to present, than neural networks, support vector machines, or tree-based methods;
- still rely on the linear regression machinery;
- the functional form of the model is not imposed on the data, but estimated from it.

However, they are considerably more computationally intensive than OLS. Additionally, they don't produce tables of results, but a graphical representation.

Local Polynomial Regression

Local polynomial regression (LPR)

A better strategy is to model directly the process which generated the data points.

$$Y_i = f(X_1, X_2, \dots) + e_i \quad (2)$$

This $f(X_1, X_2, \dots)$ could be a standard linear specification, but also a nonlinear one estimated directly from the data. All that LPR expects is that the function be smooth.[1]

[1] In "math-speak", that the first-order derivative is defined at every point of the function.

Moving average smoother

[Figure: support for Perot in 1992 (%) plotted against support for challengers in the US House (%), with a binned moving average overlaid.]

Constructing the bins

A lot rides on how the bins are constructed: too narrow and the variability of the means increases; too wide and the trend appears too smooth. A few strategies:
- bins of equal range (like above); however, some might contain little data;
- bins with equal amounts of data;
- a window that moves across X.

The last is the most frequently used: observations move in and out of the window, and are used in computing the average (a minimal sketch follows below).
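A minimal sketch of the moving-window average, assuming `x` and `y` are NumPy arrays and measuring the window in numbers of observations, as in the window-width figures that follow:

```python
import numpy as np

def moving_average_smoother(x, y, m=21):
    """Smooth y by averaging over a window of the m observations
    nearest in rank to each focal point; the window shrinks at the edges."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    half = m // 2
    fitted = np.empty(len(ys))
    for i in range(len(xs)):
        lo, hi = max(0, i - half), min(len(xs), i + half + 1)
        fitted[i] = ys[lo:hi].mean()
    return xs, fitted
```

Re-running this with m = 21, 51, and 81 reproduces the under- and oversmoothing visible in the figures below.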
Moving window process

[Figure: the Perot scatterplot, highlighting which observations are currently selected by the moving window.]

Window width

[Figures: the moving average fit recomputed with window widths of 21, 51, and 81 observations; wider windows produce smoother fits.]

Kernel smoothing

One problem with the moving average smoother is that it allocates equal weight to all cases, irrespective of how far they are from the focal point (the center of the moving window). Kernel smoothing addresses this by adding 2 extra steps to the mix:
- a "distance" measure from the center of the window;
- a weighting function, based on distance.

The average now becomes a weighted one, but nothing else changes.

Kernel smoothing

Distance measure: $z_i = \frac{x_i - x_0}{h}$, where $x_0$ is the center of the window and $h$ is its width.

The most popular weighting function is the tricube kernel:

$$K_T(z) = \begin{cases} (1 - |z|^3)^3 & \text{for } |z| < 1 \\ 0 & \text{for } |z| \ge 1 \end{cases} \quad (3)$$

You can imagine the moving average as a weighted procedure with equal weights.

[Figure: kernel smoothing of the Perot data with a bin width of 4.]
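Equation 3 and the weighted average in code: a sketch assuming `x` and `y` are NumPy arrays and that `h` is wide enough that every window contains at least one observation.

```python
import numpy as np

def tricube(z):
    """Tricube kernel of Equation 3: weights fall smoothly to zero
    at the edge of the window."""
    z = np.abs(z)
    return np.where(z < 1, (1 - z**3)**3, 0.0)

def kernel_smoother(x, y, h, grid):
    """Tricube-weighted average of y at each focal point x0 in grid."""
    fitted = np.empty(len(grid))
    for j, x0 in enumerate(grid):
        w = tricube((x - x0) / h)        # distance measure z_i, then weights
        fitted[j] = np.sum(w * y) / np.sum(w)
    return fitted
```

Making tricube() return 1 everywhere inside the window recovers the moving average as the equal-weights special case.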
LPR: benefits

Both the moving average and the kernel smoother are essentially computing averages. However, we can go beyond this and actually run a regression of Y on X. The two most famous procedures are loess and lowess (Cleveland, 1979).

$$Y_i = a + b_1 X_i + b_2 X_i^2 + \dots + b_p X_i^p + e_i \quad (4)$$

In empirical work it is very rare to see p > 3.

Polynomial specifications

[Figure: surface plot of the equation $Y = 2.5 X_1 + 1.75 X_2 + 2 X_1^2 + X_2^2$.]

LPR: implementation (I)

A window width is chosen, usually in terms of % of the data (similar to kernel smoothers). Within each bin, a WLS estimation of the polynomial specification is conducted.

$$Y_i w_i = a w_i + b_1 X_i w_i + b_2 X_i^2 w_i + \dots + b_p X_i^p w_i + e_i w_i \quad (5)$$

The $w_i$ are typically assigned with the tricube kernel used in kernel smoothing.

LPR: implementation (II)

In a second stage, a set of robustness weights is obtained from the specification in Equation 5. These are then applied to the model, for another round of estimation. Then we start again with the $w_i$, then with the robustness weights. The process stops when there is minimal change in the estimates from one iteration to another.

The use of the robustness weights is what distinguishes lowess from loess.

LPR for Perot vote

[Figures: LPR fits to the Perot data with spans of 0.30, 0.45, and 0.60; larger spans produce smoother fits.]
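Fits like the ones above can be produced with the lowess implementation in statsmodels; a sketch on simulated stand-in data, since the Perot dataset is not reproduced here. Note that statsmodels fits a local linear specification (p = 1); frac plays the role of the span, and it sets the number of robustness iterations from the second stage described above.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
x = rng.uniform(5, 35, 300)                      # stand-in for Perot vote (%)
y = 10 + 0.4 * x + 3 * np.sin(x / 4) + rng.normal(0, 2, 300)

# frac is the span; it=0 would skip the robustness-weight iterations.
smooth = lowess(y, x, frac=0.45, it=3)
x_grid, y_fit = smooth[:, 0], smooth[:, 1]       # sorted (x, fitted) pairs
```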
Local polynomials: choices

In practice, it does not matter very much whether it's loess or lowess, or what the polynomial order (p) is. The span matters:
- very low ⇒ low bias, but high variance: "undersmoothing";
- very high ⇒ high bias, but low variance: "oversmoothing".

A middle ground has to be found by the researcher, erring on the side of "undersmoothing" (Keele, 2008, p. 34).[2]

[2] The same applies to the polynomial order: better fit vs. extra parameters.

Inference

Since at each step a regression is run, inference based on SEs can easily be conducted.[3] This is usually displayed on the plot, in the form of confidence intervals around the line.

[3] The formulas are no longer nice, so I omit them here, but you can get a quick look at them in Keele (2008, pp. 39-41).

Inference for Perot vote

[Figure: LPR fit to the Perot data with a span of 0.50 and standard-error bands.]

Hypothesis testing

An even more powerful application is to test whether the nonparametric model fits the data better than the parametric linear one. We can do this because the latter model is a restricted version of the former.

$$F = \frac{(RSS_0 - RSS_1)/J}{RSS_1/df_{res}} \quad (6)$$

$df_{res} = n - p_1$, where $p_1$ is the effective number of parameters of the smoother, and $n$ is the sample size. $df_{res}$ need not be an integer in the case of smoothers.

Hypothesis testing

The other terms in the formula:
- $RSS_0$: residual sum of squares from the restricted specification;
- $RSS_1$: residual sum of squares from the nonparametric specification;
- $J = p_1 - p_0$: difference in effective number of parameters.

In this case, F = 4.74 and p < 0.001, suggesting that the nonparametric model fits the data better (a sketch of the computation follows below).
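A minimal sketch of Equation 6, assuming you have already extracted the residual sums of squares and the (effective) parameter counts from the two fitted models:

```python
from scipy import stats

def smoother_f_test(rss0, rss1, p0, p1, n):
    """F-test of Equation 6: parametric (restricted) vs. nonparametric fit.
    p0 and p1 are (effective) parameter counts and need not be integers."""
    j = p1 - p0                         # difference in effective parameters
    df_res = n - p1                     # residual df of the smoother
    f = ((rss0 - rss1) / j) / (rss1 / df_res)
    p_value = stats.f.sf(f, j, df_res)  # upper-tail probability
    return f, p_value
```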
Regression Splines

Advantages of splines

- will provide the best mean squared error fit;
- a smoothing spline is designed to prevent overfitting;
- easier to incorporate in semiparametric models than LPR.

The use of knots

Splines are essentially polynomials fitted to separate regions of the data. The "borders" between these regions are called knots (just values of X). The polynomials are fit to the separate regions, and forced to meet at the knots.

Piece-wise polynomials

[Figure: a very simple example, with a 1st-order polynomial and 2 knots, $k_1$ and $k_2$.]

The number of knots, their position, as well as the degree of the polynomial specification are chosen by the researcher. For most realistic problems, only 4-5 knots are really needed.

Let's take the situation of a single knot:

$$Y_i = a + b_1 X_i + b_2 (X_i)_+ + e_i \quad (7)$$

This $(X)_+$ is a new variable, which we obtain from X based on its position with respect to $c_1$ (the knot position):

$$(x_i)_+ = \begin{cases} x_i - c_1 & \text{for } x_i > c_1 \\ 0 & \text{for } x_i \le c_1 \end{cases} \quad (8)$$

$$Y_i = \begin{cases} a + b_1 X_i & \text{for } x_i \le c_1 \\ a + b_1 X_i + b_2 (X_i - c_1) & \text{for } x_i > c_1 \end{cases} \quad (9)$$
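Equations 7-9 in code: a sketch that builds the $(X_i)_+$ variable by hand and estimates the piece-wise fit with OLS; the knot position and the simulated data are invented for the example.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
c1 = 0.0                                  # knot position, chosen for the example
x = np.sort(rng.uniform(-20, 20, 400))
y = 2.0 * x + 5.0 * np.maximum(x - c1, 0) + rng.normal(0, 5, 400)

x_plus = np.maximum(x - c1, 0.0)          # the (X_i)+ variable of Equation 8
X = sm.add_constant(np.column_stack([x, x_plus]))
fit = sm.OLS(y, X).fit()                  # recovers a, b1, b2 of Equation 7
# b1 is the slope left of the knot; b1 + b2 is the slope to its right,
# and the two segments meet at c1 by construction.
print(fit.params)
```

For the rescaled bases discussed next, one would in practice rely on a package (e.g., patsy's bs() and cr() transforms) rather than building the columns manually.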
More complex forms

So far, I have only used linear specifications, as they make the formulas accessible. However, we can easily specify quadratic and cubic forms, in case the data patterns reveal such forms are needed.

- Knots can be added by default at the lowest and highest data points, so as to fit cubic splines in these regions as well: natural splines.
- Piece-wise functions can be rescaled, so as to avoid collinearity between $X$ and $(X)_+$: B-splines.

[Figure: a more complex piece-wise polynomial fit to the same simulated data.]

Knot placement and number

Knots are usually placed at equal intervals in the data, e.g. at quartiles or quintiles. The number of knots is the more important choice: it governs how smooth the final fit will be. Two methods:
- visual: start with 4 knots, and increase/decrease the number if the fit is too smooth/rough;
- statistical: use the AIC of the fit, and select the number of knots that produces the lowest AIC.

Splines for Perot vote

[Figure: cubic B-spline and natural spline fits to the Perot data.]

Smoothing splines

Penalized splines

One frequent accusation is that it is very easy to overfit the data with splines, as you can simply select a large number of knots. Penalized (smoothing) splines are a solution to this, as they introduce a penalty for every additional parameter estimated.[4]

In addition to the number of knots, penalized splines also ask you to specify a parameter $\lambda$, called the smoothing (tuning) parameter. The higher it is, the smoother the fit (but, also, the more biased).

[4] In the same way that the adjusted $R^2$ includes such a penalty.

[Figures: smoothing splines on the Perot data with 4 knots and with 10 knots, each with $\lambda = 0.00179$.]

Thank you for your kind attention!

References

Cleveland, W. S. (1979). Robust Locally Weighted Regression and Smoothing Scatterplots. Journal of the American Statistical Association, 74(368), 829-836.

Keele, L. (2008). Semiparametric Regression for the Social Sciences. Chichester, UK: Wiley.