LECTURE 2
Introduction to Econometrics
INTRODUCTION TO LINEAR REGRESSION ANALYSIS I.
September 27, 2016

PREVIOUS LECTURE...
Introduction, organization, review of statistical background
  random variables
  mean, variance, standard deviation
  covariance, correlation, independence
  statistical distributions
  standardized random variables

LECTURE 2.
Introduction to simple linear regression analysis
  Sampling and estimation
  OLS principle
Readings:
  Studenmund, A. H., Using Econometrics: A Practical Guide, Chapters 1, 2.1, 17.2, 17.3
  Wooldridge, J. M., Introductory Econometrics: A Modern Approach, Chapters 2.1, 2.2

WARM-UP EXERCISE
The heights of U.S. females between ages 25 and 34 are approximately normally distributed with a mean of 66 inches and a standard deviation of 2.5 inches. What fraction of the U.S. female population in this age bracket is taller than 70 inches, the height of the average adult U.S. male of this age?

SAMPLING
Population: the entire group of items that interests us
Sample: the part of the population that we actually observe
Statistical inference: use of the sample to draw conclusions about the characteristics of the population from which the sample came
Examples: medical experiments, opinion polls

RANDOM SAMPLING VS SELECTION BIAS
Correct statistical inference can be performed only on a random sample, that is, a sample that reflects the true distribution of the population
Biased sample: any sample that differs systematically from the population that it is intended to represent
Selection bias: occurs when the selection of the sample systematically excludes or underrepresents certain groups
  Example: opinion poll about tuition payments among undergraduate students vs all citizens
Self-selection bias: occurs when we examine data for a group of people who have chosen to be in that group
  Example: accident records of people who buy collision insurance

EXERCISE 2
American Express and the French tourist office sponsored a survey that
found that most visitors to France do not consider the French to be especially unfriendly. The sample consisted of 1,000 Americans who have visited France more than once for pleasure over the past two years. Is this survey unbiased?

ESTIMATION
Parameter: a true characteristic of the distribution of a variable, whose value is unknown but can be estimated
  Example: population mean $E[X]$
Estimator: a sample statistic that is used to estimate the value of the parameter
  Example: sample mean $\bar{X}_n$
Note that the estimator is a random variable (it has a probability distribution, mean, variance, ...)
Estimate: the specific value of the estimator that is obtained on a specific sample

PROPERTIES OF AN ESTIMATOR
An estimator is unbiased if the mean of its distribution is equal to the value of the parameter it is estimating
An estimator is consistent if it converges to the value of the true parameter as the sample size increases
An estimator is efficient if the variance of its sampling distribution is the smallest possible among unbiased estimators

EXERCISE 3
The Slovak Ministry of Labor and Social Affairs aimed to evaluate the impact of some of its re-qualification courses for newly unemployed workers. For this purpose, the Ministry tracked workers who lost their jobs in October 2015 and went through a three-month re-qualification program. The Ministry found that 90% of the workers who finished the course found a new job within 6 months of finishing it. The Ministry concluded that the re-qualification program was successful. Was the evaluation unbiased?

ECONOMETRIC MODELS
An econometric model is an estimable formulation of a theoretical relationship
Theory says: $Q = f(P, P_s, Y)$
  $Q$ . . . quantity demanded
  $P$ . . . commodity’s price
  $P_s$ . . . price of substitute good
  $Y$ . . .
disposable income
We simplify: $Q = \beta_0 + \beta_1 P + \beta_2 P_s + \beta_3 Y$
We estimate: $Q = 31.50 - 0.73 P + 0.11 P_s + 0.23 Y$

ECONOMETRIC MODELS
Today’s econometrics deals with different, even very general models
During the course we will cover just linear regression models
We will see how these models are estimated by
  Ordinary Least Squares (OLS)
  Generalized Least Squares (GLS)
We will perform estimation on different types of data

DATA USED IN ECONOMETRICS
cross-section: sample of units (e.g. firms, individuals) taken at a given point in time
repeated cross-section: several independent samples of units (e.g. firms, individuals) taken at different points in time
time-series: observations of variable(s) in different points in time
panel data: time series for each cross-sectional unit in the data set

DATA USED IN ECONOMETRICS - EXAMPLES
Country’s macroeconomic indicators (GDP, inflation rate, net exports, etc.) month by month
Data about firms’ employees or financial indicators as of the end of the year
Records of bank clients who were given a loan
Annual social security or tax records of individual workers

STEPS OF AN ECONOMETRIC ANALYSIS
1. Formulation of an economic model (rigorous or intuitive)
2. Formulation of an econometric model based on the economic model
3. Collection of data
4. Estimation of the econometric model
5. Interpretation of results

EXAMPLE - ECONOMIC MODEL
Denote:
  $p$ . . . price of the good
  $c$ . . . firm’s average cost per one unit of output
  $q(p)$ . . .
demand for firm’s output
Firm profit: $\pi = q(p) \cdot (p - c)$
Demand for good: $q(p) = a - b \cdot p$
Derive (by maximizing profit with respect to $p$): $q = \frac{a}{2} - \frac{b}{2} \cdot c$
We call $q$ the dependent variable and $c$ the explanatory variable

EXAMPLE - ECONOMETRIC MODEL
Write the relationship in a simple linear form
  $q = \beta_0 + \beta_1 c$
  (keeping in mind that $\beta_0 = \frac{a}{2}$ and $\beta_1 = -\frac{b}{2}$)
There are other (unpredictable) things that influence firms’ sales ⇒ add a disturbance term
  $q = \beta_0 + \beta_1 c + \varepsilon$
Find the values of the parameters $\beta_1$ (slope) and $\beta_0$ (intercept)

EXAMPLE - DATA
Ideally: investigate all firms in the economy
Really: investigate a sample of firms
We need a random (unbiased) sample of firms
Collect data:

  Firm   1    2    3    4    5    6
  q      15   32   52   14   37   27
  c      294  247  153  350  173  218

[Figure: output plotted against average cost]

EXAMPLE - ESTIMATION
[Figure: output plotted against average cost]
OLS method:
  Make the fit as good as possible
  ⇓
  Make the misfit as low as possible
  ⇓
  Minimize the (vertical) distance between data points and regression line
  ⇓
  Minimize the sum of squared deviations

TERMINOLOGY
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ . . . regression line
$y_i$ . . . dependent/explained variable (i-th observation)
$x_i$ . . . independent/explanatory variable (i-th observation)
$\varepsilon_i$ . . . random error term/disturbance (of i-th observation)
$\beta_0$ . . . intercept parameter ($\hat{\beta}_0$ . . . estimate of this parameter)
$\beta_1$ . . . slope parameter ($\hat{\beta}_1$ . . .
estimate of this parameter)

ORDINARY LEAST SQUARES
OLS = fitting the regression line by minimizing the sum of squared vertical distances between the regression line and the observed points
[Figure: output plotted against average cost]

ORDINARY LEAST SQUARES - PRINCIPLE
Take the squared differences between observed point $y_i$ and regression line $\beta_0 + \beta_1 x_i$:
  $(y_i - \beta_0 - \beta_1 x_i)^2$
Sum them over all $n$ observations:
  $\sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
Find $\hat{\beta}_0$ and $\hat{\beta}_1$ such that they minimize this sum:
  $\hat{\beta}_0, \hat{\beta}_1 = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$

ORDINARY LEAST SQUARES - DERIVATION
  $\hat{\beta}_0, \hat{\beta}_1 = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
FOC:
  $\frac{\partial}{\partial \beta_0}: \quad -2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$
  $\frac{\partial}{\partial \beta_1}: \quad -2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$
We express (in the lecture):
  $\hat{\beta}_0 = \bar{y}_n - \hat{\beta}_1 \bar{x}_n$
  $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}$

RESIDUAL
The residual is the vertical difference between an observed point and the estimated regression line
That is, it is the difference between the observed value $y_i$ and the estimated value $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
We define: $e_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$
OLS minimizes the sum of squares of all residuals
The residual $e_i$ (observed) is not the same as the disturbance $\varepsilon_i$ (unobserved)!
The residual is an estimate of the disturbance: $e_i = \hat{\varepsilon}_i$

RESIDUAL VS.
DISTURBANCE
[Figure: output plotted against average cost, with the true relationship, the estimated relationship, a disturbance, and a residual marked]

GETTING BACK TO THE EXAMPLE
We have the economic model $q = \frac{a}{2} - \frac{b}{2} \cdot c$
We estimate $q_i = \beta_0 + \beta_1 c_i + \varepsilon_i$
(having in mind that $\beta_0 = \frac{a}{2}$ and $\beta_1 = -\frac{b}{2}$)
Over data:

  Firm   1    2    3    4    5    6
  q      15   32   52   14   37   27
  c      294  247  153  350  173  218

GETTING BACK TO THE EXAMPLE
When we plug in the formula:
  $\hat{\beta}_1 = \frac{\sum_{i=1}^{6} (c_i - \bar{c})(q_i - \bar{q})}{\sum_{i=1}^{6} (c_i - \bar{c})^2} = -0.177$
  $\hat{\beta}_0 = \bar{q} - \hat{\beta}_1 \bar{c} = 71.74$
The estimated equation is $\hat{q} = 71.74 - 0.177 c$
and so $\hat{a} = 2\hat{\beta}_0 = 143.48$ and $\hat{b} = -2\hat{\beta}_1 = 0.354$

MEANING OF REGRESSION COEFFICIENT
Consider the model $q = \beta_0 + \beta_1 c$ estimated as $\hat{q} = 71.74 - 0.177 c$
  $q$ . . . demand for firm’s output
  $c$ . . . firm’s average cost per unit of output
Meaning of $\beta_1$ is the impact of a one unit increase in $c$ on the dependent variable $q$
When average costs increase by 1 unit, quantity demanded decreases by 0.177 units

BEHIND THE ERROR TERM
The stochastic error term must be present in a regression equation because of:
1. omission of many minor influences (unavailable data)
2. measurement error
3. possibly incorrect functional form
4. stochastic character of unpredictable human behavior
Remember that all of these factors are included in the error term and may alter its properties
The properties of the error term determine the properties of the estimates

SUMMARY
We have learned that an econometric analysis consists of
1. definition of the model
2. estimation
3.
interpretation
We have explained the principle of OLS: minimizing the sum of squared differences between the observations and the regression line
We have derived the formulas of the estimates:
  $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x}_n)(y_i - \bar{y}_n)}{\sum_{i=1}^{n} (x_i - \bar{x}_n)^2}$
  $\hat{\beta}_0 = \bar{y}_n - \hat{\beta}_1 \bar{x}_n$

WHAT’S NEXT
In the next lectures, we will
  derive estimation formulas for multivariate models
  specify properties of the OLS estimator
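
APPENDIX - NUMERICAL CHECK OF THE EXAMPLE
The OLS slope and intercept formulas from this lecture can be checked numerically on the six-firm example (q as the dependent variable, c as the explanatory variable). The sketch below is only an illustration in plain Python, not part of the original slides; the function name ols_simple and the variable names are assumptions chosen for this example. It also checks the two first-order conditions: the residuals sum to zero and are orthogonal to the explanatory variable (up to floating-point rounding).

```python
# Illustrative sketch (not from the slides): simple OLS on the six-firm example.
# The name ols_simple is hypothetical, chosen only for this illustration.

def ols_simple(x, y):
    """Return (beta0_hat, beta1_hat) for the model y = beta0 + beta1 * x + eps."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Numerator and denominator of the slope formula
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Data from the lecture example: output q and average cost c of six firms
q = [15, 32, 52, 14, 37, 27]
c = [294, 247, 153, 350, 173, 218]

beta0_hat, beta1_hat = ols_simple(c, q)

# Residuals e_i = q_i - beta0_hat - beta1_hat * c_i
residuals = [qi - beta0_hat - beta1_hat * ci for qi, ci in zip(q, c)]

# First-order conditions: sum of e_i and sum of c_i * e_i are both (numerically) zero
foc_intercept = sum(residuals)
foc_slope = sum(ci * ei for ci, ei in zip(c, residuals))

print(f"beta0_hat = {beta0_hat:.2f}")
print(f"beta1_hat = {beta1_hat:.3f}")
print("first-order conditions hold:", abs(foc_intercept) < 1e-9 and abs(foc_slope) < 1e-6)
```

Rounded to the precision used in the lecture, the estimates agree with the reported values of 71.74 for the intercept and −0.177 for the slope.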