Econometrics
Introduction to Linear Regression Analysis
Anna Donina
Lecture 2

PREVIOUS LECTURE...
Introduction, organization, review of statistical background
▪ random variables
▪ mean, variance, standard deviation
▪ covariance, correlation, independence
▪ normal distribution
▪ standardized random variables

WARM-UP EXERCISE
► What is the correlation between $X$ and $Y$?
► Correlation: $Corr(X, Y) = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}$
► Covariance: $Cov(X, Y) = E[(X - E[X])(Y - E[Y])] = E[XY] - E[X]E[Y]$

LECTURE 2
Introduction to simple linear regression analysis
• Sampling and estimation
• OLS principle
Readings:
Studenmund, A. H., Using Econometrics: A Practical Guide, Chapters 1, 2.1, 16.1, 16.2
Wooldridge, J. M., Introductory Econometrics: A Modern Approach, Chapters 2.1, 2.2

SAMPLING
• Population: the entire group of items that interests us
• Sample: the part of the population that we actually observe
• Statistical inference: use of the sample to draw conclusions about the characteristics of the population from which the sample came
Examples: medical experiments, opinion polls

RANDOM SAMPLING VS SELECTION BIAS
Correct statistical inference can be performed only on a random sample, that is, a sample that reflects the true distribution of the population
Biased sample: any sample that differs systematically from the population that it is intended to represent
Selection bias: occurs when the selection of the sample systematically excludes or underrepresents certain groups
Example: opinion poll about tuition payments among undergraduate students vs all citizens
Self-selection bias: occurs when we examine data for a group of people who have chosen to be in that group
Example: accident records of people who buy collision insurance

EXERCISE 1
• American Express and the French tourist office sponsored a survey that found that most visitors to France do not consider the French to be especially unfriendly.
• The sample consisted of 1,000 Americans who have visited France more than once for pleasure over the past two years.
• Is this survey unbiased?

ESTIMATION
Parameter: a true characteristic of the distribution of a variable, whose value is unknown but can be estimated
Example: population mean $E[X]$
Estimator: a sample statistic that is used to estimate the value of the parameter
Example: sample mean $\bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$
Note that the estimator is a random variable (it has a probability distribution, mean, variance, ...)
Estimate: the specific value of the estimator that is obtained on a specific sample

PROPERTIES OF AN ESTIMATOR
• An estimator is unbiased if the mean of its distribution is equal to the value of the parameter it is estimating
• An estimator is consistent if it converges to the value of the true parameter as the sample size increases
• An estimator is efficient if the variance of its sampling distribution is the smallest possible

EXERCISE 2
• A young econometrician wants to estimate the relationship between foreign direct investments (FDI) in her country and firm profitability.
• Her reasoning is that better managerial skills introduced by foreign owners increase firms' profitability.
• She collects a random sample of 8,750 firms and finds that one sixth of the firms were entered by foreign investors within the last few years. The rest of the firms are owned domestically.
• When she compares indicators of profitability, such as ROA and ROE, between the domestic and foreign-owned firms, she finds significantly better outcomes for foreign-owned firms.
• She concludes that FDI increases firms' profitability. Is this conclusion correct?

ECONOMETRIC MODELS
An econometric model is an estimable formulation of a theoretical relationship
Theory says: $Q = f(P, P_s, Y)$
$Q$ ... quantity demanded
$P$ ... commodity's price
$P_s$ ... price of substitute good
$Y$ ... disposable income
We simplify: $Q = \beta_0 + \beta_1 P + \beta_2 P_s + \beta_3 Y$
We estimate: $Q = 31.50 - 0.73P + 0.11P_s + 0.23Y$

ECONOMETRIC MODELS
• Today's econometrics deals with different, even very general models
• During this course we will cover linear regression models
• We will see how these models are estimated by
  ▪ Ordinary Least Squares (OLS)
  ▪ Generalized Least Squares (GLS)
  ▪ Instrumental Variables (IV)
• We will perform estimation on different types of data

DATA USED IN ECONOMETRICS
cross-section: sample of units (e.g. firms, individuals) taken at a given point in time
repeated cross-section: several independent samples of units (e.g. firms, individuals) taken at different points in time
time-series: observations of variable(s) at different points in time (e.g. GDP)
panel data: time series for each cross-sectional unit in the data set (e.g. GDP of various countries)

DATA USED IN ECONOMETRICS - EXAMPLES
• Country's macroeconomic indicators (GDP, inflation rate, net exports, etc.) month by month
• Data about firms' employees or financial indicators as of the end of the year
• Records of bank clients who were given a loan
• Annual social security or tax records of individual workers

STEPS OF AN ECONOMETRIC ANALYSIS
1. Formulation of an economic model (rigorous or intuitive)
2. Formulation of an econometric model based on the economic model
3. Collection of data
4. Estimation of the econometric model
5. Interpretation of results

EXAMPLE - ECONOMIC MODEL
• Denote:
  $p$ ... price of the good
  $c$ ... firm's average cost per unit of output
  $q(p)$ ... demand for firm's output
  Demand for good: $q(p) = a - b \cdot p$
  Firm profit: $\pi = q(p) \cdot (p - c)$
• Derive (maximize profit with respect to $p$, which gives $p = \frac{a + bc}{2b}$, and substitute into the demand function): $q = \frac{a}{2} - \frac{b}{2} \cdot c$
• We call $q$ the dependent variable and $c$ the explanatory variable

EXAMPLE - ECONOMETRIC MODEL
• Write the relationship in a simple linear form: $q = \beta_0 + \beta_1 c$ (keeping in mind that $\beta_0 = \frac{a}{2}$ and $\beta_1 = -\frac{b}{2}$)
• There are other (unpredictable) things that influence firms' sales ⇒ add a disturbance term: $q = \beta_0 + \beta_1 c + \varepsilon$
• Find the values of the parameters $\beta_1$ (slope) and $\beta_0$ (intercept)

EXAMPLE - DATA
• Ideally: investigate all firms in the economy
• Reality: investigate a sample of firms
  We need a random (unbiased) sample of firms
• Collect data:
  Firm    1    2    3    4    5    6
  q      15   32   52   14   37   27
  c     294  247  153  350  173  218

EXAMPLE - DATA
[Figure: output plotted against average cost]

EXAMPLE - ESTIMATION
[Figure: output plotted against average cost]
OLS method:
Make the fit as good as possible
⇓
Make the misfit as low as possible
Minimize the (vertical) distance between data points and regression line
Minimize the sum of squared deviations

TERMINOLOGY
$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ ... regression line
$y_i$ ... dependent/explained variable (i-th observation)
$x_i$ ... independent/explanatory variable (i-th observation)
$\varepsilon_i$ ... random error term/disturbance (of i-th observation)
$\beta_0$ ... intercept parameter ($\hat{\beta}_0$ ... estimate of this parameter)
$\beta_1$ ... slope parameter ($\hat{\beta}_1$ ... estimate of this parameter)

ORDINARY LEAST SQUARES
• OLS = fitting the regression line by minimizing the sum of squared vertical distances between the regression line and the observed points
[Figure: output plotted against average cost]

ORDINARY LEAST SQUARES - PRINCIPLE
• Take the squared differences between observed point $y_i$ and regression line $\beta_0 + \beta_1 x_i$: $\varepsilon_i^2 = (y_i - \beta_0 - \beta_1 x_i)^2$
• Sum them over all $n$ observations: $\sum_{i=1}^{n} \varepsilon_i^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$
• Find $\hat{\beta}_0$ and $\hat{\beta}_1$ such that they minimize this sum

ORDINARY LEAST SQUARES - DERIVATION
• Set the partial derivatives of the sum with respect to $\beta_0$ and $\beta_1$ equal to zero:
  $-2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$
  $-2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$
• Solve the two equations:
  $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$
  $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

RESIDUAL
• Residual is the vertical difference between the estimated regression line and the observation points
• OLS minimizes the sum of squares of all residuals
• It is the difference between the true value $y_i$ and the estimated value $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$
• We define: $e_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$
• Residual $e_i$ (observed) is not the same as the disturbance $\varepsilon_i$ (unobserved)!
• Residual is an estimate of the disturbance: $e_i = \hat{\varepsilon}_i$

RESIDUAL VS. DISTURBANCE
[Figure: output plotted against average cost, showing the true relationship, the estimated relationship, a disturbance and a residual]

GETTING BACK TO THE EXAMPLE
• We have the economic model: $q = \frac{a}{2} - \frac{b}{2} \cdot c$
• We estimate: $q_i = \beta_0 + \beta_1 c_i + \varepsilon_i$ (keeping in mind that $\beta_0 = \frac{a}{2}$ and $\beta_1 = -\frac{b}{2}$)
• Over data:
  Firm    1    2    3    4    5    6
  q      15   32   52   14   37   27
  c     294  247  153  350  173  218
• When we plug the data into the formula: $\hat{\beta}_1 = -0.177$, so $\hat{q} = 71.74 - 0.177c$ and $\hat{b} = -2\hat{\beta}_1 = 0.353$

MEANING OF REGRESSION COEFFICIENT
• Consider the model $q = \beta_0 + \beta_1 c$ estimated as $\hat{q} = 71.74 - 0.177c$
  $q$ ... demand for firm's output
  $c$ ... firm's average cost per unit of output
• The meaning of $\beta_1$ is the impact of a one-unit increase in $c$ on the dependent variable $q$
• When average costs increase by 1 unit, quantity demanded decreases by 0.177 units

BEHIND THE ERROR TERM
• The stochastic error term must be present in a regression equation because of:
  1. omission of many minor influences (unavailable data)
  2. measurement error
  3. possibly incorrect functional form
  4. stochastic character of unpredictable human behavior
• Remember that all of these factors are included in the error term and may alter its properties
• The properties of the error term determine the properties of the estimates

SUMMARY
• We have learned that an econometric analysis consists of
  1. definition of the model
  2. estimation
  3. interpretation
• We have explained the principle of OLS: minimizing the sum of squared differences between the observations and the regression line
• We have derived the formulas of the estimates: $\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$

WHAT'S NEXT
In the next lectures, we will
▪ derive estimation formulas for multivariate models
▪ specify properties of the OLS estimator
▪ start using Gretl for data description and estimation
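
SKETCH - REPRODUCING THE EXAMPLE ESTIMATES
The estimates for the six-firm example can be checked with a few lines of Python. This is a minimal sketch and not part of the lecture (the course itself uses Gretl). It assumes the standard closed-form OLS solution for a single explanatory variable: the slope is the sum of cross-deviations of c and q divided by the sum of squared deviations of c, and the intercept is the mean of q minus the slope times the mean of c. The function name ols_simple and all variable names are hypothetical.

```python
def ols_simple(x, y):
    """Return (intercept, slope) of the line minimizing the sum of squared residuals.

    Assumes the textbook closed-form solution for one explanatory variable.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of cross-deviations and sum of squared deviations of x
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Data from the example: output q and average cost c of six firms
q = [15, 32, 52, 14, 37, 27]
c = [294, 247, 153, 350, 173, 218]

beta0_hat, beta1_hat = ols_simple(c, q)

# Residuals: observed q minus fitted q (estimates of the unobserved disturbances)
residuals = [qi - (beta0_hat + beta1_hat * ci) for qi, ci in zip(q, c)]
ssr = sum(e ** 2 for e in residuals)  # the quantity that OLS minimizes

# Parameters of the economic model, using beta0 = a/2 and beta1 = -b/2
a_hat = 2 * beta0_hat
b_hat = -2 * beta1_hat

print(f"intercept: {beta0_hat:.2f}")  # intercept: 71.74
print(f"slope:     {beta1_hat:.3f}")  # slope:     -0.177
print(f"a_hat:     {a_hat:.2f}")      # a_hat:     143.49
print(f"b_hat:     {b_hat:.3f}")      # b_hat:     0.353
```

The residuals sum to zero up to floating-point error, which is a consequence of including an intercept in the model.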