Regression
'"X
\
In this chapter we tackle how to conduct regression analyses in Stata. We concentrate on ordinary least squares (OI.S) regression, which requires a reasonably normally distributed, interval level dependent variable, and logistic regression, which requires a dichotomous or binary dependent variable. We briefly introduce commands for multinomial logistic and ordered logistic regression models, the parallel family of commands for binary, multinomial and ordered probit, and Poisson or negative binomial models for a count dependent variable (see Box 8.1).
Among all of this wc also look at the characteristics and effects of the independent variables. The majority of these techniques can be used on independent variables in any of the regression models mentioned above with more or less ease of interpretation! We will also discuss ways of dealing with categorical independent variables, non-linear associations, interaction effects, as well as regression diagnostics.
ORDINARY LEAST SQUARES REGRESSION
To carry out a hivanate regression use the regress (or rag) command, immediately followed by the Y (dependent) variable and the X (independent) variable.
regress Y X
I here are a number of variables in the example data set that are suitable for regression. In this example we will use monthly income {fimn) as our dependent variable, but considering onlv those in paid employment. We could use an if statement so that our regressions are only done where jbntat»-2. Another way would be use a ka«p command so that only those people in employment
Ordinary least squares regression 281
Box Ž. >" Ciiarscterisíšes ot dependent variable* and r&sisss'^fi models
\
Ordinary feast squares SOLS)
No   Y es
No   Yes Don't krow
Binary logiíiťr, logif or probit.
»Ailtinoinwncg:stie. logit or probit
Ordered logistic, logit or probit
Strongíy - -> Strongly agree disagree
P ilsson or nec,aliv«
binomial
2 3
are kept- ln rhe active data set: keep if jbstat==2. We will use the latter.
Ir is usual for income distributions to be positively skewed (or skewed to the right;, in which case a natural logarithm transformation usually helps to bring the distribution closer to normality. As
282 Regression
we have done before in Chapter 5, we can check the skewness of the original income variable and the transformed variable:
C*n In lncin(fIblq)
tabstat fiiian ln_inc,  s(sk fair)
.   gen irj__inc-ln ( f iron) .   tabstat   fiirtri   ln_inc,s(sk kur) stats   I fimn lr._.inc
skewness | 2.473931 -.5381867 kurtosis   |      IS.37824 3.537047
We can see from the output that the skewness and kurtosis of the variable has been considerably reduced and brought much closer to normality by the transformation. We will now use the Injnc variable as our dependent variable.
First, we will use a bivariate regression to see if age is a significant determinant of income.
Number        cbs - 4973
r; i, 4871)     :=   ai.ab
Prol:  > ? =  0 . 0000
Ad: R-3qaart7d = 0.00BS Rnot MSE =   . C-8bb3
l:l_irK-   |        Coel.   St'd .   Err. t    P>|tj     r 3b % "orf.   In!   l   i 1
ag~:   |   .0002940     .0007396      6.71    0.000 .000746S ,_cunr.   |   O.Or/S-?     .0313SfP  ')C1.84    it. 000    6.4S61GS        i  r"c<(   M I
The regression output is relatively concise; check the equi« dent 1 output in SPSS if you don't believe us. In the upper right-hand side is the information concerning the number of observations used j in the model as the regress command uses listwise deletion (i.e. *• only cases with non-missing values on all the variables in the model will be included) and model 'fir" statistics. The most commonly i used 'fit' statistic in OLS regression is R1 (R-squared on the output).
Mocel [ 21.128328b 1 21.1283283 FUi.-ji.dra]   |   2330 .13006 197     .469052o38
7'otal   I   2107.26359 49'?
Ordinary least squares regression 283
This indicates the amount of variance in the dependent variable explained by the independent variables; the higher the value, the more explanatory power the mode! has generally. Also in the upper right-hand corner are an f statistic and its associated p value, which becomes more useful when you are working with nested models or adding blocks of independent variables. At this stage, we suggest you do not to concern yourself with the adjusted R1 and mean squared error statistics (Adj R-squared and Root MSB respectively on the output}. The upper kit-hand side of the output presents the sums of squares derails as you would get from ANOVA. The lower panel of the output shows the regression coefficients, standard errors, l values, p values and 95% confidence intervals of those coefficients in each row. The bottom row starting with . .cons is the intercept or constant for the model.
We see that even though the coefficient for age is 0.005 and is significant {t = 6.71 and p = 0.000) this isn't a very good model (btvanate models often arc- not), as the PJ value is very low at 0.009 -less than 1 %. It may be that the association between the dependent variable and the independent variable is not linear. Previous research nifonris ns that age often has a curvilinear relationship with income in that income initially increases with age and then decreases. We can check if this is the case m these data with a scatterplot with age as the X variable and Injnc as the Y variable.
scatter  In ine age
2 J______r____________
20 40 60 80
age
284 Regression
Capturing this type of non-linear association often requires the addition of a squared term of the independent variable.
gen agasq=age*age
or
gen agesq=ageA2
If we rerun our regression with the transformed income variable and the age variable plus its square we see that our model tits much better; Rz is now 0.067 (6.7%). The significance of me age terms {age and agesq) tells us that we were correct to assume a curvilinear relationship.
regress In inc age agesq
. One agfe ^.ge^/q
nuju^er of obs - -1570
F(  2,   4 970! - 1«0.0?
Prob > P - 0.0000
Adj K-yquared - 0.0070 O.ouc MSI". -   . 06000
ln_. 'n.-   i        Cost,  sou.  Elf . i.    p> I •■ I     [900 Cent,  Inr.erva 1 j
,7ga 1 , 0040960 .004000.7 :c..59 0.000 .0700:149 .0900774 ayr:3rr j .0009009 .0000500 -17,67 0.00O .001_C97 -.0008S8 _c:rnn   ;   j. 110800     .OSOOOl'-'    00.1?    0.00C      4.94/090 0.080612
We can now add some additional predictors of income to our model. First we add a variable for gender. The current variable sex is coded 1 -- male and 2 = female. We can either recode this to a dummy variable (0,1) for either males or females or use the xi command (see also Box 8.2). If we prefix the regress command with xi: and putting an i. in front of our categorical variables of interest, Srata automatically converts them to dummy variables in our regression equation, xi expands terms containing categorical variables into indicator (also called dummy) variable sets by creating new variables and then executes the specified command with the expanded terms.
Model   I     109.230417 2 70.016700/
Hidu^l   I     21.90.00047    49/0     . 44? '  0 09:'
.20000 4970
Ordinary least squares regression 285
sjj sregress Xn__inc age ag^sq i.sex
:        ..Li  inc. c.ce acf^oq _. sex
..sax 1-2 (natura__y cod^d;   _Isex_I ornitx^d)
ss    a: ms		Number of oh?	4973
.....- i ----- ------- ---------		; 3 . iy63;	= 641.84
I - 1  1   078.344919       J 219.44830«		Prcb > P	= G.0000
^1       i n98.L'1897  49c9 .14190355b			= 0.2793
----- ----- ----------		Actj  R- squared	= 0.2^83
V77.2 = 389  4972   . 4 741C7 ;81		Rcct USL	=   . b 'c 4 " 3
n„ .ac   |         Coef.   Sid.   Sit . t	P:	-11 1     iy5% Oonf.	Interval]
J905245     .0039BS6 22.71	0 .	. 000       . 0827039	.09834
n r„u   1-  0010799     .0000497 -21.71	0 .	. UjO    - . 0011774	- . 0009824
;3429S3       ,0:)0999 -38.23	0	.000 -.6567463	- . 601664
,22339     .0749394 70.97		.000 5.1.'a327	5 . 4693^2
N.ik that at the bottom of the variable list a new variable [JtfiJl luis appeared. This is the dummy variable automatically arated b> Stata for the original variable sex. The output line rnmeui.itely after the command shows how Stata has created dummy or indicator variables our of the sex variable:
i     _     _I._-?x_l 2    tna toiral ty coded; _.7seK_l omitted)
I'hb line shows at the left that the variable sex was indicated with an i. pnliv in the command line. The next part (_Isex_l-2) shows that indicator variables have been created        and that
" thi sex iambic has categories valued from 1 to 2 (_l-2). It then tells vou on the right that the category with the value 1 is the oinit'rd (or reference) category. By default, the dummy-variable vr i- tdiaufied by dropping the dummy corresponding to the omall.^t "jiue of the variable.
So in this case the indicator variable created by Stata is lor fennles (as females are sex=2) compared to males. The negative .-'•efficient fur the variable _hex_2 shows the mean difference for iioinen compared 10 men in logged income. In other words, i 'i -men on average earn less than men after controlling (adjusting)
j for agr We can also see from the output that the R2 value has »«.io.>-».d to 0.2~9 (27.9%) from 6.7% in the model with just age ind ago; as independent variables. This indicates that age and
J cj. explain nearly .28% of the variation in logged income for '.hose in employment.
286 Regression
Next we enter marital status (mastat] as an independent variable, also using the i . prefix in the following command:
xi:res In inc age agesg i.sex i.mastat
.  xi:reg ii,  inc aye Sires::   i .sex i .Tiasi-ao
i.sex _Isex...1-2 insd-l-rally  coded;  _,sex_l omitted;
i,ma.,tat.       _2.~ia9Lr-.:_l-i:        <: adt.ur nlixy coded:  _ i mastar._l. omjrtcd)
574100  J 3,5     dl .9.2 Kuinii-ei  ot ohs - ^973
---- .-----+ ---------------------- ----- p .   3>   49 64 i =   o d ti . 26
Kod.-'l       686.879702        8  85.8599628 Proio >  F =  C . OOli'C
Ko.scd..a3   |   267C.3811?  iilSl   .336499034 R-squarod = 0.291«
-------------,--------------------- ------. A,o:   R-sqaared = 0.2902
Toi..1?l      2357.26389  4372   .47211.7782 iiootKSS = .980(19
]4_2no  | Coat-   Ktc.   Eir. I:   P.-l: j   ;9 = 8 Conf.   Interval j
ago j .9939.282 .0C4o725 20.54 0.200 ,084964-j .102=919
aqeoq | -.0011177 .09003148 -20.40 0.000 -.0012253- -.0010103
Jssa 2 1 -.1-49.9603 .0166141 -29.06 9.000 ,6821273 -.016.3893
..Imastat. 2 : .2093674 .0314092 6.05 0.000 .147672 ,7710608
2i:,9,3Cat_3 j .3.10369 .27079.l3 4.38 8.000 171.586'' .4421823
. lmastaL_4 | -1837"i69 .0411 1 3? 4.23 0.000 .OCPdlOb .2.693131
..J mastaC .5 I .1 691297 ,0628323 2. el 9 0.007 .84.59506 .2923088
l-3astat_..6 j .023,-757 .026573-4 0,91 0.202 -.02782 /9 .0729792
cons 1 5.222049 .0934902 53.88 0.000 -.058056    5.425328
In this example, the reference category for marital status (mastat) is 'married' as thai category has the lowest value (1). Remember that tlie coefficients for the other dummy variables are all compared to the 'married' category. So, for example, category 2 'living as a couple' has a significant coefficient of 0.209 which indicates that those living as a couple, on average, earn more than those who arc married, after controlling for age and sex.
if you want your reference categories to be something else, you can change them with the char command (short for 'characteristics'). It wc wanted to make 'never married' the reference category, then we would use:
char mastat[emit] 6
as the 'never married' category has a value of 6. Now we can rerun the regression command. To restore to the default reference categories type use;
char mastat[omit]
Ordinary least squares regression 287
Box 8,2; Using the xt command for Interaction*
Tn& xi command also allows us to- do .nte-or^ons easily. The 1. var syntax isinterpreteo as follows,
•    i.varl creates dummies for catego art,
\ ,vsr l.'i .vari erea:;;: ocv-i.nies :c_ oatwgoJ'cal variable; rar? and var2: mam effects and inter i
i.varl*vax3 creates dummies ^or cateyo- cal variable raff and included continuous variable vs'3: =!i infractions and main effects.
i ■ ■'-•*rl j---ar.'i cnateo o\':t!'t»sfor oa'ogo-icai var able van and includes continuous variable m& »'! hteraotions arid main effect of var3. but not main affect or
We can also use the if and bysort commands with regress.
For example, if you were interested in running different regression models for each sex then you could use the if command;
xi:reg In inc age agesg i.mastat if eax-=l xitreg lia i«e age agesq ±«!ft&fctat if sejs-«2
If you put one or more instances of i . in your command you must put the xi: first then the bysort command:
ici : bysort sexs reg ln__inc age agesq i-mastat
There are some commands that cannot he combined with by and/or bysort. 11 you try to combine them, Stata will give you an error message to this effect.
You could include the indicator or dummy variables by making them using the tab command with the gen option. For example, using muskil-l as the reference category:
tab niasta.t» gendristat)
bysorfe ftflx:  reg lu inc age agesq mstafc2-mstat6
Another slight tweak to the process would be to generate the dummy variables using the tab command but then to drop the reference category variable and use an * for all the dummy variables.
288 Regression
Putting an asterisk alter the common part oi the variable name tells Stata to include ali variables that start with that common part; so, mstat* will include all variables that start with tnstat. * is the Stata wildcard notation. So the commands would be:
tab mastat, gen(mstat) drop mstat 1
bysort sex:  rag In inc age agesq mstat*
As the tab command creates dummy variables for every category of the mastat variable, if we did not drop the reference category variable using the * wildcard we would put all dummy variables into the regression. Stata will produce results but it will decide which one of the dummy variables to drop and you lose control over the reference category.
Two of the common options lor use with regress are:
• beta, which requests that normalized beta coefficients be reported instead ot confidence intervals;
• level(#), which specifies the confidence level, as a percentage, for confidence intervals of the coefficients.
Regression diagnostics
Stata comes with a series of graphs to help assess whether or not your regression models meet some of the assumptions of linear regression. Using the pull-down menu, these are found at
Graphics    Regression diagnostic plots
Before going on to the diagnostics, we will briefly discuss regression assumptions, fuller discussions are available in most statistical text books, but we suggest reading Berk (2003) for a general critique of the regression method and its common abuses, while Belsley et al. (2004), Fox (1991) and Pedhazur (1997) are good texts for the assumptions and diagnostics (see also Box 8.3).
The main assumptions of OLS regression are as follows:
1. The independent variables are measured without error.
2. The model is properly specified so that it includes all relevant variables and excludes irrelevant variables.
3. The associations between the independent variables and the dependent variable are linear.
Ordinary least squares regression 2S9
Box 8.3: Errors and ERRORS ■ •
Ons of the things that sf uck in our minds as students was a' short section m Pedhazur (1997: 8! titled 'There are errors afd there a,e ERRORS' in which he encourages researchers,to find the balance betweer failing to meet the assumptions of statistical techniques (or not oaring if they are met or not), and the debilrtat-tng quest for statistical perfection jn real-world research and d$ta.
Some of the assumptions are testable; others are not and have to be justified by logic and argument. Therefore', no matter -how many staiisticaf/diagnostic tests you run there will stiff be a ' possibility theft you have violated,one of the many assumptions. So. to avoid the patafysisof perfection we encourage you to adopt Pedhazur's approach and balance your investigations with some pragmatism: is it an error or an ERROR?
... understanding when violations of assumptions lead .to ' serious biases, and when they are o1 little consequence, are essential to meaningful data analysis.
(Pedhazur 1997:33)
4. The errors are normally distributed. JBrrors are the difference between predicted and actual values for each case. Predicted values are also called fitted values. Knots are also called residuals or disturbances.
5. The variance of the errors is constant; usually referred to as homoscedasticity. If the errors do not have constant variance they are hereroscedastic.
6. The errors of one observation are not correlated with the errors ot any other observation.
7. The errors are not correlated with any ot the independent variables.
Thin there are a number of what we call 'technical' issues that you need to check:
8. Strange cases or outliers: these may be from coding errors or may be truly different in which case you may need to examine them further in detail.
9. Leverage and influence: to determine if any of the cases have undue leverage or power on the regression line.
290 Regression
10. Mukicohineamy: it the independent variables are highly correlated with one another this may affect the regression estimates,
The first assumption is extremely difficult to meet, it not impossible in social research. .Measurement error in tne independent variables usually results in underestimating the effects, and the extent of the underestimation has been shown to be linked to the reliability of the measure (Pedhazur 1997), We are guilty of violating this assumption ourselves in die examples in this book. Is the GHQ a completely valid and reliable measure of mental well-being5 Not at all, but our models do not take that into account. If you are interested in combining measurement and effect models we suggest yon delve into structural equation modelling. It's worth noting that measurement error in the dependent variable does not bias the estimates but does inflate their standard errors, which then gives a higher p value and so a weakened test of significance.
The second assumption, model specification, has to be addressed theoretically, practically, as well as statistically. In developing models to test, the theory needs to be complete, and testable, for the model to be correctly specified. Practical issues such as data availability may also hinder you in specifying a correct model. There arc commands in Stata that test whether you have omitted relevant variables. They don't tell you what they arc! Nor do they tell you if you have included irrelevant variables. We cover die linktest and ovtest commands as we go through oris example.
The third assumption of linearity is a variation on the second assumption, and we have already discussed ways oi dealing with non-linear associations. There are tests for non-llneanty but we suggest that these are largely unnecessary it you conduct in-depth univariate and bivariate data analysis before moving on to multi-variate analysis.
The distribution of the errors/residuals can be easily attended to after a regression command and the distribution can be visually inspected in graphs and then formally rested using the normality tests covered in Chapter 5- We look at the rdplot and qnora graphs; as well as summary statistics commands such as su and tabstat combined with appropriate normality tests in our example.
'lb see if the variance of the errors is homosccdastic we can ploi the errors (residuals) against the predicted (fitted) values in
a scatterplot. In such a plot we are looking for no discernable
Ordinary least squares regression 291
tufrcin a-id tliar' he i jsiduais arc in an even band across all of the mediate \ due-, This is created by the rvfplot command. We
-JT all' tr t this using the hettest command.
T"< *is-fh issumption oi non-correlated errors is difficult to a 'r-s, ,ind v itl» moit non-experimental data it is probably safer to jsjw- tii u thcsi. exist, rather than that they don't! Cluster •ainphng stiateg t > v ill almost certainly mean this assumption is • . iar. d A .'tin, we have fallen foul of this assumption in our rumples as uc arc using household data and the people who shut the - line household are probably more alike than those «,no iifiii'i. The eftf I is to underestimate the standard errors of iji i ocflkifiirs oi the independent variables, possibly giving i.h i'lciem, -.t.iHjfual significance when they shouldn't. A coni-.'ion .ohm m v* hen usmg cross-sectional data is to use robust sM*,d,'rd lj u»s. i hr, can be done in our regression by either using we (robust) is m option or, better, as we know that individuals in * insteted ui households in our data, the cluster (hid) option:
jri-reg In inc age agesq i.s«x i.maatat, ///
cluster(hid)
W «vlm happens to the standard errors, / values and p values oinpared t > the onginal, partial, output shown below. The .or Hi, ,tou h we rammed the same but the standard errors have increased tc.ultin0 in lower / values:
-1 l 1    re, ,	0050763	.8	50	0	0 00	. OSS976-1	. '! 05 88
--.r | mil'	0000600	-18	39	(J	000 •	. 001 2369	.000998b
. |	i   .016 6677	-88	95	o	000	. OS'1 6271	. 6162917'
11>- |       •' 18	.00457?.:,	20	54	0	00 0	. 0S4964S	. 1028919
--<i   |   -   J 111 i ,	.. 000OS-IS	-tO	■10	0	00(1 -	. 0012351	. DC 101 9 7
I    -    r, 1' 1,1	.. 0166143	-SO	0 5	0	0 00 -	. 6815323	.6163895
I In hi i assumption is linked to model specification, especially m noil expell'limr.il data. It follows that if there is an omitted '•a'idbie trial is lis,; correlated with one of the independent variable t!i'ii. is uic effect of that omitted variable is in the error it'in, (ben thi emor- will be correlated with the independent >.'ii~>bh\ r,,r e\ imp' :, suppose we were investigating children's ''du. atonal attainment with a model that had parents' education, s'vial cla-s, leajoine area and number of siblings as independent uii ibi s Parent-,' i iconic is not available and so is nor in the
292 Regression
model. However, we know thai parents' education and income ate likely to be correlated. Therefore, the error term, which includes the effect of parents income, will be correlated with the included independent variable parents' educarion.
The three technical issues are discussed more as we work through our example.
We suggest that you adopt a systematic approach to regression diagnostics, and as the diagnostic commands to be used after every regression are generic you could easily copy and paste a set of diagnostic commands into a do file after each regression. This way you know that you haven't missed anything. Such an annotated do file is shown in Box 8.4.
Box 8.4: Diagnostic commands
This surnmaty of Siata commands and the assumption or technical issue they help check for is adapted from Chen et al, (2003).
Model specification
linktest   performs a link test for model specification, ovtest      performs regression specification error test for omitted variables.
Normality of errors ■ .
rdpiot      graphs a histogram of the residuals. Use
fiodit rdpiot to install, puon.       graphs a standardized normal probability plot swilk       performs the Shap;ro-Wilk It/test for normality.
Homosoedasticity
rvfplot    graphs residual-versus-fitted plot, hettast    performs Cook and Weisberg test for heteroscedasticity.
Leverage and influence
predict    create predicted values, residuals, and
measures of influence. nrSplor.    graphs residual-versus-fitted plot. lvr2plot   graphs a leverage-versus-squared-residual plot df beta      calculates DFBETAs for al! the independent
variables
Ordinary least squares regression 293
MuliiCcJlT»arity       .'       , !
vi£ caiouiates the variance inftatifen factor for the
independent variables.
An .example do'file for regression diagnostics is shown be'.c"'- There are other tests and graphs thai you may" wish to add
*- Model Specification *k lirJctest   /* performs a iiak test '. for model specification ■
Look for* _ha't being sig p< . 05 and Jnatsq being not, sig p>.C5 _hatsc; not sig means' ao omitted vara if _hatsq sig tnen omicced vars "/
ovtest       /* performs regression
specification error test for omitted variables. Lcok for p>-Q5 so not. to reject hypothesis: model has no omitted vars'/
** Normality of errors ** predict res,res /* use predict to
1      '      '     create new var res (residuals I */ predict sites,  rsta   /* use predict to
create standardized
idplot   /* graphs a histogram 'of the
residteIs, 1
Look for a normal distribution
with no outliers */ ** save graph'?
it you haven'c installed rdplot then use: hietogiam res
pnornt res   /*' graphs a standardized normal probability (P-Pj plot of res
204 Regression
Ordinary least squares regression 295
predict cooks, cooksd       create Cock's D statts critical !,       value -4/H *'/
dfbets   ' /* calculates
DFBSrSs for all the independent variables critical value-',2/sqroot -N */
su lev cooks DP* /* summary stats
■for inspection and, checking against critical values v '■
ivr^pioc   /* graphs a leverage-versus-rscjuarecl-residual plot Look for'cases with large 3 average-values */' '
** fiulti colliriearity **
vif   /' calculates the variance inflation factor" for ind vars , Look for VIF > 10 or I/VIb* {tolerance) < 0.1 */
drop res stres lev cooks DP* /*otherwise
error
after next regfrssion! */
A feature of the estimation procedures in Statu is the post-estimation commands; type help postest for an introduction. However, there is usually more specific information about post-estimation commands in the sections on the estimation commands, such as regress, themselves. Many of the post-estimation commands we cover here are straightforward to apply after a regress command, but the results can be obtained in other ways (no surprise there, then!) and most is done through the predict command and its options.
We now follow on from our regression example where we had income with a logarithmic transformation as the dependent variable and then age, age squared, sex, and marital status as
296 Regression
independent variables in a sample of employed people. While these independent variables explain almost 30% of the variance in logged income, we are not expecting this to he a satisfactory model as we all could think of a number of other factors that would have an effect on income. However, let's proceed with the diagnostics for that model using the do file commands in Box 8.4.
First, we use the two tests of model specification: linktest and ovtest.. These tests use a similar process whereby new variables are created and then tested in the model. The linktest results are more transparent as they are displayed as a usual regression output, whereas the ovtest produces just a single test statistic and its p value. We have annotated the do file to indicate what to look tor in these test results so you can see that both indicate that we have omitted variables, which is not a surprise.
. Me.ae_  Sneoi f\caticn   - '■
.   linktest    / *" p:. 'forms a .liiiK ".stji  fo'f model  specilicati on
Looir  Eor _hat oeinq s.Lg p<.03 and _ha„sq being not
hr-it-c not  sic; j items no orritted varf, > 'f .J'fitsq sig  ch==ui omitted vars   '.-'
Scurco SS a- MS Nnmoer of obg - / ■■ ------------------------------- ---- K (         4 9"- 01
''•Ir^iol. 691.VX17?        2  ?.4b.8BSS86 ;jrob > F Rer,idi;,Tl :   1. b b b .^21/  49"'0   . i 3 b 109 1 L"< B squared --------- i----------------.......------ Adj R-SC[uared
Total j   2357.263B9  497:?   . 17dl0V,<;;l -oot MSE
It: Int.;   | eoef.   Sid.   Err. z    P> | t.'     [95% Conf . :
Jiii    I     2 . OCVl'j-j     . 7fi74T3?  -2 , 55     0 .011     -3 . = 5Ci= 5^        4    4 » hdt-;-q   !     ,22-:C0H/     .0SR7f:99    3.82     0.000 -1093546 co::n         10.G?-1~ib    2.6rrJbM.4    3.81     0.000      4.377171
.   oviesl    / * pr. rforms regressicn :-;poci r icatior.  t=rror c o« t om^rtod variables
> ! ioc;k  Cor p> . 0d        ;ict to rej ect hypo t:ien.i f ■, mc
> has :tc omitted vars'/
kansriy r'.ESE'i1 te-r usinq pev-c.-t^ cf  the  fitted  vhI j«y  of ln_mc Fc:  noclol has no Oirittod variables F ! 3 .   4 9 6) =44.95 Frob > F      = 0.0000
In this next step we use the predict command to create two new variables: one for the errors or residuals and one for the standardized residuals. We examine the distribution of the errors
Ordinary least squares regression 297
Box 8.5: Alternative diagnostic commands
Tne liaktest command can be replicated by;
pradict yhat, ufo gen yhat »<3»yhac * 2
roQ In tnc yhat yhafcs^
The xb option to predict creates a new variable of the predicted (fitted) values of Y for each case. '" ' The rdplot can also be produced by;
predict r«s, r hi»tv>r/r«i« res
The r option to predict creates a new variable of the residuals (error) for each case. You may also want to examine standardized residuals, in which case use the rsta option and graph these.
predict, stx-as. rsta 'h.LanocrraiTi stres
The rvfplot can also be produced (assuming you have already created the yhvl and res variables) by.
scatter res yhat
or residuals by first visually inspecting two graphs. The rdplot command needs to he installed, so type findit rdplot and follow the instructions (see also Box 8.5). If you haven't done this, you can still use histogram res to get a similar graph. The histogram shows that there is a longer negative tail on the dis-tuluition, indicating that ir is probably negatively shewed. In the pnorm graph there is also a departure from the diagonal. You may also want to add a qnorm plot here.
predict res,res . predict stres, rsta
/* use predict to create new var res (restduals) */' /* use predict to create standardized res */
298 Regression
. ** Normality of errors »*
. rdplot   /* graphs a histogram of the residuals
> Look for a normal distribution
> wrth no outliers  */
-3 -___i________________
0 100 200 300 400 500
Frequency
. pnorrr, res   /* graphs a standardized normal
> probability   (P-P)  plot of res
> Look for plot to be close to
> diagonal k!
0.00 0.25 0.50 0.75 1.00
Empirical P[i] = i/(N-1)
Ordinary least squares regression 299
Next, we inspect the summary statistics of the two new variables of the residuals and the standardized residuals using the sv. command. We can see that there are cases with standardized residual* considerably larger than 3, or even 3.5. This indicates that wc probably hai e outliers and that the residuals may not be normal!)' distributed,
o i  res orres    / *  summary statistics lor res
Look for mean=0 and no :tin and max
Vi-l-able  |    Obs Meat. st.d.  Dev. Min Max
res  :   4973 -1.47c-10       .b79619 -2 .696645 2.102549 .ir.ies  j   4973  - .0000136    1.0001 1 9 -4. 660335 3.403345
We torrnally test the distribution of the errors using the normality tests shown in Chapter .5. These also confirm that the distribution Hi-parts from normality in both skewness and kurtosis.
.  owilk x'es   /* performs the Shaplro-wilk W teat for normality on res testing hypothesis of norma': ity so-p*,-rOo- reje.-ls V
Sncip.lro-^'iix W teat  for normal data Vr.ii able   | Obs W V 2 Pr ob>?.
res   I        4973        0.93902        29.619        8.88» 0.00000
;:M.coL  res  /*  lor:  larger  aa:npl es
tea' i ng hypothesi r. of norma.! ity sc p< . 06 rejects * /
Skeooes.s/bui losi a tests  I or horrr.al.LLy
.....- joint----
it  nbje  |   er (Skcwr.eas i  i'r ..Kurt csi si  adj  c:V: i 2 (2 » Prob>chi2
res  i 0.000 0.0OC . 0.0000
r.tat ces,   ^ b;k kur)   /* to actually see the skew and bat :: stats  r. eraember r:<-   skew      0,   to kurt  -  3 */
.-,-j.-,o!a  j        skewrieot kcrtosis
res -.4042151        3. "-'9337
ZOO Regression
The visual inspection of the graph of fitted values (predicted values) against residuals (errors) clearly shows that the varian. e of the errors is not constant across the range of fitted values. Therefore, we Have violated the assumption of homoscedasficm. This is confirmed by the statistical test which rejects the hypothesis ot constant variance.
.   'k'k Homoscedascicity
. rvfplot   /* graphs residual-versus-fitted p_ot
> Look for even distribution "no
> pattern" and possible cases of high
> influence */
.  bettest /* perforins Cook and Weisberg test
> for neteroscedasticity testing
> hypothesis of constant variance so
> p<.üb rejects  * /
Breusch-Pagan  / Cook-Weisbexg test for heteroskedasticity
Ho:  Constant variance
Variables:   fitted values ot ln_iac
cni2(1) = 190.58
Prob > chi2 = 0.00 00
Ordinary least squares regression 301
For this next part of the diagnostics we create the leverage and influence values for each case in new variables. The leverage and Cook's D values are created using the predict command, and the dffoeta command automatically produces a OF BETA value for all the independent variables. We then use the su command to make a table so that we can see if any of the values are greater than the critical values. This isn't the place to engage in a debate on the use of critical values or cut-offs, but just to say that these are rules of thumb rather than commandments set in stone. One point to think about is the effect of having N in the denominator in these calculations when using samples in the many thousands. In our current model we have eight independent variables so k - 8, and an estimation sample of 4973 so N - 4973. Accordingly, the critical values are: leverage, 0,00362; Cook's D, 0,0008; and DFBETA, 0,02836, All of the values indicate that there are cases that have undue leverage and/or influence in this model. The leverage-residual plot clearly shows that there are quite a few cases with high leverage values.
*■ Leverage and influence verage /
predict lev, predict cooks dfbeta
DFage DFagesq _Xsex_2
DF ..Imastat DF.. Iraastat DF_.Lmastat
/'■' create leverage values
L leal   value  2lk+l)/K */ /" create Cock's D stats
critical value 4/K */ /"  calculates DFBETAS for all the independent variables cri t i c.a 1 va "I ue 2 /sqroot N DFbeta(age) DFbeta(agesq) DFbeta (__lsex_2] DFbe t: a („„Imas t at_. DFbeL-a(_ImasLat-._ DFbeta (_._Imastat... DFbeta( _imastat._ DFbeta (_imastat__6)
;i] lev cooks DF"-*
/■'■ summary stats for inspection and checking against crit ical values "•" /
Mean Std.   Dev. Min JV
302 Regression
.025 -.02 h *
0 .001 .002 .003 .004
Normalized residual squared
finally, wí examine -whether an) oi the independent variables are collmear as a check for muiticollinearity. As a precursor to regression you should be looking at the bivariate associations between potential independent variables which would give an early warning about issues of muiticollinearity. The viř command produces the variance inflation factor and the tolerance, which is simply the reciprocal of the variance inflation factor and preferred by some users. Our results show that there is coliincarity between age and age squared, bat is to be expected as they have an almost perfect linear correlation! All of the other independent variables have variance inflation factors less than 10 or tolerances greater than 0.1, which shows that muiticollinearity docs not exist.
.   1 ■   Mj_.-ic<;j .1 í.:ik.-ji i ty **
.  s'if    i'k calculaoei; ..l-io var iar.ce inflation factor for ire vat... Look for yrf > 10 or  1/VfF  (tolerance)   <  0.1 /
age I 46.S3 0.021356
aaeeq : -13.1 7 0.023162
„iinastat. .6 - l.'/o c. 3136526
_lmu3tat_2 i 1.16 C.S61387
_Tiriasl:at..J | 1.06 0.94629.3
...Iiu^oi ai._4 I 1.03 0.972C71
_Ioox._2 j 1 o'l" 0-980592.
_lnastar  :, | l.u 0.986C29
Moan VI? | ±2.12
drop oes sores lev cooks Dl* /*
regessior.!
Logistic regression
v ">hat dov ail this mean tor our regression model? In terms •1 noli 1  p< edicdtion, it is not surprising that these results indi-
< in- di <t v> e '> n e on ^oed variables; no one would think that age, >\   ij r iri'V stitns alone would satisfactorily explain varia-
" «i> ui >n orni Somi of the omitted variables could be education,
* oil eae ence, md ,ector of industry, for example. The errors ■ . it diul a e v cl' dispersed beyond the normal distribution,
• nli ome -t uidaidized residuals beyond 3.5. The error terms are
0 o 'i.o i mi da tn which is more than likely linked with the n > i, ti< o 1 -;\< if ition, A good number of cases have large ii • ci^gf mf >i mf df-nce which could be linked to the outliers seen in the residuals, but not necessarily so. However, we are confident that we do not have multicollinearity, which is at least one thing going for this model at this stage. Clearly, quite a lot mure work needs lo tie done before we obtain a more satisfactory model,
LOGISTIC REGRESSION
1 m, .tic •tit-' moii (ako , illt'11 '1 it oi, to distinguish it from othci tic ft itf-con^d depuideni » unhifs, canary logit or binary k<gLti- regno" ui  is tiird roi ittue'-sioi! with a dichotomous
rpcrt'iei t itia'di St ttt s io<ji c omiii tnd has the same genera! KHii'il as tigress, luc Qepencieiii variable should lie a 0/1 i'lrhTumy; for analytic purposes a 0 is referred to as a failure and I is i success, regardless of the substantive meaning of the en ib'e^ for more discussion on the details and application of ioge in resitession, see Long (1997), Long and Freese (2006), or Mrnari (2002).
Main users prefer the logistic command to logit. Results ne the same regardless of which you use, but the logistic com maud i'ports odds ratios (Box 8.6) rather than logit coefficients by di fault.
fn this example, we wall look at the outcome of whether or not a pet son has a first degree or higher, derived from the variable •\l>u (ioi the variable educ, higher degree = 1, first degree = 2). Th'aefoir, to construct die dichotomous or binary vanabic:
recede ettac  (1/2=1)   (3/max=0},gen.(degree)
sve will also use die whole sample of the example data. So, if we wi-re following on from the above example looking at income as
304 Regression
the dependent variable in a sample of those working, we would need to open the data again. And not forgetting to recede the missing values as well!
As the sample contains people aged 16" and older, it is unlikely that the younger people in the sample would have had the opportunity to gain a degree so we'll restrict this analysis to those aged 25 and older by using the command
drop if age<25
First, we will examine if sex and age are determinants of having a degree:
xi:logit degree I.sex age
Ili ex. 1-2
Luiaiiy  coded;   . Isex.1
T I.(.!■ rdl. i on I Lex&Lion Iteration 2 : Iteration Ideration 1:
"I og "I i kelihood
log likelihood
log likelihood
log likelihood
log likelihood
-2301 . 563 -2189.489 -2180.bibb -2180 . bObJS 2180.5053
Logistic regression
Log likelihood
Number of obs - 839C
LK chi2[2) = 242 12
Prob > chi2 - 0.OOuC
Pseudo R2 -■■ 0 .0!
_Isox_2   !   - . 4637623     .C830406    -5.58  0 .000 age   |   -.0406401     . 003 083 4 -13.1H 0.000 ..con;;   I     -.£-23015     .13811.17    -3.06 0.002
- . G,765189 - .3010057
- . 04(,6H3 5 - . 0345^ , '
693789    -.152401
Compare the output from the logit command above with the
output from the logistic command below. You can see that odds ratios are presented instead of coefficients but the z and p values are identical, as are the model fit statistics reported in the top right-hand panel.
xi:logistic degree i.sex age
Logistic regression 305
decree - . se:/ age
locx.1-2 I'ic.Luiiil,  codec.    _J'?r-^-_l omicT&clj
LR chi/. !2 I    ' -  242. 12
orob > chi/ = 0.3000
degree     Odds Ranc Std, Srr. i t>|z|   L95;s Conf. .nrerval]
__5=:lJ   I .328913     .0222253      5.58   3.000       .=344491 .7400735
ags   1       .3 501746     .0729606  -1.7. IS   3.000       .9543394 .9659949
Using the or option with the logit command will give you the same results as using the logistic command, such as:
logit degree i.s«x age, or
Box S.S: Odds ratios
Odds ratios are sometimes preferred over logit coefficients for their ease of interpretation. Th^ logit coefficient.! report the effect of the independent variable on {he loc,ant!i.T-: of ihu odds of being in the 1 category of the dependent variable compared to being in the 0 category. Or, in our example, c-f fiavsig a ciogroc compared to not having one. So. the logit coefficient for snur in cur example is -•0.464 which v»e would interpret .«y:ng tr:sl women have, on average, 0.484 less of the logarithm ol i!ie- of ds o'< hsvng a degree.
Ocids ratios are the exponential of ttie -ogit coefficient. Here we simply take the exponenfral of the loqrt coefficient using the calculator in Stata:
display «tp (-. iii ) j
. display e5ip(-.464j .62876355
-So, the odds ratio for the variable, sex is 0.63. Govts ratios range from 0 to +», with the value for <io effect bsang equal to 1. This means that odds ratios Ipwgrthan 1 are 'negative eliects araj'odcjs ratios greater than 1 are 'positive* effeers. "''he odds ratio of 0 63 in our example is interpreted as 3?°/o less 'Ikeiy to have a degree. Ah odds ratio of 1.43 would be interpreted as 42 Vb n'ce likely, and ;o or. As you —ay gtrheo .he ranrjc abo'.'c 1 i-j greater man '.hat bcicv ! n+.ii-.h is ooi.reioc tj zuro. w, coiiparo-.g trie i-sgnitude of effects either side of 1 is not straightforward aid needs care.
308 Regression
tr rirl t 14 f >- e results indicate that women are less likely to 1 n d , 'i i ien (odds ratio 0.63) and that as age increases it   lil h '    k l     duced I i -f )    i   it 1-, n/dlil el  thm the
^, i ^    ci lei i r- i  const m   \ith all \ i ie v 1 \ \\   i 1 ev
1 at m tnr r-=   r> i<_h lewer   ( t en \ ent tc uivel n   1<~ is ill • fore possible that there is m nreru t op betv een * l   lid is n that the effect of age varies across sexes, for more detans on interaction effects in logistic rep •>•> < it, s   1 .ctrdi2'"l T<"  11 tt t this by including an lnteravti^n t iHi is dc^rd.•--< jt ha   1 1
xi: logistic degree i.sex*age
i . sex ''7p:'_ lr.aCirct_ ly coaled: „Isex„l otr.i r r ;-.r )
royiuliu leqre&slon X':TiK;e:.'  of obs   ~ SjS'J
troo > chiS       = 0.0000 Log  livelihood-    2176.2S61 Prifiudo K2 ^ O.0S44
deyiee  j  OuxU e,-t:'r» stcj.   Krr. r, e> i z |   '95% Coiil. InlezvalJ
1.2241Sb .^062/7 1.01 0.2?7 77 81 : 104 2.744677: .767/960 .2038777 -ft. 17 0,000 .9607267 .97S1266 .0019062     .0062061  -2.09  O.uu-1      .7.61)0176 .92412-6
I he k ult> indi< air tUat die interaction term (_IsexXage__2) has a ->"tittic tnt loeffhiem which tells us that the effect of age varies
i nio>« seses cm, imm rsely, the effect of sex varies .with age.
1 To get a clearer picture of what this means it's a good idea to graph interaction effects. We can easily do this by using the predict post-estimation command to calculate predicted, or fitted, values.
predict yhat,xb
'J'hen use the pull-down menu:
Graphics —. Twoway graph
Then create two plots in a similar way to that described m Box 6.7 but with the following entries:
Other regression commands 307
Pioti; X axis -- jge, Y axis = ybat, if/in tab - sex=='i Plot 2: X axis = iige, Y axis = yhat, if/in tab - sex==2 Legend tab - select Override default keys and type 1 'men' 2 'women' in the box.
Add titles as you wish.
The graph shows that with increasing age both men and women are less likely to have a degree than younger people. But when looking at the effect of age on each sex, you can see that at age 25 there is little difference in the likelihood of having a degree but that the gap increases as age increases. Therefore, the gender gap increases as age increases, which makes substantive sense from what we know about recent history of university admissions and accessibility. It is worth noting that the Y-axis units are in logit (log of the odds). Compare this with the results shown in Box 8.7.
Predicted values or the log odds of having a degree by sex and age
20 40 60 80 100
age
OTHER REGRESSION COMMANDS
Basic regression commands in Stata generally have the same structure in that the command is followed by the dependent variable and then a list of independent variables. There are many other regression models; if you wish to extend your knowledge of regression models with categorical or count dependent variables, then we recommend you use Long (1997) or Long and freest (2006).
308 Regression
Sox 8.7: Post-estimation eomnwxuis
There are a series of very 'useful post-estimitiDrt comrrtanrjs avail-'-able to download' if your copy of State is web enabled. These are based on the sposcado commands developed by Long and Frees* (2006). Ypu must first mstaif the spostado files which can be found by typing iiiujit spostado. The new box will ■shew a link to-:
spost? _»do from https//www.indiasa.edu/ -jeisoc/ittata
Click on this link which will take you to another new box and then simo'y click where it says click here to install.
Follow the same three steps to install the postgr3 and x.ii commands developed by Michael Mitchell and Phil Erider at UCLA: Academic Technology Services, Statistical Consulting Group. See www:ats.ucfa.edu/stat/stata/ado/arfalysisA
Now remnthe logistic regression using.#ie xi3-prefix:
xi3 r logistic; degree i. sewage
. ...3.3 : Iwmti,- J^-im:.' i.^ex'nge
±   ...     _ ,..ctiuj.oi   _.ic   _i   s_l vr- - j
L->7i3ti~ ^"i^ 1      '-- Jr -
T.F cLi? (3) - 250.il p^jjo ^ - j.inKII
r j iut-iil,<   - -i^* ~v_ recuse - ion
coarse i  udds r«Cio 'JCCt. Err. i    T"}-j     V i Tire vqi]
_   | -11V,       "3J1-—   iij   y 2n        ^Sliiri-i 244.--'
acf,-   |       .<ib/.'<>*-       m     "      0 0 ,fi >&U22^ 7S42rr
Then use the posters command to produce a graph of the probability of having a degree by age for both men arid women. The * lines foi men and women are- produced by using the by (sex) option. Try omitting this and see what graph is produced. The graph varies from the one produced by using linear fitted values as this one has probability of having a degree on the Y axis. In sortie ways this is eas:er to understand than fitted logit values. There aie many other ways to use this, and other post-estimation graphing.
Other regression commands 309
commands; for mote information, sea Loag and Fresse (2006) and the.UCLA website, '1 ;
, posr.yx'3 age, by (set-)      - ' Variables left: asis: ac,e _Xse:c__2 _lse2Xag
Here in brief are some of the more common regression commands:
• mlogit - multinomial logit regression for nominal dependent variables with three or more categories. Note that there is not a mlogistic command. Relative risk ratios are reported if the rrr option is used.
• ologit - ordered logit regression for an ordinal dependent variable. Again, there is not an ologistic command, but if you wish to show odds ratios then use the or option.
• probit - binary probit regression. Probit is the other main method for analysing binary dependent variables. Whereas logit (or logistic) regression is based on log odds, probit uses the cumulative normal probability distribution.
• mpiobit - multinomial probit regression. Probit for nominal dependent variables with three or more categories.
• oprobit - ordered probit regression. Probit for an ordinal dependent variable.
• poisson - Poisson regression for a count (non-negative integers) dependent variable.
ixbreg - negative binomial regression tor a count variable that is overdispersed. A Poisson distribution is a special case of the negative binomial tamily, and a dependent variable with a true Poisson distribution can also be estimated using the jjbreg command.
Sox 3.8: Weighing
If your OoM i» simofy -weighted then Sta*o can tis„ t v> v e gnts in e number o! ways, depenoirg on how and why the weights were constructed - see help weigbt vVe-ghT'iiy is a oomplicaied (and controversial) issue and it is beyond the scope of this book to go into the whys and wherefores of f Briefly, you can weight fah'esand most usiirnatnr. prcKedtses by iddmq , weight ppber to the conmand lue .n «t;u=tr*- b;ackcs Hare v,e soo<" two sv arrples of using weights ,n a crosstabulatioa ano 11 a regression model. The weighting variable is weight.
>"i;rosr gh^ocal* x-sajj air* i.rmaatat  [j>w-*waightj
R«-w 8,9: finpMrM'C^n of iwi"".icn tnidel'Ing m a research project
In a seiies of analyses using nan from the First wave ot the Canadian Nateae! Longitudinal Sur/ey ot Childien ark". Youth (NLSCr) »"e weis i"tcro- ted ,n the is'rots as»0" laird v/.th b.rtli outcomes flow birthweight, preterm birth and small for gestational age! and then how birth outcomes affected motor and social development in very young children.
In the hrsi analysis we had Voe dichotomies btithweiqhi outcomes 1 oft oirthweigls (LBVV) war defined as »hose cSrIJ'-n born weighing less than 2500 g, ptete'n. 'r»r!ti was Dirth at 258 days' gestation or lea: and small lor gestational at,e !SQA} was. defined as those under the "0T| percentile of the gestational grcjv'h e irve<- .oe^-t d< l,nfr .% outcome-, "se'e * e oe, th der>t variables. 'P a ssiiea of logrst'C regression inodeir. w:rh soda!, env:roran;r.tal anrl mother's behavioural variables as* ndepcrtdf n* varahles, vVe p'eseoref1 cur results a° udeb rat ob ris v\f r areoor-E*d 3l' of cu' (ndop°hdhi>* vanablts and use-! durriny vauables ?r
Demonstration exercise    311
they could shov; either art .ncease'd or decreased ríšu c LBWoi 3-3A companari la the reference category of the íňdep-'-d-ent vari&o.e.
'n the second anatys.s- we examined how birihweight. this iip-e classified as LBW (less than 2500 g) and VU3VV ttass i- ,r> 1500 g; compared to r-ormaL was associated w.th mo a.ri-.ť!-,-c..->í5.-t at =5.":.- no .0 'B :r.c~ta- -ej of the eVtts ca temily and social variables. The motor and social deve) fMSDl scale used in th<= analysis was an interval level scale ores>:-d horn a number of items 'see our comments ir Box S.ti whicn was reasonably normally distributed as the scale creation was designed so that the 'average' child tor their age scored i 00. Tf's enabled us to use a series of OLS regression models to "6 the affects of the ;ndeperiden! variables on the MSD so; jsed 1 lasted models to .osí Ux tnediattňo. effects and also it"a 0 foi inlei actions {or moderating effects; see this chapter ant •" . ter 91 VVe tound ttiar Sheie was a significant interaction betwoor mother's education ano' turf hweighl1 which indicated that I he Ic,-j        birth weight children with higher educated mothers had V-c-.i-ii" MSD scores of about 100, white low birtftweight ohildrcr with mothers with krwer education had MSD scoies less tharj 90. vVe presented this interaction as a graph to bettei convey the motls: anr.g effects ot hu thAeiýľU and education on the MSD scale
í ' Pcvslin, D.I., Wide, 1 J , Branmaon, ŕ ind Sauve ■» f?Q011 Beyond b Dice
Tie social content of prcriatai behaviour and birth outcomes. Social snďF /tíeoic/ae, 4b- 2U3-ySÔ.
Pňvitiii. nj„ Wade, fj s.nd Biannígan A. (2003) Parental asses f^urly ' hildh03d development: Biobqiral an<
,: Oeveiopmenr, 5 2: 167- !Va
DEMONSTRATION EXERCISE
, Iii , iptet ' vi ni iiupulated the individual level variables and iHiil a new uita s.t called demodatal.dta. In Chapter 4 we d i household level variable indicating the region of the
I mi, ti i onto the unlit idual level data and saved the data with a n<\ h 1P1 diuiofiti7 dta. In Chapter 5 we examined the variables ."  1 ' nunc 'in th< r distribution, measures of central tendency
I me, 11 i nitinuou variables, their normality. In Chapter 6 we e^ inuiiri J t ro- sei in mean GI h.) scale scores across groups in the
312 Regression
factors but did not formally test for differences. The dichotomous indicator was tested using the tab command and measures of association. Correlations between the GITQ scale and interval level factors were produced. In Chapter 7 we formally tested for ditfeienc.es of mean CHQ scores and proportions above the threshold of the dichotomous GHQ indicator between groups.
At this stage of this demonstration we use multivariate OLS regression with, the GHQ scale as the dependent variable and then use multivariate binary logistic regression with the dichotomous GHQ indicator as the dependent variable, in these models we use all of the factors we are interested in to assess their net effects on mental well-being.
In this first regression model we use the xi: prefix as we have a number of categorical independent variables which need to be converted into indicator or dummy variables. We also use the age categories to sec if the association with age is linear or non-linear.
xisreg ghqscalc female i.agecat 1.marst2 ///
i.empstat i.rrumcfad i.region2
In the output below we have put the significant coefficients in bold for easier identification. These results indicate that women have on average higher GHQ scores by J.05 points. The second age category (33-50 years) has significantly higher GHQ scores than the reference (youngest) category (18-32 years), whereas the third category (51-65 years) is not significantly different from the reference category. This suggests that the association is non-linear and possibly could be better defined with a quadratic term for age. The dummy variables tor marital status categories show that those who are married are not significantly different from the reference category (single) but those who are separated or divorced (category 3) and widowed (category 4) have significantly higher GHQ scores than those who are single. Most of the people ill this sample are married, so it may he more appropriate to use the married category as the reference, and wc will change this in the next regression model. For employment status, three of the categories (unemployed, long term sick and familv care) have significantly higher GHQ scores than the reference category (employed). Those with one or two children in the household have significantly higher GHQ scores than those wdth no children, but those with three or more children are not significantly different from those with no children.
Demonstration exercise 313
	.ag---r. ■	■=r.:   - . er^s""   / / /
		
idyiäCiät.. 1 -3	iraturöi.v	c oded;  _Iage.cai._l  i-ji.i tucid
línarsL2_l-4		coded;   _Imars--2_l   jr.j l.-.s:3
Iempr;tat_l -6	inr-tij-ó.lly	coded;  _Ieir.ps-a.L . 1 om . 7„t-s
TnuTTchr=...l-3	('natural ly	coded;  mIiijmchd . 1 orr.i~r.rid
Ir;;giori2._"l ■ 7	(naturally	coceci;  „Ireei j:i2 _ 1 omirtc
Model I lb'j42.5628 19 Residual   I   1 -:bb ,'b . 003 ^663
Tctal   I 190
of obs
66a :■
'.'"688 J i . Ü 4 0 . CO00
: . 0776 4 . 7851
Coef.   Std. Err.
femala _Xagecat_2
_Ii"arst3„3 _Imarst2_.il _Iampstat_2 _Ie--pstat_~
_TejTipstac_'3 _Iampstat_5
_Ienpstat_6 _'XTXuraOtid_2
_Inuxnc'hd 3 _Irsqií;;i".: .Z
1.051762 .3340761
1.597959 1.489015 2.912573 5.155082
1.123951
.4985591
. 11Z9439 .13 6=096
8.92 0.000 2.44 0.015
6.13 0.00Q
3.63 0.000
12.76 0.000
15.74 0.000
".73 0 . C 8 4
5-90 0,000
-S205GC.1 .0661272
val ] 296 =
■eqiíónO 5
There are no significant differences for the dummy variables for region of the country compared to the reference category of London. However, if you examine the coefficients more closely you can see that category 6' (Wales) is 0.417 higher than the reference category and category 4 (Northwest) is 0.292 lower than the reference category. This difference might be significant, but is not tested in this model. We can, however, test this with a post-estimation command:
test _iregioa2_6~ __Irsgion2_4
314 Regression
.   :esc __region2_6-  __ireg i.'jn2 _4
(  1)   - _Zreglon2_4 -i  . ,~regicn2__6 = 0
F (  1,   76 6 8)   = 5.84 Prob > F - 0.C157
The output above tests for a difference between the two coefficients and the p. value of the test is less than 0.05 which suggests that they are different. However, for a variable such as region! we might want to see if the regions are significantly different from rite overall sample mean rather than choose a reference category.
From our observations above, we need to adjust some of the variables and commands to re-estimate this regression model. First, we wish to capture the non-linear nature of the association between age and GHQ score by adding a squared age term to the model. Therefore, we need to create a new variable for the squared value of age.
gen age2=ageA2
Next, we want to change the reference categoty lor maritai status to married (category ?.).
char marst2  [omit] 2
Finally, we want the coefficients for the region categories to be compared to the overall or grand mean in the sample. To do this we need to have downloaded the xi3 command/prefix (see Box 8.7'). Dummy variables that indicate differences from the grand mean are usually referred to as effect coding, and this is done by prefixing the regression command with xi3: and then prefixing the region variable with (rather than So, the new regtes-sion looks like this:
xi3:reg ghqscale female age age2 i.marst2 /// i.empstat i.numchd e.region2
.I::.iiia22 2 G:.dLl:e~; T.emoi2aL„l jmiLted .Ln-.;m-hd_! oni r.rod: 2ieyicn2   1 oruiLled
.  xi.o.'ey gtiqr.o;Cle  ;e~alc b.jc age2  i.ma:-:st2 /// i.eE.psld'- i..a_melld e.j:eqio:i2
.yalj:r,2 TpTT'p;:tat   1-G Maturally coded; _
:. . n„LY:^:iG        _InjiiichLi_l~3 .:ia2uial2y ccd.^d;
o.rogionz Irer/ior22 i. . ::ia^atal2v coded;
Demonstration exercise 315
Source	?£■	Sf		'AS		Number of obs		7 68 8
						t {	19, 7668)	.35.66
Kcde_	Ib491 - 6833	19 sib.	3bl7b4			?r	ob > ¥	- 0.0 0CO
Residua-	175325.833	766:-! 22.8	64 L	■163		R-	sparse	-- 0.0S12
------............						Ad	i ^.-squared	= 0.0789
TctcU	I9QS1 / .666	7657 24.8				Koct I'S'iL		- 4.7817
ghqscalc	Cocf.	3rd. Err.		t	P>|,		;95% c:onf.	ntsrval]
female	1.045367	. 1.1793 95	8	as	a	boo	.814173	] .27 656
age		.0342705	4	38	0	000	.083067	.2174259
age 2	-.0018732	. 000422	-4	44	0	aoo	-.00-7005	-■ . 001U46
__Imar s 12 „1	-.U4644/9	.1816749	■0	26	3	798	-.4025804	. 3096846
Imarst2_3	1.415416	.2165824	6	54	0	0QO	.990855	1 . 83997'/
_Iraarst.2_4	1.438633	.3810207	3	78	0	000	.6917286	2.135538
_Jenjpstat_2	2.973164	.2288571	12	39	0		2 ,b24b41	3 . 421786
_Iempstat._3	5.182441	.3276309	.15	82	0	000	4.640391	b.824491
Iempstat.4	.7544568		2	18	0	829	,0/b2034	1.43371
Iempstat b	1.166145	.1902912	©	13	0	000	.7931224	1. b391bd
v     * _	.1311709	.3059574	0	43	0	668	-.468=892	.730931
lnumchd_^	.4154971	.1458797	2	85	0	004	.129633	.7014611
Iiiufiichd 3	.2894564	-2449152	1	18	0	237	-.1906445	. 7 6 95 572
. : '.;<.i .i :>n,;	- .1058743	.1047368	... |	01	0	312	- . 3111871	.0994386
	- .064'072	.11-73939	(i	50	0	615	- . 3138341	. 1.3561 97
	-. 196559	.1 55-. 6 97	1		0	2 05	-.503734	.11,76;61
■ rcyri r:■ ti     ■ i	- .041:: 748	. 1305821	■ fi	32	0	753	- .2971515	.2148019
_Iregions_o	.5126046	1   (I i	1	41	0	016	. 0948849	.9303244
_ ill"	- . 1 743432	U ()<■    ii	... |	08	0	379	- . 4897504	.141064
	6 .733.737	. 665037	"10			CfiO	5.49108?.	R .C98.3S1
The significant coefficients for both the age and age'2 variables indicate that we were correct ro model a non-linear association, and the positive coefficient for the age variable and the negative coefficient for the age'2 variable show that the association first increases with age and then decreases in an inverted U shape.
The categories of marital status show that those separated or divorced and those widowed have significantly higher GHQ
1 scores than those who are married. If you examine the dummy variables for marital status you can see that now category 2
I     {.....Imarst2_2) is missing and is therefore the reference category.
The dummy variables for employment status were not altered but you can see that the coefficient for category 4 \_Jempstat_4)
I is now significant, which it wasn't in the first regression model. The coefficient is larger, which may have resulted from the better specification of other variables in the model.
* The categories of the region! variable now show differences
from the grand mean of the sample. Now they indicate that
t
category 6 Uregion2j$) has significantly higher GHQ scores than the sample average. You can see that even though the coefficients now show difference from the grand mean, one of the categories is still missing. If you wish to find the difference for this categon then you can rerun the regression omitting another category of the region! variable. Omitting category 2 produces the following extract of results:
1
. 45
uo.ao^o is '355t i.i 50 ii.io: ..100241
-lSOOOO .155:60. -1.27 0.205 -.6007i4
.1-1. '10 . Iln?021 -i..V 0 7-7 - 2: il515
512604S .2100926 2.41 0.016 .004084:4
Box 8.TO: Graphing effect coded categorical variables
Effect coding categorical independent variables gives you the opportunity to graph the information in a way that is intuitively attractive and logical. Copy and paste the regression results into an Excel spreadsheet (see Chapter 2) and then add the coefficient and confidence interval for the omitted category from rerunning the regression with another category omitted (by using the e:l-.»c commands. Now you can graph the categories' differences from the grand mean along with the 95% confidence interval, and the resulting graph shows very clearly that, on average, Wales has significantly higher GHQ scores than the sample avet-age after controlling for sex, age, marital status, employment status and number of children. ,
1.2
~-T=
0.8 0.6 C.4 0.2 0
-0,2-
-0.4
-0.8
Regional variations in adjusted GHQ score from , overall average (95% CI bars)
r
Demonstration exercise 317
Alternatively, you can add the coefficient from all the other categories, and then the difference from zero is the omitted coefficient. I-or example, from the extract, 0.06.945 - 0.06410 - 0.19655 -0.04117 + 0.51260 - 0.17434 = 0.I0.5S7. The difference from ii.ro is -0.10587 (as with effect coding all the differences add to itrOy see Box 8.10) which, if you check the previous output, is the ^efficient for category 2 (_Iregionl_2).
Now we run a logistic regression using the binary GHQ indi-•-jtor {d_ghq) and the same independent variables as on p. 312.
xi.3 : logistic d_ghq female i .agecat i .marst2 /// i .exnpstat i . numchd e. region2
■gisU,   d_ghq t
natu-.Vl 1 y coaeu
■nar.uraily coded
naturally cortr:.1
nnt-'ir.Tl '.y cixieU
""lj r.uri-il-y code1:!
_TmarKr.2__2 omitted) ...lGmpatdl._l omitted > __IniLJckä_l omitted) „Xregior?   1 o-dtLed.i
Number LP chi2
cb-s -
339.92
:5 ghq   | odds
ale I
Ml 11
6.15847 r,34019
318 Regression
ilie results of the logistic regression show similar associations to those in the OLS regression models. One noticeable difference is the association with age. In the OLS models the association was non-linear and best captured with a quadratic term, but the above output, using dummy variables for the age categories (shaded), shows a decreasing likelihood of being over the GHQ threshold with age, thus suggesting that a linear term can capture the association. Using the interval-level age variable produced the following coefficient in the logistic regression (other output omitted):
cl .ghq  |   Oetis Sdt:o 3td.   E-r
z    ?>|z(   [05%  ConF. rnrer,'7i]
_.217405 1.67 87 01 .9806373 .9055217
.0 91S5 .00306
0 . 000 0 . 061