http://t0.gstatic.com/images?q=tbn:ANd9GcQb2QuGOvKAur_yqzAgxcofaV0z84NuRsyXu8ZzhaREzoQTvzDy
Lecture 5
Handling missing data
R101: A practical guide to making R your everyday statistical tool (PSY532)

Programme (lecture and seminar)
•Types of missingness: MCAR, MAR and MNAR
–Little’s MCAR test
•Older (but still common) approaches to missing data
•EM single imputation: an improved and commonly used approach available in SPSS
•The advantages of multiple imputation
•Multiple imputation options in R:
─norm package (and its problems)
─mice package (the most commonly used option)
•Planned missingness: designing a study in which each participant answers only a subset of survey
questions
–Multiple imputation can then be used on the data
•Structural equation modelling packages often have their own Full Information Maximum Likelihood
(FIML) algorithm for taking missing data into account when calculating model parameters. We do not
discuss these algorithms here, but they are covered in the readings.

Non-response patterns and types of missingness
Schafer  & Graham 2002
Univariate
Monotone
Arbitrary
MCAR
CoM not related to X or Y
CoM not related to any Y
CoM not related to any Y
MAR
CoM related to X (and possibly Y, but not once X is taken into account)
CoM related  to one or more Ys prior to drop-out
Possibly different set of Ys related to CoM for each participant 1 to N
MNAR
CoM related to Y and possibly X
CoM related to one or more Ys after drop-out (unseen responses)
CoM is related to an unobserved variable, which is related to one or more Ys
Causes of missingness (CoM): Being physically unable to show up, not being sure, concerns about
privacy, etc.

1, 2... N:  Participants
X1, X2... Xp: Predictor variables
Y1, Y2... Yp: Outcome variables/ any variables in arbitrary non-response patterns

•MCAR, pure MAR, and pure MNAR really never exist. Nevertheless, they are useful guidelines for
thinking about the sources and consequences of missing data.
•A number of tests have been proposed for MCAR (see script):
–Little’s MCAR test, using the LittleMCAR function in the BaylorEdPsych package (suitable for
datasets with up to 50 variables)
–a visual approach, using the marginplot function in the VIM package. This approach is also useful
for detecting whether certain variables in the data set predict patterns of missingness, implying
MAR.
–Hawkins test of MCAR available in the MissMech package; not covered here
•The multiple imputation methods in this lecture assume that data is not MNAR (i.e., it has to pass
the test for MCAR or come from a study where the study design ensures against MNAR)
•The multiple imputation methods in the lecture are applied to a univariate non-response pattern,
but are transferable to the monotonic and arbitrary patterns.
In the script, we conduct Little’s MCAR test on a modified version of the SS data set. The version
has a random 5% of data removed for all variables except the experimental condition and
pre-experimental measures. Data is missing for the illusion of control measures, “It was all
chance”, the question about strategy (yes/no), the question about the number of wins in the next
100 rounds, etc. See lines 31-42.
Go to script

Older (but still common) approaches
Reading: All references marked *
Description
Problems
Available case analysis (listwise deletion)
Delete all participants who are missing values on any variable included in an analysis. Default
option when running ANOVAs, regressions etc. in SPSS.
•Possible only with MCAR, or with MAR if the analysis includes the variable relating to the CoM as
a covariate
•Reduces power of the analyses
•Ns are different across analyses
Averaging the available responses
If a person is missing some but not all responses on a scale that has a mean score, calculate that
mean based only on the available items (e.g., the average of 6 items rather than 8)
•May introduce bias under MCAR by adding additional variability to the survey
•Conceptually problematic if some items are more likely to be missing  (i.e., MAR or MNAR), yet the
survey score is redefined as the average of available items rather than defined items

Description
Problems
Mean substitution
Substitute each missing value with the average of other participants’ values on the item
The average of the variable is
preserved, but other aspects of its distribution are altered: variance, quantiles, and correlation
with the responses of other participants.
Hot deck imputation
Substitute each missing value with a value randomly drawn from other participants’ values
Correlations in responses across participants increase.
Regression-based single imputation
First, divide participants into those with a variable Y, and those for whom Y is missing. Then,
estimate a regression model in the first group (X1 , X2 ... Xp predicting Y) and calculate Y for
the second group based on that regression model. Available in SPSS.
•The imputed data points do not depart from the regression line, which makes them different (less
variable) to the observed data points.  SPSS adds some error to each data point to partially
correct for this.
•Unless the Xs and Y are strongly related, the relationship between them is inflated.

EM single imputation
Reading: All except #
•EM single imputation is currently one of the most common approaches to missing data because of its
easy implementation in SPSS.
•
•E-step
•Regression-based single imputation: First, divide participants into those with a variable Y, and
those for whom Y is missing. Then, estimate a regression model in the first group (X1 , X2 ...
Xp predicting Y) and calculate Y for the second group (the “missingness” group) based on the
regression equation.
•
•M-step
•Regression models can also be fitted based on means and estimates of “variance” and “covariance”.
These can be calculated from a data set. At the M-step, these are calculated from the filled-in
data set of the E-step, resulting in a new regression model. There is then another E-step to
calculate Y in the “missingness” group.
•
•The steps are repeated until they produce the same estimates for Y in the “missingness” group
(i.e., “converge”).

Advantages of multiple imputation
Reading: All except #. In particular, Graham (2009).
What is it?
1.Use EM-based or regression-based procedures to generate a number of imputed data sets. Each data
set might contain a different imputed value for each missing value. Researchers tend to generate
anywhere between 5 and 50 data sets, and it has been recommended that the number correspond to the
percentage of missing cases.
2.Conduct your planned analysis (ANOVA, regression, generalized linear model, etc.) on each data
set from Step 1.
3.“MI inference”: Combine the results of the analyses into a pooled result using “Rubin’s rules”
(e.g., one rule is that point estimates for the parameters are simply averaged across analyses).
Its advantages over single EM
•Like single-imputation EM, these procedures overcome the problem of lack of variance in
regression-based single imputation.
•Additionally, they overcome the problem that, in both EM and regression-based single imputation,
the regression model is based on a single sample.

•Cautionary note
•Include as many variables without missing values as possible in the procedure to avoid predicting
missing values based only on the variables used as predictors in the analyses at Step 2.

A diagram of the steps in multiple imputation
From publicly available course material by Stef van Buuren, author of the mi package


Multiple imputation options in R


norm package: Multiple imputation through data augmentation
Reading: All references marked da
•Multiple imputation procedure based on the single-imputation EM approach. •As we shall see in
applying norm to our modified SS data set with 5% of cases missing, the procedure can impute
negative values where the observed values are only positive. This is a problem if the analyses in
Step 2 of the multiple imputation process involve generalized linear modelling with Poisson or
negative binomial random components. These are not suitable for outcome variables with negative
values. •If there are categorical variables among those missing, the mix package needs to be used,
for which you will need the package author’s, J. S. Schafer’s, book “Analysis of Incomplete
Multivariate Data” (1997; Chapters 7-9).
•The online manual for the package is also very useful.
•
Go to script

mice package: Multiple imputation through chained equations
Reading: All references marked ce
•Depending on whether the scale of the variable with missing values is numeric or categorical
(factor-based), this procedure uses one of the following regression methods to impute missing
values. The pmm and logreg methods are defaults for numeric and categorical variables respectively.
Most of the regression methods are Bayesian and all the other variables in the data set (even those
with missing values) act as predictors.
Go to script

Reading
•Azur, M. J., Stuart, E. A., Frangakis, C., Leaf, P. J. (2011). Multiple imputation by chained
equations: what is it and how does it work? International Journal of Methods in Psychiatric
Research, 20, 40-49. Will be made available online as course material. ce
•
•Graham, J. W. (2012). Missing Data: Analysis and Design. Springer Science+Business Media: New
York. Chapter 2 “Analysis of missing data”. Available as “sample pages” online here:
http://www.springer.com/statistics/social+sciences+%26+law/book/978-1-4614-4017-8 * da
•
•Graham, J. W. (2009). Missing data analysis: Making it work in the real world. Annual Review of
Psychology, 60, 549-576. Will be made available online as course material. * da
•
•Graham, J. W., Cumsille, P. E. and Elek-Fisk, E. (2003). Methods for handling missing data.
Handbook of Psychology. Volume One, 87–114. Available as a html file through a Masaryk University
computer: http://onlinelibrary.wiley.com/doi/10.1002/0471264385.wei0204/full * da
•
•Little, T. D., & Rhemtulla, M. (2013). Planned missing data designs for developmental researchers.
Child Development Perspectives, 7, 199–204. Will be made available online as course material. #
•
•Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art.
Psychological Methods,7, 147–177. Will be made available online as course material. * da
•
•van Buuren, S. & Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained
equations in R. Journal of Statistical Software, 45. ce