Michael Underwood
Elizabeth Subject-Scott
Javier Estrada
X are the completely observed variables.
Y are the partly missing variables.
Z is the component of the cause of missingness unrelated to X and Y.
R is the missingness.
When type is MCAR and the amount of missing data is small, deletion can be used.
2 Types
Listwise deletion occurs when the entire observation is removed.
Pairwise deletion occurs when only the missing values are removed, so each analysis uses every observation that is complete for the variables involved.
Deleting missing data can lead to the loss of important information regarding your dataset and is not recommended.
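As a sketch, both deletion strategies are one-liners in R (assuming a data frame named house, as used later in this project):

```r
# Listwise deletion: drop every row that contains any NA
house_listwise <- na.omit(house)

# Pairwise deletion: each pairwise statistic uses all rows complete
# for that pair of variables, so different cells may use different n
cor(house, use = "pairwise.complete.obs")
```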
Imputation
2 Types
Single Imputation
Multiple Imputation
Methods include:
Regression Imputation is based on a linear regression model. Missing data is randomly drawn from a conditional distribution when variables are continuous and from a logistic regression model when they are categorical.
Predictive Mean Matching is also based on a linear regression model. Instead of random draws from a conditional distribution, each missing value is replaced by an observed value from a donor case whose predicted value is closest to the predicted value for the missing case.
Hot Deck (HD) imputation is when a missing value is replaced by an observed response of a similar unit, also known as the donor. It can be either random or deterministic, which is based on a metric or value. It does not rely on model fitting.
Stochastic Regression (SR) Imputation is an extension of regression imputation. The process is the same, but a residual term drawn from the normal distribution of the regression residuals is added to each imputed value. This maintains the variability of the data.
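Two of these methods can be sketched directly in R: VIM's hotdeck() performs hot deck imputation, and mice's "norm.nob" method (regression prediction plus a random residual) implements stochastic regression imputation. The call shown is illustrative, not the project's code:

```r
library(VIM)
library(mice)

# Hot deck: replace each NA with an observed value from a donor row
house_hd <- hotdeck(house)

# Stochastic regression: prediction plus random residual noise,
# run as a single imputation (m = 1) with mice's "norm.nob" method
house_sr <- complete(mice(house, m = 1, method = "norm.nob", seed = 1))
```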
\[{\overline{\theta}}_{M} = \displaystyle \frac{1}{M}\sum_{m = 1}^{M} {\hat{\theta}}_{m}\]
\[{\hat{\Phi}}_{M} = {\overline{\phi}}_{M} + (1+\displaystyle \frac{1}{M})B_{M}\]
This is the total variance and is made up of two parts: the average within imputation variance and the between imputation variance (with a correction factor).
The average within imputation variance (normal standard error) is:
\[{\overline{\phi}}_{M} = \displaystyle \frac{1}{M}\sum_{m=1}^{M}{SE}_{m}^{2}\]
(It will be large in small samples and small in large samples.)
\[B_{M} = \displaystyle \frac{1}{M-1}\sum_{m = 1}^{M}({\hat{\theta}}_{m}-{\overline{\theta}}_{M})^{2}\]
When the between variance is greater than the within variance, then greater efficiency occurs and more accurate estimates can be achieved by increasing M.
When the within variance is greater than the between variance, then little is gained by increasing M.
To calculate the total standard error:
\[SE_{pooled} = \sqrt{{\hat{\Phi}}_{M}}\]
\[Wald_{pooled} = \frac{({\overline{\theta}}_{M} - {\theta_{0}})^2}{{\hat{\Phi}}_{M}}\]
\[df = (M-1)\left(1+\frac{{\overline{\phi}}_{M}}{(1+\frac{1}{M})B_{M}}\right)^{2}\]
The critical value used for confidence intervals is
\[t_{df, 1-\alpha/2}\]
and the corresponding F statistic is
\[F_{1,df} = t^{2}_{df, 1-\alpha/2}\]
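The pooling rules above can be sketched numerically in R; the estimates and standard errors below are made-up values for illustration:

```r
# Hypothetical estimates and standard errors from M = 5 imputed datasets
theta_hat <- c(1.02, 0.97, 1.05, 1.00, 0.98)
se_hat    <- c(0.10, 0.11, 0.10, 0.12, 0.10)

M         <- length(theta_hat)
theta_bar <- mean(theta_hat)                           # pooled estimate
W         <- mean(se_hat^2)                            # average within-imputation variance
B         <- sum((theta_hat - theta_bar)^2) / (M - 1)  # between-imputation variance
total_var <- W + (1 + 1/M) * B                         # total variance
se_pooled <- sqrt(total_var)                           # pooled standard error
df        <- (M - 1) * (1 + W / ((1 + 1/M) * B))^2     # degrees of freedom
```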
Observed data follow a normal distribution. If they do not, a transformation is needed first.
Missing data are classified as MAR: the probability that a value is missing depends only on observed values, not on unobserved values.
The parameters \({\theta}\) of the data model and the parameters \({\phi}\) of the model for the missing values are distinct. That is, knowing the values of \({\theta}\) does not provide any information about \({\phi}\).
Step 1: Impute missing data
Step 2: Run regression models on all imputation sets
Step 3: Pool regression results into one regression result
MICE = Multivariate Imputation by Chained Equations.
Implemented by the mice() function from the mice package in R.
The method creates multiple imputations (replacement values) for multivariate missing data.
The MICE algorithm can impute mixes of continuous, binary, unordered categorical and ordered categorical data.
This is done using the built-in imputation methods of mice(), such as:
predictive mean matching (numeric data)
logistic regression imputation (binary data, factor with 2 levels)
polytomous regression imputation for unordered categorical data (factor > 2 levels)
proportional odds model (ordered factor, > 2 levels).
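mice() picks one of these methods per column from the column's type; a sketch of inspecting and overriding the defaults (make.method() and the method argument are real mice features, while the column choice here is illustrative):

```r
library(mice)

# Start from the default method mice would choose for each column type
meth <- make.method(house)

# Override individual columns if desired (illustrative choice)
meth["price"] <- "pmm"   # numeric -> predictive mean matching

imp <- mice(house, method = meth, m = 5, seed = 1337)
```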
King County, Seattle Home Sale Prices between 2014 and 2015
Contains the sale prices of 21,613 houses
The original dataset contained 21 columns with various selling attributes.
For the purpose of this project, we have condensed the variables to the following 4:
price bedrooms bathrooms sqft_living
0 0 0 0
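A sketch of that setup, assuming the raw file is the Kaggle King County CSV (the filename kc_house_data.csv is an assumption):

```r
# Load the full dataset and keep only the four variables of interest
original <- read.csv("kc_house_data.csv")   # filename is an assumption
house <- original[, c("price", "bedrooms", "bathrooms", "sqft_living")]

# Confirm the raw data are complete (all zeros, as shown above)
colSums(is.na(house))
```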
Next, we randomly replace the desired amount of values with NA to mimic missing data.
We will assign 200 NA values to each of the following variables: bedrooms, bathrooms, and sqft_living.
And 100 NA values to the price variable.
price bedrooms bathrooms sqft_living
100 200 200 200
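One way to inject those NA values (a sketch; the exact indices sampled in the project may differ):

```r
set.seed(1337)

# 200 NAs each in bedrooms, bathrooms, and sqft_living
for (v in c("bedrooms", "bathrooms", "sqft_living")) {
  house[sample(nrow(house), 200), v] <- NA
}

# 100 NAs in price
house[sample(nrow(house), 100), "price"] <- NA

colSums(is.na(house))   # 100, 200, 200, 200
```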
library(VIM)
aggr_plot <- aggr(house, col=c('navyblue','red'), numbers=TRUE, sortVars=TRUE, labels=names(house), cex.axis=.7, gap=3, ylab=c("Histogram of missing data","Pattern"))
Variables sorted by number of missings:
Variable Count
bedrooms 0.009253690
bathrooms 0.009253690
sqft_living 0.009253690
price 0.004626845
Since we are missing about 3% of our data, we need to perform at least 3 imputations. This will be done using the mice() function:
Since 5 is the default, we will use that (the m parameter can be used to adjust the number of imputations).
The set.seed will be given the value 1337 (any number can be used here) to retrieve the same results each time the multiple imputation is performed.
# impute the data 5 times (default)
imp = mice(data = house, m = 5, seed = 1337, defaultMethod = c("pmm", "logreg", "polyreg", "polr"))
iter imp variable
1 1 price bedrooms bathrooms sqft_living
1 2 price bedrooms bathrooms sqft_living
1 3 price bedrooms bathrooms sqft_living
1 4 price bedrooms bathrooms sqft_living
1 5 price bedrooms bathrooms sqft_living
2 1 price bedrooms bathrooms sqft_living
2 2 price bedrooms bathrooms sqft_living
2 3 price bedrooms bathrooms sqft_living
2 4 price bedrooms bathrooms sqft_living
2 5 price bedrooms bathrooms sqft_living
3 1 price bedrooms bathrooms sqft_living
3 2 price bedrooms bathrooms sqft_living
3 3 price bedrooms bathrooms sqft_living
3 4 price bedrooms bathrooms sqft_living
3 5 price bedrooms bathrooms sqft_living
4 1 price bedrooms bathrooms sqft_living
4 2 price bedrooms bathrooms sqft_living
4 3 price bedrooms bathrooms sqft_living
4 4 price bedrooms bathrooms sqft_living
4 5 price bedrooms bathrooms sqft_living
5 1 price bedrooms bathrooms sqft_living
5 2 price bedrooms bathrooms sqft_living
5 3 price bedrooms bathrooms sqft_living
5 4 price bedrooms bathrooms sqft_living
5 5 price bedrooms bathrooms sqft_living
We can check the quality of the imputations by running a strip plot, which is a single axis scatter plot.
It will show the distribution of the imputed values over the observed values.
Blue points are observed values and red points are imputed values. Ideally, we want the imputations to be values that could have been observed had the data not been missing.
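mice provides a lattice-based stripplot() method for imputed objects; a sketch:

```r
library(mice)

# Imputed (red) vs. observed (blue) values of price, by imputation number
stripplot(imp, price ~ .imp, pch = 20, cex = 1.2)

# Or all imputed variables at once
stripplot(imp, pch = 20, cex = 1.2)
```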
Next, we will pool the results from our regressions to arrive at estimates that will properly account for the missing data.
It will give us the estimate, standard error, test statistic, degrees of freedom, and the p-value for each variable.
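The fit object pooled below comes from running the regression on each imputed dataset with with(), then combining the results with pool():

```r
# Step 2: fit the model on each of the 5 imputed datasets
fit <- with(imp, lm(price ~ bedrooms + bathrooms + sqft_living))

# Step 3: pool the 5 sets of estimates using Rubin's rules
summary(pool(fit))
```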
term estimate std.error statistic df p.value
1 (Intercept) 75043.7775 6952.563287 10.793685 14431.278 4.677205e-27
2 bedrooms -57930.4966 2357.322483 -24.574702 7683.745 1.936143e-128
3 bathrooms 7624.2182 3531.505441 2.158914 11516.540 3.087739e-02
4 sqft_living 309.7992 3.114475 99.470779 8431.335 0.000000e+00
# fit original, complete dataset
og_lm = lm(price~bedrooms+bathrooms+sqft_living, data = original)
# compare imputed dataset to the original dataset
summary(list("Imputed" = pool(fit), "Original" = og_lm), title = "Comparison", statistic = "p.value", estimate = "estimate", gof_omit = 'IC|Log|Adj|F|RMSE')
         Length Class Mode
Imputed   4     mipo  list
Original 12     lm    list
The estimates from the multiple imputation are very close to the estimates from the original dataset. The p-values are similar and the differences are minimal, with the exception of bathrooms, which is off slightly.
These results indicate a high accuracy of the imputation process. We can conclude that multiple imputation by chained equations is a reliable method for imputing missing data.